[00:00:04] <jouncebot>	 yurik: Respected human, time to deploy Graphoid service deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T0000). Please do the needful.
[00:00:10] <Reedy>	 !lol deploying rand();
[00:00:27] <grrrit-wm>	 (PS3) Dzahn: labs kvm ssl cert monitoring: fix it [puppet] - https://gerrit.wikimedia.org/r/249328 (https://phabricator.wikimedia.org/T116332)
[00:01:25] <grrrit-wm>	 (CR) Dzahn: [C: 2] labs kvm ssl cert monitoring: fix it [puppet] - https://gerrit.wikimedia.org/r/249328 (https://phabricator.wikimedia.org/T116332) (owner: Dzahn)
[00:02:44] <tgr>	 yurik: sorry, I'm overflowing the SWAT window
[00:02:59] <yurik>	 tgr, no worries, take your time, ping me when done
[00:06:33] <wikibugs>	 operations, Labs, Labs-Infrastructure, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760278 (Dzahn) after this. on labvirt1001, the plugin got created:  Notice: /Stage[main]/Openstack::Nova::Compute/File[/usr/local/lib/nagios/plugi...
[00:09:56] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[00:10:14] <mutante>	 yep, i'll check that
[00:10:20] <mutante>	 about to upload a change anyways
[00:11:18] <grrrit-wm>	 (PS1) Dzahn: labs kvm ssl cert monitoring: fix nrpe command [puppet] - https://gerrit.wikimedia.org/r/249331 (https://phabricator.wikimedia.org/T116332)
[00:11:22] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] labs kvm ssl cert monitoring: fix nrpe command [puppet] - https://gerrit.wikimedia.org/r/249331 (https://phabricator.wikimedia.org/T116332) (owner: Dzahn)
[00:11:35] <grrrit-wm>	 (PS2) Dzahn: labs kvm ssl cert monitoring: fix nrpe command [puppet] - https://gerrit.wikimedia.org/r/249331 (https://phabricator.wikimedia.org/T116332)
[00:12:11] <grrrit-wm>	 (CR) Dzahn: [C: 2] labs kvm ssl cert monitoring: fix nrpe command [puppet] - https://gerrit.wikimedia.org/r/249331 (https://phabricator.wikimedia.org/T116332) (owner: Dzahn)
[00:15:34] <mutante>	 yep, gonna be fixed after next puppet run
[00:20:46] <icinga-wm>	 PROBLEM - HHVM rendering on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:22:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:23:36] <icinga-wm>	 RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:24:07] <icinga-wm>	 PROBLEM - DPKG on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:07] <icinga-wm>	 PROBLEM - puppet last run on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:19] <grrrit-wm>	 (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/1095/" [puppet] - https://gerrit.wikimedia.org/r/249038 (owner: Dzahn)
[00:24:27] <grrrit-wm>	 (PS3) Dzahn: mariadb: 32 lint fixes [puppet] - https://gerrit.wikimedia.org/r/249038
[00:24:36] <icinga-wm>	 PROBLEM - nutcracker port on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:47] <icinga-wm>	 PROBLEM - configured eth on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:47] <icinga-wm>	 PROBLEM - RAID on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:56] <icinga-wm>	 PROBLEM - SSH on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:24:58] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:03] <yurik>	 tgr, still deploying?
[00:25:08] <icinga-wm>	 PROBLEM - salt-minion processes on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:09] <icinga-wm>	 PROBLEM - dhclient process on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:09] <icinga-wm>	 PROBLEM - HHVM processes on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:37] <icinga-wm>	 PROBLEM - nutcracker process on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:38] <icinga-wm>	 PROBLEM - Disk space on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:57] <Krenair>	 yurik, according to `w`, yes
[00:26:15] <yurik>	 Krenair, `w` ?
[00:26:20] <Krenair>	 the w command
[00:26:21] <Krenair>	 on tin
[00:26:33] <Krenair>	 shows you who is logged in and what they are doing
[00:26:44] <yurik>	 ah, good, going to check it out ))
[00:27:03] <yurik>	 safe to run it i presume )
[00:27:49] <Krenair>	 yes
[00:27:51] <yurik>	 ooo, such a sweet command, thx ))
[00:27:56] <mutante>	 !log powercycling unresponsive mw1127
[00:27:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:29:00] <icinga-wm>	 PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:29:10] <icinga-wm>	 PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:29:39] <icinga-wm>	 PROBLEM - Host mw1127 is DOWN: PING CRITICAL - Packet loss = 100%
[00:29:39] <icinga-wm>	 PROBLEM - RAID on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:29:49] <icinga-wm>	 PROBLEM - SSH on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:29:51] <icinga-wm>	 PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:29:58] <ori>	 hmmmmmm
[00:30:00] <icinga-wm>	 RECOVERY - nutcracker process on mw1127 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:30:09] <icinga-wm>	 PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:30:10] <icinga-wm>	 RECOVERY - Host mw1127 is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms
[00:30:19] <ori>	 mw1135 as well now?
[00:30:19] <icinga-wm>	 RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures
[00:30:29] <icinga-wm>	 RECOVERY - HHVM processes on mw1127 is OK: PROCS OK: 6 processes with command name hhvm
[00:30:30] <icinga-wm>	 RECOVERY - salt-minion processes on mw1127 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:30:39] <icinga-wm>	 RECOVERY - nutcracker port on mw1127 is OK: TCP OK - 0.000 second response time on port 11212
[00:30:40] <icinga-wm>	 PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:30:56] <wikibugs>	 operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#1760346 (bd808) >>! In T80395#1760088, @Dzahn wrote: > @demon  @bd808 what should we do with this ticket? reject? resolve once it uses etcd?   I'd vote for closing it when {T115899} or something similar is done. I'd gues...
[00:30:59] <icinga-wm>	 RECOVERY - RAID on mw1127 is OK: OK: no RAID installed
[00:30:59] <icinga-wm>	 RECOVERY - configured eth on mw1127 is OK: OK - interfaces up
[00:30:59] <icinga-wm>	 PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:31:09] <icinga-wm>	 RECOVERY - SSH on mw1127 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[00:31:15] <mutante>	 ori: yes, one seemed normal, 2 starts to be suspicious
[00:31:19] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[00:31:29] <icinga-wm>	 RECOVERY - dhclient process on mw1127 is OK: PROCS OK: 0 processes with command name dhclient
[00:31:29] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1127 is OK: OK: nf_conntrack is 4 % full
[00:31:35] <mutante>	 well, there's the icinga config fix too
[00:31:39] <icinga-wm>	 PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:31:40] <icinga-wm>	 RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.067 second response time
[00:32:00] <icinga-wm>	 PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:32:10] <icinga-wm>	 RECOVERY - DPKG on mw1127 is OK: All packages OK
[00:32:20] <icinga-wm>	 PROBLEM - puppet last run on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:32:20] <icinga-wm>	 RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 64983 bytes in 1.778 second response time
[00:32:28] <mutante>	 output of mw1135 console:
[00:32:29] <icinga-wm>	 RECOVERY - Disk space on mw1127 is OK: DISK OK
[00:32:42] <mutante>	 init: ssh main p
[00:32:43] <mutante>	 Ubuntu 14.04.2 LTS mw1135 ttyS1
[00:32:49] <icinga-wm>	 PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:33:03] <mutante>	 OOM killer
[00:33:13] <mutante>	 [27867198.620200] Out of memory: Kill process 9213 (hhvm) score 880 or sacrifice child
[00:33:16] <mutante>	 [27867198.627855] Killed process 9229 (hhvm) total-vm:346788kB, anon-rss:33344kB, file-rss:196kB
[00:33:19] <mutante>	 ori: ^
[00:33:32] * ori looks at app server memory usage on ganglia
[00:33:43] <Krenair>	 'or sacrifice child' o.O
[00:33:43] <tgr>	 yurik: yes, waiting for scap
[00:33:46] <mutante>	 i see that without logging in, on the login screen
[00:34:00] <icinga-wm>	 RECOVERY - Disk space on mw1135 is OK: DISK OK
[00:34:10] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:34:23] <mutante>	 predicts recoveries
[00:34:30] <icinga-wm>	 RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm
[00:34:30] <icinga-wm>	 RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:34:50] <icinga-wm>	 RECOVERY - configured eth on mw1135 is OK: OK - interfaces up
[00:35:05] <ori>	 nothing out of the ordinary in ganglia
[00:35:09] <icinga-wm>	 RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212
[00:35:09] <icinga-wm>	 RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient
[00:35:21] <mutante>	 !log mw1135 temp. unresponsive - OOM killer killing hhvm 
[00:35:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:35:30] <icinga-wm>	 RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:35:36] <ori>	 thanks mutante
[00:35:59] <icinga-wm>	 RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures
[00:36:00] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1135 is OK: OK: nf_conntrack is 0 % full
[00:36:09] <icinga-wm>	 RECOVERY - RAID on mw1135 is OK: OK: no RAID installed
[00:36:21] <icinga-wm>	 RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[00:36:42] <mutante>	 andrewbogott: fixed. https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=kvm+ssl
[00:37:01] <wikibugs>	 operations, Labs, Labs-Infrastructure, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760353 (Dzahn) works now:  https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=kvm+ssl
[00:37:04] <logmsgbot>	 !log tgr@tin Finished scap: Updating MediaViewer with r246112 (duration: 43m 18s)
[00:37:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:37:21] <wikibugs>	 operations, HTTPS, Icinga, Monitoring: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1760355 (Dzahn)
[00:37:22] <wikibugs>	 operations, Labs, Labs-Infrastructure, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760354 (Dzahn) Open>Resolved
[00:37:49] <tgr>	 yurik: done, sorry for the wait
[00:37:57] <yurik>	 tgr, no worries )
[00:38:39] <icinga-wm>	 RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.159 second response time
[00:39:00] <icinga-wm>	 RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 64974 bytes in 1.325 second response time
[00:39:12] <wikibugs>	 operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#1760356 (Dzahn) Sounds good, thank you @bd808. And i see that is already added as a blocker. Great.
[00:43:59] <grrrit-wm>	 (PS3) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[00:44:31] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852 (owner: Dzahn)
[00:45:03] <grrrit-wm>	 (PS4) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[00:46:17] <grrrit-wm>	 (PS5) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[00:49:15] <grrrit-wm>	 (PS6) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[00:49:57] <grrrit-wm>	 (CR) Dzahn: "made this smaller for easier review - these are all just to get to re-enable that check. not too many to fix globally" [puppet] - https://gerrit.wikimedia.org/r/243852 (owner: Dzahn)
[00:51:24] <grrrit-wm>	 (PS2) Dzahn: lint: double quoted strings pt.2 [puppet] - https://gerrit.wikimedia.org/r/243853
[00:51:58] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] lint: double quoted strings pt.2 [puppet] - https://gerrit.wikimedia.org/r/243853 (owner: Dzahn)
[00:53:12] <grrrit-wm>	 (PS2) Dzahn: labs restbase: Add en.wikivoyage.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/249319 (owner: Alex Monk)
[00:54:20] <grrrit-wm>	 (CR) Dzahn: "uhm.. "No file(s) found for import of '../../../manifests/nagios.pp" ??" [puppet] - https://gerrit.wikimedia.org/r/243853 (owner: Dzahn)
[00:55:18] <grrrit-wm>	 (CR) Dzahn: [C: 2] labs restbase: Add en.wikivoyage.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/249319 (owner: Alex Monk)
[00:55:25] <yurik>	 !log deployed graphoid - https://gerrit.wikimedia.org/r/#/c/249324/
[00:55:29] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:07:05] <grrrit-wm>	 (CR) Hydriz: [C: 1] copy pagecounts-al-sites files over to labs from datasets [puppet] - https://gerrit.wikimedia.org/r/249175 (https://phabricator.wikimedia.org/T93317) (owner: ArielGlenn)
[02:34:31] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 33s)
[02:34:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:39:23] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-28 02:39:23+00:00
[02:39:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:00:34] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.4/cache/l10n: l10nupdate for 1.27.0-wmf.4 (duration: 06m 00s)
[03:00:42] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:03:34] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.4) at 2015-10-28 03:03:34+00:00
[03:03:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:50:35] <grrrit-wm>	 (PS1) Dzahn: openstack: add links to docs for components, lint [puppet] - https://gerrit.wikimedia.org/r/249342
[03:53:18] <grrrit-wm>	 (PS2) Dzahn: openstack: add links to docs for components, lint [puppet] - https://gerrit.wikimedia.org/r/249342
[04:01:36] <grrrit-wm>	 (CR) Dzahn: "done here, to fix compiler run:" [puppet] - https://gerrit.wikimedia.org/r/247217 (owner: Muehlenhoff)
[04:04:40] <grrrit-wm>	 (CR) Dzahn: [C: 1] "rebuilt - compiles as noop now" [puppet] - https://gerrit.wikimedia.org/r/247217 (owner: Muehlenhoff)
[04:14:11] <grrrit-wm>	 (PS1) Dzahn: interface: some lint fixes [puppet] - https://gerrit.wikimedia.org/r/249344
[04:15:36] <grrrit-wm>	 (CR) Dzahn: "@ori feel like taking this? this is for mediawiki module what we did for most (all?) misc services already" [puppet] - https://gerrit.wikimedia.org/r/225552 (owner: Gergő Tisza)
[04:26:14] <grrrit-wm>	 (PS1) Dzahn: move mw jobqueue monitoring class out of misc [puppet] - https://gerrit.wikimedia.org/r/249345
[04:32:32] <grrrit-wm>	 (PS2) Dzahn: move mw jobqueue monitoring class out of misc [puppet] - https://gerrit.wikimedia.org/r/249345
[04:33:04] <grrrit-wm>	 (PS3) Dzahn: move mw jobqueue monitoring class out of misc [puppet] - https://gerrit.wikimedia.org/r/249345
[04:43:34] <grrrit-wm>	 (PS1) Dzahn: kill misc/fundraising.pp, move to role logging [puppet] - https://gerrit.wikimedia.org/r/249347
[04:44:39] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:46:30] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:48:08] <grrrit-wm>	 (PS2) Dzahn: kill misc/fundraising.pp, move to role logging [puppet] - https://gerrit.wikimedia.org/r/249347
[04:48:10] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[04:48:19] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:48:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[04:49:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[04:53:52] <wikibugs>	 operations, Analytics-Backlog, Datasets-General-or-Unknown: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1760874 (Dzahn) yep, you are right. i don't know why i thought it was a duplicate, it's not.
[04:56:50] <wikibugs>	 operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1760878 (mmodell) Open>declined a:mmodell fatalmonitor is a very rudimentary tool. #scap3 should include such information (See {T110068}), but I don't think fatalmonitor is a good place to add such fu...
[04:59:10] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:10] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:10] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:10] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:01:24] <wikibugs>	 Ops-Access-Requests, operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1760888 (Dzahn) NEW
[05:02:35] <wikibugs>	 Ops-Access-Requests, operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1760895 (Dzahn)
[05:02:49] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[05:02:49] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[05:02:50] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[05:02:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:10] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:11] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:11] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:11] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:11] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:09:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[05:09:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[05:11:13] <wikibugs>	 operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1760913 (Legoktm) declined>Open Until scap3 is actually in use, this is a valid feature enhancement request for fatalmonitor. If/once we're no longer using fatalmonitor, this task can be declined.
[05:11:30] <wikibugs>	 operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1760915 (Legoktm) a:mmodell>None
[05:15:14] <legoktm>	 !log ran mwscript updateSpecialPages.php --wiki=testwiki --only=GadgetUsage on terbium
[05:15:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:15:20] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:15:20] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:22:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[05:22:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[05:22:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[05:22:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[05:22:31] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[05:27:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:00] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:00] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:00] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:49] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[05:31:40] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:35:19] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:36:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[05:37:00] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[05:38:49] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[05:38:49] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[05:44:10] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:10] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:10] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:45:50] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[05:47:49] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:51:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[05:51:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[05:51:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[05:51:20] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:51:20] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[05:56:40] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:40] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:41] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:41] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:58:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[05:58:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[06:04:00] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[06:04:00] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[06:04:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[06:04:09] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:09] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:09:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[06:09:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[06:09:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[06:09:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[06:09:30] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:11:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[06:14:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:16:40] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[06:16:42] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[06:16:42] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[06:16:42] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:06] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Oct 28 06:22:06 UTC 2015 (duration 22m 5s)
[06:22:09] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:09] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:09] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:09] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:25:31] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[06:25:31] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[06:25:31] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[06:25:39] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[06:25:39] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[06:25:39] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[06:28:58] <icinga-wm>	 PROBLEM - High load average on labstore1002 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0]
[06:29:30] <icinga-wm>	 PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: puppet fail
[06:30:30] <icinga-wm>	 PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:40] <icinga-wm>	 PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:41] <icinga-wm>	 PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:40] <icinga-wm>	 PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:59] <icinga-wm>	 PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:20] <icinga-wm>	 PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:30] <icinga-wm>	 PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:41] <icinga-wm>	 PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:09] <icinga-wm>	 PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:10] <icinga-wm>	 PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:33:40] <icinga-wm>	 PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:34:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:34:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:34:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:34:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:35:01] <icinga-wm>	 RECOVERY - High load average on labstore1002 is OK: OK: Less than 50.00% above the threshold [16.0]
[06:43:41] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[06:43:41] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[06:43:49] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:45:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[06:47:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[06:47:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[06:49:09] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:50:51] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:52:41] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:40] <icinga-wm>	 RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:56:49] <icinga-wm>	 RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:56:51] <icinga-wm>	 RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:11] <icinga-wm>	 RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:57:20] <icinga-wm>	 RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:57:30] <icinga-wm>	 RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:30] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:57:31] <icinga-wm>	 RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:57:41] <icinga-wm>	 RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:01] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:58:01] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:58:01] <icinga-wm>	 RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:58:10] <icinga-wm>	 RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:58:11] <icinga-wm>	 RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:59:51] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[07:00:32] <wikibugs>	 6operations, 6Analytics-Backlog, 10Datasets-General-or-Unknown, 10Wikidata: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1761030 (10Addshore)
[07:02:17] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1761034 (10Addshore)
[07:02:36] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1758419 (10Addshore) Amended per @Krenair
[07:03:31] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[07:05:20] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[07:08:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[07:09:00] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[07:13:00] <icinga-wm>	 PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail
[07:14:29] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:04] <wikibugs>	 6operations, 10TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1761047 (10Joe) Hi,  I don't think we really need 6 videoscalers, or at least I don't see a compelling reason for that given:  1) The current videoscalers are already...
[07:16:10] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:19:49] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:21:01] <wikibugs>	 6operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1761048 (10mmodell) @legoktm: scap3 is in use and T110068 should be done in the near future.  Have you looked at the code for fatalmonitor? It's a series of unix commands piped together and wrapped in `watch` -...
[07:21:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[07:21:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[07:21:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[07:21:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[07:21:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[07:24:55] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] role::deployment: move to role module [puppet] - 10https://gerrit.wikimedia.org/r/249090 (owner: 10Giuseppe Lavagetto)
[07:26:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:26:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:26:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:26:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:31:39] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "1) We did load the apache 2.4 module that allow using 2.2 syntax so this just adds complexity with no gain, AFAICT" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza)
[07:31:50] <icinga-wm>	 PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:32:30] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:32:30] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:33:45] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] role::deployment::server: reorganize code [puppet] - 10https://gerrit.wikimedia.org/r/249091 (owner: 10Giuseppe Lavagetto)
[07:38:30] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] "noop according to the puppet compiler" [puppet] - 10https://gerrit.wikimedia.org/r/249092 (owner: 10Giuseppe Lavagetto)
[07:40:00] <icinga-wm>	 RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:42:34] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] deployment::mediawiki: rename wikitech::wiki::password class [puppet] - 10https://gerrit.wikimedia.org/r/249093 (owner: 10Giuseppe Lavagetto)
[07:44:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[07:44:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[07:44:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[07:44:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[07:44:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[07:45:00] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[07:45:55] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] role::deployment: remove test role [puppet] - 10https://gerrit.wikimedia.org/r/249094 (owner: 10Giuseppe Lavagetto)
[07:46:16] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] role::deployment::server: drop mod_dav [puppet] - 10https://gerrit.wikimedia.org/r/249102 (owner: 10Giuseppe Lavagetto)
[07:54:20] <icinga-wm>	 PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:54:50] <icinga-wm>	 PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:55:41] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:04:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:04:51] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:04:51] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:06:39] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[08:08:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[08:11:29] <wikibugs>	 6operations, 10TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1761083 (10brion) @Joe I'm planning to re-run all the Ogg transcodes for improved quality and to fix a bunch of old ones that broke; this will eat all the CPU time for...
[08:12:00] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[08:13:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:13:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:17:29] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:19:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[08:19:10] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[08:19:10] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[08:19:10] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[08:24:30] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:24:31] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:24:31] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:24:31] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:28:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[08:28:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[08:29:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[08:31:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[08:33:41] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:33:41] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:35:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[08:35:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[08:35:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[08:40:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:40:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:40:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:40:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:40:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:40:55] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: r::mw::maintenance: include role::mediawiki::common [puppet] - 10https://gerrit.wikimedia.org/r/249108 (https://phabricator.wikimedia.org/T116728) 
[08:43:22] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] r::mw::maintenance: include role::mediawiki::common [puppet] - 10https://gerrit.wikimedia.org/r/249108 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto)
[08:44:29] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[08:46:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[08:46:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[08:46:10] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[08:46:10] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[08:46:10] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[08:46:56] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: mediawiki: group general monitoring scripts in a single role [puppet] - 10https://gerrit.wikimedia.org/r/249109 
[08:51:00] <wikibugs>	 6operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1761129 (10MoritzMuehlenhoff) I had checked the status of a few long-time depooled mw* servers (and re-enabled a few) when I made the flip to enable ferm.   From my IRC logs Ori confirmed that mw1169 can be re-poo...
[08:51:31] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:51:31] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:51:51] <moritzm>	 _joe_: ^ can you amend T116256 with the status for mw1161?
[08:52:24] <_joe_>	 moritzm: uh? I have no idea I guess?
[08:53:05] <_joe_>	 if there is a hardware ticket I opened, I guess the status is what chris stated there
[08:53:15] <moritzm>	 ah, sorry. I thought "retroactive commit by joe" referred to your own changes
[08:53:21] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:53:21] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:53:21] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:53:21] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:53:22] <_joe_>	 nope
[08:53:33] <_joe_>	 I found it depooled with no commit, probably
[08:54:14] <grrrit-wm>	 (03PS3) 10Giuseppe Lavagetto: mediawiki: group general monitoring scripts in a single role [puppet] - 10https://gerrit.wikimedia.org/r/249109 
[08:54:15] <_joe_>	 but I can take a look, yes
[08:55:03] <moritzm>	 there's already a ticket by Rob for that, it's probably just a case of someone forgetting to re-pool it
[08:55:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[08:55:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[08:55:10] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[08:55:14] <moritzm>	 can't find a Phab ticket for hardware problems
[08:55:26] <_joe_>	 I tend to err on the side of caution
[08:55:44] <_joe_>	 btw, rob has also a ticket to have 6 videoscalers, which I just shot down
[08:56:06] <_joe_>	 well, brion created it, and I think it is a bit of an overkill
[08:56:40] <jynus>	 thanks thanks thanks
[08:56:46] <moritzm>	 indeed
[08:56:49] <jynus>	 I thought I was going crazy
[08:57:21] <_joe_>	 we can go with 3 as we are now (1.5x what we usually had) or with 4 (2x) if we really need it
[08:58:14] <_joe_>	 jynus: regarding what?
[08:58:49] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[08:58:49] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[08:58:51] <grrrit-wm>	 (03PS4) 10Giuseppe Lavagetto: mediawiki: group general monitoring scripts in a single role [puppet] - 10https://gerrit.wikimedia.org/r/249109 
[08:59:49] <jynus>	 sorry, I got too excited: post-commit hook on palladium not updating strontium adequately
[08:59:55] <jynus>	 giving random puppet errors
[09:00:22] <jynus>	 I will create a ticket, but at least I identified it
[09:00:38] <jynus>	 which is good because it is actionable
[09:00:40] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:40] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:52] <_joe_>	 jynus: sometimes that has happened in the past - we failed to fix it though
[09:01:06] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: group general monitoring scripts in a single role [puppet] - 10https://gerrit.wikimedia.org/r/249109 (owner: 10Giuseppe Lavagetto)
[09:01:10] <jynus>	 it is ok, I can puppet-merge on strontium
[09:01:39] <jynus>	 but I was getting confused because the simplest of changes failed
[09:02:09] <jynus>	 as I said, knowing the why is 90% of the problem solving
[09:02:20] <jynus>	 documenting a problem is an ok solution
[09:02:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[09:02:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[09:02:52] <jynus>	 what I feel is that after so many months I continue discovering things
[09:03:49] <brion>	 _joe_: so do I have to start the batch recompression first to demonstrate need for CPU time or something?
[09:04:19] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:04:19] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:04:19] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:07:51] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:07:51] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:08:29] <jynus>	 Is there a reason to have the mariadb module on its own repo instead of operations/puppet?
[09:09:18] <jynus>	 the same way I do not think there is a reason to have wikimedia-mariadb deb repo
[09:09:43] <_joe_>	 jynus: you might want to ask ottomata about the separate git repo
[09:09:50] <jynus>	 ok, thanks
[09:10:14] <jynus>	 but that is strange- he only recently started using that class
[09:11:10] <jynus>	 also, it should not be named mariadb, I suppose it only had that name to differentiate from the original class
[09:11:16] <_joe_>	 jynus: the reason might be other projects used that puppet class
[09:11:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[09:12:08] <jynus>	 but it still gets merged into the mariadb module on the same puppet place :-/
[09:12:45] <jynus>	 please, ignore me _joe_ and keep doing the great work you are doing, do not let me bother you
[09:12:57] <jynus>	 :-)
[09:15:41] <_joe_>	 !log preparing to reimage mw1152, disabling puppet, scheduling downtime.
[09:15:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:16:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:29:39] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[09:33:20] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[09:33:20] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[09:33:20] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[09:33:20] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[09:33:21] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[09:38:51] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:38:51] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:43:02] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/248960 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans)
[09:43:09] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/248960 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans)
[09:43:28] <moritzm>	 mobrovac, godog: shall we start? I'll disable puppet on restbase100[1-6]
[09:43:42] <godog>	 moritzm: +1
[09:43:51] <mobrovac>	 kk
[09:43:56] <godog>	 moritzm: let me force a run first actually
[09:44:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[09:44:20] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:44:21] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:44:34] <moritzm>	 godog: ok, I'll reenable puppet then
[09:44:58] <moritzm>	 done
[09:46:30] <godog>	 moritzm: ok! good to go
[09:47:55] <moritzm>	 wrt "disable Cassandra and RESTBase on boot on all boxes", do they use systemd units? is that a simple "systemctl disable foo.unit"?
[09:48:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[09:48:52] <godog>	 moritzm: yep all systemd
[09:49:10] <godog>	 moritzm: actually no, cassandra is systemd but restbase isn't
[09:49:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:49:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:51:40] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[09:53:40] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:40] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:55:29] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[09:55:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[09:56:44] <godog>	 I've silenced those btw
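The PROBLEM/RECOVERY churn above is easier to reason about when tallied per host. The following is a minimal sketch, assuming alert lines keep the `[HH:MM:SS] <icinga-wm> STATE - check on host is ...` shape seen in this log; the names `ALERT_RE` and `count_flaps` are hypothetical and not part of any Icinga tooling.

```python
import re
from collections import Counter

# Matches icinga-wm alert lines in the shape they have in this log.
# The check name is matched lazily so that names containing " on " still parse.
ALERT_RE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\]\s+<icinga-wm>\s+"
    r"(?P<state>PROBLEM|RECOVERY) - (?P<check>.+?) on (?P<host>\S+) is "
)


def count_flaps(lines, check="Restbase endpoints health"):
    """Count PROBLEM notifications per host for a single check name."""
    flaps = Counter()
    for line in lines:
        m = ALERT_RE.match(line)
        if m and m.group("check") == check and m.group("state") == "PROBLEM":
            flaps[m.group("host")] += 1
    return flaps


# A few lines copied from the log above, used as sample input.
sample = [
    "[06:34:50] <icinga-wm>\t PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.",
    "[06:43:41] <icinga-wm>\t RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy",
    "[06:49:09] <icinga-wm>\t PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.",
    "[06:50:51] <icinga-wm>\t PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.",
]

if __name__ == "__main__":
    for host, count in count_flaps(sample).most_common():
        print(host, count)
```

Fed the whole log, a tally like this could help show whether one node stands out or whether all nodes flap at a similar rate.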
[09:58:37] <mobrovac>	 godog: moritzm: rb's got a sysV script, but it's managed with systemd
[09:59:02] <mobrovac>	 also, please depool each server before rebooting
[09:59:19] <mobrovac>	 we already have a high enough 5xx rate as it is
[09:59:31] <mobrovac>	 godog: that rb1007 seems to be really problematic
[10:00:25] <moritzm>	 what do you mean by "managed with systemd", there's no unit file for it?
[10:00:52] <mobrovac>	 no
[10:01:00] <mobrovac>	 systemd is using the sysV init script directly
[10:03:42] <grrrit-wm>	 (03PS1) 10Jcrespo: mariadb: update submodule in production repo [puppet] - 10https://gerrit.wikimedia.org/r/249365 
[10:05:24] <grrrit-wm>	 (03Abandoned) 10Jcrespo: mariadb: update submodule in production repo [puppet] - 10https://gerrit.wikimedia.org/r/249365 (owner: 10Jcrespo)
[10:06:20] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: mw1152: convert to be the HAT maintenance host [puppet] - 10https://gerrit.wikimedia.org/r/249110 (https://phabricator.wikimedia.org/T116728) 
[10:06:22] <grrrit-wm>	 (03PS1) 10Mobrovac: Labs: Parsoid Cache: Use new IP address for deployment-parsoidcache02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249366 (https://phabricator.wikimedia.org/T103660) 
[10:06:43] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] mw1152: convert to be the HAT maintenance host [puppet] - 10https://gerrit.wikimedia.org/r/249110 (https://phabricator.wikimedia.org/T116728) (owner: 10Giuseppe Lavagetto)
[10:08:34] <moritzm>	 mobrovac: ok, restbase startup also disabled
[10:08:39] <wikibugs>	 6operations, 6Analytics-Engineering, 10Beta-Cluster-Infrastructure, 7Varnish: On beta cluster varnish stats process points to production statsd - https://phabricator.wikimedia.org/T116898#1761231 (10hashar) 3NEW
[10:09:30] <mobrovac>	 kk moritzm
[10:10:09] <moritzm>	 mobrovac, godog: any special order needed/preferred, or are they all alike?
[10:10:22] <mobrovac>	 all the same
[10:10:33] <moritzm>	 I'll start by depooling 1001, then
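The one-host-at-a-time procedure discussed above (depool, then reboot) can be sketched as a dry-run planner. This is only an illustration: `expand_hosts` and `rolling_reboot_plan` are hypothetical helpers, the "wait for health checks" and "repool" steps are assumptions not stated in the conversation, and nothing here runs any real command.

```python
import re


def expand_hosts(pattern):
    """Expand one bracket range such as 'restbase100[1-6]' into host names."""
    m = re.fullmatch(
        r"(?P<prefix>[^\[\]]*)\[(?P<lo>\d+)-(?P<hi>\d+)\](?P<suffix>[^\[\]]*)",
        pattern,
    )
    if not m:
        # No bracket range present: treat the pattern as a single host name.
        return [pattern]
    lo, hi = int(m.group("lo")), int(m.group("hi"))
    width = len(m.group("lo"))  # keep zero padding if the range used it
    return [
        f"{m.group('prefix')}{n:0{width}d}{m.group('suffix')}"
        for n in range(lo, hi + 1)
    ]


def rolling_reboot_plan(hosts):
    """Return ordered, human-readable steps; one host is finished before the next starts."""
    plan = []
    for host in hosts:
        plan.extend([
            f"{host}: depool",                      # from the log: depool before rebooting
            f"{host}: reboot into the new kernel",  # from the log
            f"{host}: wait for endpoint health checks to recover",  # assumption
            f"{host}: repool",                      # assumption
        ])
    return plan


if __name__ == "__main__":
    for step in rolling_reboot_plan(expand_hosts("restbase100[1-6]")):
        print(step)
```

Keeping the plan as plain strings means it can be reviewed before anyone acts on it, which suits a cluster that is already showing an elevated 5xx rate.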
[10:11:09] <_joe_>	 !log reimaging mw1152
[10:11:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:14:27] <grrrit-wm>	 (03PS1) 10Jcrespo: Merge changes in the mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/249369 
[10:15:15] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Merge changes in the mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/249369 (owner: 10Jcrespo)
[10:17:51] <wikibugs>	 6operations, 10RESTBase-Cassandra: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1761257 (10fgiunchedi) judging from strace timing it seems mobileapps endpoint is slow, in the ~2s range and sometimes takes more than the `service_checker` timeout to reply (5s). running...
[10:19:28] <godog>	 mobrovac: ^ thoughts? (or anyone else really)
[10:20:48] <mobrovac>	 godog: inexplicably, this seems to correlate with rb1007 having latency problems
[10:21:00] <mobrovac>	 but, one node shouldn't matter
[10:21:26] <moritzm>	 godog: rb1001 would be ready to boot into the new kernel, you keeping an eye on the effect of https://gerrit.wikimedia.org/r/#/c/248960/ ?
[10:21:36] <godog>	 moritzm: yup
[10:21:55] <moritzm>	 k, rebooting 
[10:22:11] <godog>	 mobrovac: true it shouldn't, when we are done with the reboots we can try taking 1007 out of the cluster and see if that changes things
[10:22:39] <mobrovac>	 good idea godog!
[10:22:49] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: Timeout while attempting connection
[10:22:55] <grrrit-wm>	 (03PS2) 10Hashar: beta: use new IP for Parsoid Cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249366 (https://phabricator.wikimedia.org/T103660) (owner: 10Mobrovac)
[10:23:50] <grrrit-wm>	 (03CR) 10Hashar: [C: 032] "I fixed up the comments to refer to deployment-cache-parsoid05" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249366 (https://phabricator.wikimedia.org/T103660) (owner: 10Mobrovac)
[10:23:56] <grrrit-wm>	 (03Merged) 10jenkins-bot: beta: use new IP for Parsoid Cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249366 (https://phabricator.wikimedia.org/T103660) (owner: 10Mobrovac)
[10:28:58] <wikibugs>	 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: better cassandra process checks - https://phabricator.wikimedia.org/T108306#1761279 (10fgiunchedi) 5Open>3Resolved resolved by https://gerrit.wikimedia.org/r/#/c/249082/
[10:29:26] <grrrit-wm>	 (03Abandoned) 10Filippo Giunchedi: cassandra: switch to nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/230066 (https://phabricator.wikimedia.org/T108306) (owner: 10Filippo Giunchedi)
[10:37:13] <_joe_>	 !log manually removed crontab from mw1152, erroneously created by puppet
[10:37:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:39:10] <moritzm>	 !log updated kernel on restbase1001 to latest 3.19
[10:39:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:40:48] <icinga-wm>	 PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[10:45:47] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[10:49:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy
[10:50:40] <grrrit-wm>	 (03PS1) 10Jcrespo: Enabling performance schema experimentally on db1022 [puppet] - 10https://gerrit.wikimedia.org/r/249372 (https://phabricator.wikimedia.org/T99485) 
[10:56:14] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: cassandra: don't enable at boot [puppet] - 10https://gerrit.wikimedia.org/r/249374 
[11:01:27] <grrrit-wm>	 (03PS1) 10KartikMistry: Apertium: Add missing apertium-br-fr [puppet] - 10https://gerrit.wikimedia.org/r/249376 (https://phabricator.wikimedia.org/T102101) 
[11:01:42] <grrrit-wm>	 (03PS2) 10Jcrespo: Enabling performance schema experimentally on db1022 [puppet] - 10https://gerrit.wikimedia.org/r/249372 (https://phabricator.wikimedia.org/T99485) 
[11:02:12] <kart_>	 akosiaris: ^^small review for you :)
[11:03:19] <moritzm>	 kart_: today's a public holiday in Greece
[11:03:24] <kart_>	 oh.
[11:03:32] <kart_>	 not urgent.
[11:04:01] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Enabling performance schema experimentally on db1022 [puppet] - 10https://gerrit.wikimedia.org/r/249372 (https://phabricator.wikimedia.org/T99485) (owner: 10Jcrespo)
[11:06:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[11:08:25] <grrrit-wm>	 (03CR) 10Mobrovac: [C: 031] cassandra: don't enable at boot [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi)
[11:12:34] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "sadly this doesn't seem to work when tested in labs, puppet still wants to ensure the service is running, testing some more.." [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi)
[11:13:23] <wikibugs>	 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1761393 (10SBisson) What is blocking this ticket?
[11:15:36] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: maintenance: amend hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/249377 
[11:16:13] <_joe_>	 godog: I think we had some logic for that
[11:16:19] <_joe_>	 in base::service_unit
[11:16:48] <_joe_>	 but the problem is that puppet either manages the state of a service, or cannot think it is present.
[11:18:05] <grrrit-wm>	 (03PS14) 10Madhuvishy: burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) 
[11:18:07] <grrrit-wm>	 (03CR) 10Madhuvishy: "Thanks for the review, Ori!" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy)
[11:18:19] <icinga-wm>	 PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 4 failures
[11:19:05] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] maintenance: amend hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/249377 (owner: 10Giuseppe Lavagetto)
[11:19:47] <grrrit-wm>	 (03PS1) 10Joal: Correct pageview to dumps synchro [puppet] - 10https://gerrit.wikimedia.org/r/249378 
[11:19:57] <godog>	 _joe_: indeed, so in this case it'd be declare_service => false to have the service available but not (re)started by puppet nor at boot
[11:20:14] <_joe_>	 yep
[11:21:38] <godog>	 I'll fix the docs for service_unit too
[11:21:51] <icinga-wm>	 RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[11:22:01] <_joe_>	 godog: can you verify that works?
[11:23:22] <godog>	 I can
[11:24:21] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 04-1] "This has been on my radar too. AFAIK, all of that (and all of the remaining nfs.pp) is about to be deprecated *very soon* now, as in this " [puppet] - 10https://gerrit.wikimedia.org/r/249347 (owner: 10Dzahn)
[11:28:50] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: cassandra: stop declaring service resource [puppet] - 10https://gerrit.wikimedia.org/r/249374 
[11:29:46] <wikibugs>	 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1761440 (10hashar) Holy hell, how do you manage to install servers so fast ? :-}
[11:31:45] <wikibugs>	 6operations, 7Database: Adapt wmf-mariadb10 package for jessie or puppetize differently its service to adapt it to systemd - https://phabricator.wikimedia.org/T116903#1761444 (10jcrespo) 3NEW a:3jcrespo
[11:33:27] <godog>	 moritzm mobrovac https://gerrit.wikimedia.org/r/#/c/249374/2 this should do it for cassandra
[11:34:43] <mobrovac>	 godog: haven't we established ensure => present is not valid?
[11:34:56] <wikibugs>	 6operations, 7Database: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#1761462 (10jcrespo) I've filed T116903 and T116902 for the remaining tasks. I have already filed the Puppet-mariadb issues.  T105879 is no longer an issue, so I will repool the server now and close thi...
[11:35:01] <icinga-wm>	 PROBLEM - service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[11:35:15] <mobrovac>	 wtf?
[11:35:47] <mobrovac>	 godog: cass died on rb1007 ^^
[11:35:58] <grrrit-wm>	 (03PS1) 10Jcrespo: Repooling db1022 after checking its integrity and config changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249379 (https://phabricator.wikimedia.org/T101516) 
[11:36:00] <icinga-wm>	 PROBLEM - Cassandra CQL query interface on restbase1007 is CRITICAL: Connection refused
[11:36:22] <godog>	 mobrovac: indeed, likely the same as yesterday
[11:38:26] <wikibugs>	 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1761469 (10hashar)
[11:38:34] <grrrit-wm>	 (03PS2) 10Jcrespo: Repooling db1022 after checking its integrity and config changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249379 (https://phabricator.wikimedia.org/T101516) 
[11:39:08] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Repooling db1022 after checking its integrity and config changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249379 (https://phabricator.wikimedia.org/T101516) (owner: 10Jcrespo)
[11:41:09] <wikibugs>	 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1761472 (10hashar) Need to get rid of the Gerrit replication ( T86661 )  We will need to make sure all slaves (gallium.wikimedia.org and instances in contintcloud and inte...
[11:41:13] <jynus>	 mobrovac, ping me when you are less busy
[11:41:22] <grrrit-wm>	 (03CR) 10Muehlenhoff: "I think that could work, but I'm unsure what to expect from systems where puppet currently manages the service, does that have an effect o" [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi)
[11:41:28] <mobrovac>	 heh kk jynus
[11:41:33] <mobrovac>	 hope that'll happen today :)
[11:41:40] <jynus>	 it is about a tin commit
[11:42:15] <jynus>	 oh, it is beta, so probably doesn't matter
[11:42:31] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[11:42:48] <mobrovac>	 jynus: which one?
[11:42:58] <jynus>	 beta: use new IP for Parsoid Cache
[11:43:07] <hashar>	 jynus: oops sorry
[11:43:11] <hashar>	 I forgot to deploy that one on prod
[11:43:15] <hashar>	 it is harmless for prod 
[11:43:16] <jynus>	 i've merged it
[11:43:19] <hashar>	 sorry :-/
[11:43:29] <jynus>	 so I suppose I do not need to deploy it :-)
[11:43:38] <jynus>	 there was no problem at all
[11:44:00] <grrrit-wm>	 (03CR) 10Mobrovac: cassandra: stop declaring service resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi)
[11:44:25] <mobrovac>	 :)
[11:46:04] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1022 after maintenance (duration: 00m 19s)
[11:46:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:46:10] <icinga-wm>	 RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[11:52:27] <mobrovac>	 godog: should i restart cass on rb1007?
[11:53:53] <godog>	 mobrovac: heh I was looking at the logs but not a lot of luck, sure go ahead
[11:54:10] <mobrovac>	 yeah, me too, no luck
[11:54:35] <mobrovac>	 godog: i think we'll have to look into it after we finish the kernel upgrade
[11:54:44] <urandom>	 is it another OOM?
[11:55:07] <wikibugs>	 6operations, 7Database: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#1761490 (10jcrespo) 5stalled>3Resolved
[11:55:22] <urandom>	 it is
[11:55:34] <godog>	 mobrovac: *nod*
[11:55:38] <mobrovac>	 urandom: still awake?
[11:55:39] <godog>	 urandom: yep :|
[11:56:51] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "@Moritz, afaict it will stop managing the service from puppet's POV but leave it alone otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi)
[11:57:10] <icinga-wm>	 RECOVERY - service on restbase1007 is OK: OK - cassandra is active
[11:57:24] <urandom>	 godog, mobrovac: i'm moving that heap dump into a named subdirectory of my home
[11:57:35] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: cassandra: stop declaring service resource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi)
[11:57:40] <urandom>	 since the PID isn't being expanded, it'll be overwritten if not renamed
[11:58:09] <godog>	 urandom: yeah mobrovac pointed me to the phab ticket earlier about %p :|
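Note: the workaround urandom describes (moving the heap dump aside because the dump file name is not made unique per PID) can be sketched roughly as below. This is a minimal illustration only; the function name `preserve_heap_dump`, the timestamp-based naming, and the example paths are assumptions, not what was actually run on restbase1007.

```python
import shutil
import time
from pathlib import Path


def preserve_heap_dump(dump_path, dest_dir, now=None):
    """Move a heap dump to dest_dir under a timestamped name.

    Hypothetical helper: because the dump path is fixed (the PID
    placeholder is not expanded), a later dump would overwrite the
    file unless it is renamed or moved first.
    """
    src = Path(dump_path)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # UTC timestamp keeps successive dumps from colliding.
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now))
    target = dest / f"{src.stem}-{stamp}{src.suffix}"
    shutil.move(str(src), str(target))
    return target
```

Usage would look something like `preserve_heap_dump("/path/to/dump.hprof", "/home/someuser/heapdumps")`, where both paths are placeholders.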
[12:00:01] <icinga-wm>	 RECOVERY - Cassandra CQL query interface on restbase1007 is OK: TCP OK - 0.004 second response time on port 9042
[12:00:06] <wikibugs>	 6operations, 7Database, 5Patch-For-Review: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1761498 (10jcrespo) When setting db1022 on production, we lost some accounts and hosts:   ``` mysql> SHOW GLOBAL STATUS like 'performance%'; +---------------------------------------...
[12:02:41] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 031] "As discussed on IRC, let's try this on one of the restbase hosts with puppet disabled on the others" [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi)
[12:03:26] <grrrit-wm>	 (03PS3) 10Filippo Giunchedi: cassandra: stop declaring service resource [puppet] - 10https://gerrit.wikimedia.org/r/249374 
[12:03:32] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: stop declaring service resource [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi)
[12:03:44] <wikibugs>	 6operations, 7Database, 5Patch-For-Review: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1761502 (10jcrespo) By the way, this is a good guide to tune performance_schema: http://marcalff.blogspot.com.es/2013/04/on-configuring-performance-schema.html
[12:06:41] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "will need to change references to Service['cassandra'] too" [puppet] - 10https://gerrit.wikimedia.org/r/249374 (owner: 10Filippo Giunchedi)
[12:06:43] <godog>	 there will be some puppet failures coming up for restbase machines btw
[12:08:20] <icinga-wm>	 PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: puppet fail
[12:10:00] <icinga-wm>	 PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: puppet fail
[12:11:31] <godog>	 ok I'm going to revert https://gerrit.wikimedia.org/r/249374
[12:13:21] <icinga-wm>	 PROBLEM - puppet last run on restbase2005 is CRITICAL: CRITICAL: puppet fail
[12:13:47] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: Revert "cassandra: stop declaring service resource" [puppet] - 10https://gerrit.wikimedia.org/r/249382 
[12:13:51] <mobrovac>	 it seems we are going to spend the day on this :)
[12:14:19] <mobrovac>	 "just a simple kernel upgrade" <- famous last words of the day
[12:14:37] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "cassandra: stop declaring service resource" [puppet] - 10https://gerrit.wikimedia.org/r/249382 (owner: 10Filippo Giunchedi)
[12:14:40] <mobrovac>	 the punch is that the kernel upgrade itself is fine
[12:15:30] <icinga-wm>	 PROBLEM - puppet last run on aqs1003 is CRITICAL: CRITICAL: puppet fail
[12:18:59] <icinga-wm>	 PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: puppet fail
[12:19:26] <godog>	 mobrovac: hehe, anyways I think we should move on with the kernel upgrade at least, keep puppet disabled on the affected machines and then tackle the service management
[12:19:37] <godog>	 cc moritzm ^
[12:19:40] <grrrit-wm>	 (03PS1) 10Jcrespo: Increase the host and account size on P_S config [puppet] - 10https://gerrit.wikimedia.org/r/249385 
[12:20:17] <grrrit-wm>	 (03CR) 10Alex Monk: "tin and terbium use apache 2.2.22... I don't think they run MW though. tin has some sort of deployment stuff (trebuchet?) and terbium runs" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza)
[12:20:22] <mobrovac>	 godog: moritzm: sounds good to me
[12:23:06] <jynus>	 there seems to be some issue with puppet on stat* I will check it after lunch
[12:24:46] <wikibugs>	 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1761540 (10fgiunchedi) afaict our puppet hooks for jessie does include `thirdparty`  ```     package_builder::pbuilder...
[12:25:29] <icinga-wm>	 PROBLEM - puppet last run on restbase1004 is CRITICAL: CRITICAL: puppet fail
[12:26:10] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:27:19] <icinga-wm>	 RECOVERY - puppet last run on restbase1004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[12:28:30] <icinga-wm>	 RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[12:29:12] <moritzm>	 godog: ok, but puppet was removed on rb1001 already?
[12:29:22] <moritzm>	 disabled, not removed
[12:30:01] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:30:11] <moritzm>	 I'll repool rb1001 now if we don't need it for further debugging
[12:31:12] <godog>	 moritzm: true I was expecting it to be disabled on 1001 
[12:32:11] <moritzm>	 godog: ok, so you mean to proceed with 100[2-6] despite the fact that cassandra/rb get restarted upon reboot (which worked fine for 1001 anyway)?
[12:33:04] <godog>	 moritzm: yep, even though with puppet disabled and systemctl disable cassandra it shouldn't really start anything, btw puppet does seem to be enabled atm on 1002 for example
[12:35:39] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[12:36:52] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: service_checker: correctly return an error in case of timeouts [puppet] - 10https://gerrit.wikimedia.org/r/249388 (https://phabricator.wikimedia.org/T116739) 
[12:37:50] <icinga-wm>	 RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:38:55] <_joe_>	 godog: ^^
[12:39:34] <moritzm>	 godog: ok, I had it disabled via salt initially (and also got the "True" output from salt) (before it was re-enabled for your final puppet run). I just made a second run to disable it, but got no output from the salt run, I'll check that locally on the systems
[12:39:52] <moritzm>	 ok, but I'll proceed with 100[2-6], then
[12:40:01] <grrrit-wm>	 (03PS15) 10Madhuvishy: burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) 
[12:40:47] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] service_checker: correctly return an error in case of timeouts [puppet] - 10https://gerrit.wikimedia.org/r/249388 (https://phabricator.wikimedia.org/T116739) (owner: 10Giuseppe Lavagetto)
[12:40:56] <godog>	 moritzm: ack
[12:41:20] <icinga-wm>	 RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[12:43:07] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: service_checker: correctly return an error in case of timeouts [puppet] - 10https://gerrit.wikimedia.org/r/249388 (https://phabricator.wikimedia.org/T116739) 
[12:43:20] <icinga-wm>	 RECOVERY - puppet last run on aqs1003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[12:46:27] <grrrit-wm>	 (03PS1) 10Hashar: contint: set Zuul URL based on server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) 
[12:46:50] <icinga-wm>	 RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:47:10] <grrrit-wm>	 (03CR) 10Hashar: "Impacts gallium.wikimedia.org . Once applied, the zuul-merger process needs to be restarted." [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[12:47:21] <grrrit-wm>	 (03CR) 10Mobrovac: [C: 031] service_checker: correctly return an error in case of timeouts [puppet] - 10https://gerrit.wikimedia.org/r/249388 (https://phabricator.wikimedia.org/T116739) (owner: 10Giuseppe Lavagetto)
[12:47:45] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] service_checker: correctly return an error in case of timeouts [puppet] - 10https://gerrit.wikimedia.org/r/249388 (https://phabricator.wikimedia.org/T116739) (owner: 10Giuseppe Lavagetto)
[12:48:49] <mobrovac>	 moritzm: i still think it'd be advisable to use systemctl mask
[12:54:49] <moritzm>	 mobrovac, godog: I'm fine either way, we can certainly also use mask
[12:54:50] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:56:46] <moritzm>	 !log depooled restbase1002 for kernel update
[12:56:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:00:22] <hashar>	 could use a config change for Gerrit  to stop replication to gallium (the CI host) https://gerrit.wikimedia.org/r/#/c/244498/
[13:00:37] <hashar>	 the replication is no longer needed and I would like to clean it up from the server to unblock some other task
[13:01:01] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase2006 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[13:01:06] <wikibugs>	 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1761603 (10Joe) 5Open>3Resolved a:3Joe
[13:01:23] <wikibugs>	 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1756819 (10Joe)
[13:01:24] <wikibugs>	 6operations, 10RESTBase-Cassandra, 7Monitoring: service_checker reports success even on endpoints timing out - https://phabricator.wikimedia.org/T116770#1761607 (10Joe) 5Open>3Resolved a:3Joe
[13:02:53] <grrrit-wm>	 (03CR) 10Hashar: "Puppet compiler is happy: https://puppet-compiler.wmflabs.org/1111/gallium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[13:03:11] <icinga-wm>	 PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[13:04:03] <_joe_>	 uhm this ^^ has probably been there all along, but we will notice now that I corrected the bug.
[13:04:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase2003 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[13:05:07] <godog>	 yep very likely
[13:05:27] <_joe_>	 I'm going to debug it for a second
[13:05:33] <godog>	 I have to run to lunch, moritzm mobrovac page if you need anything
[13:06:09] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase-test2001 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:
[13:06:30] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase2004 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[13:06:43] <mobrovac>	 hm so content-encoding
[13:06:46] <mobrovac>	 interesting
[13:06:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wik
[13:06:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wiki
[13:07:03] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 7Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1761630 (10Andrew) thanks for fixing, sorry my patch was dumb :(
[13:07:17] <mobrovac>	 _joe_: could you ack all those for RB?
[13:07:36] <_joe_>	 mobrovac: yeah but this is a real problem
[13:07:46] <_joe_>	 mobrovac: rb responds with content-encoding: gzip
[13:07:53] <_joe_>	 but then has uncompressed output
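Note: the mismatch _joe_ describes (a `content-encoding: gzip` header on a body that is not actually compressed) can be detected with a check along these lines. This is a minimal sketch under assumptions: the function name `body_matches_encoding` and the plain dict of headers are made up for illustration and are not taken from service_checker or RESTBase.

```python
import gzip
import zlib

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def body_matches_encoding(headers, body):
    """Return True if the raw body is consistent with its declared encoding.

    `headers` is a plain dict of response headers and `body` is the raw,
    undecoded response payload as bytes.
    """
    normalized = {k.lower(): v for k, v in headers.items()}
    if normalized.get("content-encoding", "").lower() != "gzip":
        # Nothing declared, so there is nothing to verify here.
        return True
    if not body.startswith(GZIP_MAGIC):
        # Declared gzip but the payload is not a gzip stream.
        return False
    try:
        gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

A response like the ones in the alerts above would fail this check: the header says gzip, but the body begins with plain HTML rather than the gzip magic bytes.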
[13:08:04] <mobrovac>	 _joe_: i agree, we'll look into it
[13:08:22] <mobrovac>	 ah i think i know why
[13:08:42] <_joe_>	 I'll open a ticket and ack the alarms
[13:08:53] <mobrovac>	 cool, you can assign it to me _joe_
[13:08:56] <mobrovac>	 thnx
[13:11:19] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase2001 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[13:12:26] <wikibugs>	 6operations, 10RESTBase, 6Services: restbase endpoint reporting incorrect content-encoding: gzip - https://phabricator.wikimedia.org/T116911#1761636 (10Joe) 3NEW a:3mobrovac
[13:12:44] <moritzm>	 !log repooled restbase1002
[13:12:48] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:16:40] <moritzm>	 !log depooled restbase1003 for kernel update
[13:16:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:19:39] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase2002 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[13:22:21] <grrrit-wm>	 (03PS1) 10Hashar: contint: install nodejs-legacy on Debian [puppet] - 10https://gerrit.wikimedia.org/r/249391 
[13:24:02] <grrrit-wm>	 (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked on integration puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/249391 (owner: 10Hashar)
[13:24:52] <grrrit-wm>	 (03CR) 10Hashar: [C: 04-1] "fails on Precise ..." [puppet] - 10https://gerrit.wikimedia.org/r/249391 (owner: 10Hashar)
[13:25:50] <grrrit-wm>	 (03PS2) 10Hashar: contint: install nodejs-legacy on Debian [puppet] - 10https://gerrit.wikimedia.org/r/249391 
[13:26:25] <grrrit-wm>	 (03CR) 10Milimetric: [C: 031] Correct pageview to dumps synchro [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal)
[13:28:30] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy
[13:28:36] <grrrit-wm>	 (03CR) 10Hashar: [C: 031 V: 031] "cherry picked on integration puppetmaster . Pass on all distributions." [puppet] - 10https://gerrit.wikimedia.org/r/249391 (owner: 10Hashar)
[13:34:49] <grrrit-wm>	 (03PS2) 10Milimetric: Correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal)
[13:34:55] <moritzm>	 !log repooled restbase1003
[13:34:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:35:24] <moritzm>	 !log depooled restbase1004 for kernel update
[13:35:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:43:37] <wikibugs>	 7Puppet, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 5Patch-For-Review, and 2 others: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1761698 (10hashar) All good on permanent slaves.  When https://gerrit.wikimedia.org/r/#/c/244748/ is merged, we ca...
[13:47:33] <moritzm>	 !log repooled restbase1004
[13:47:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:54:19] <moritzm>	 !log depooled restbase1005 for kernel update
[13:54:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:56:18] <wikibugs>	 7Puppet, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 5Patch-For-Review, and 2 others: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1761752 (10hashar)
[13:57:19] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[13:57:20] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:58:03] * MatmaRex gently pokes godog about https://phabricator.wikimedia.org/T111838
[14:00:10] <grrrit-wm>	 (03PS2) 10Jcrespo: Increase the host and account size on P_S config [puppet] - 10https://gerrit.wikimedia.org/r/249385 
[14:01:12] <kart_>	 oh. Forgot to save deployment page earlier :/
[14:01:28] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Increase the host and account size on P_S config [puppet] - 10https://gerrit.wikimedia.org/r/249385 (owner: 10Jcrespo)
[14:05:26] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: role::statistics: fix file path [puppet] - 10https://gerrit.wikimedia.org/r/249396 
[14:05:57] <moritzm>	 !log repooled restbase1005
[14:06:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:06:21] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: role::statistics: fix file path [puppet] - 10https://gerrit.wikimedia.org/r/249396 
[14:06:35] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/249396 (owner: 10Giuseppe Lavagetto)
[14:08:59] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Enable ferm on tin [puppet] - 10https://gerrit.wikimedia.org/r/249398 
[14:09:10] <icinga-wm>	 RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[14:09:21] <wikibugs>	 6operations, 10Continuous-Integration-Infrastructure: Install Jenkins Job Builder on gallium - https://phabricator.wikimedia.org/T45141#1761789 (10hashar) 5Open>3declined a:3hashar For now on we deploy them manually.  Maybe we will get that moved to use scap3 and have it generate the jobs directly from t...
[14:09:32] <moritzm>	 !log depooled restbase1006 for kernel update
[14:09:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:10:13] <wikibugs>	 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1761796 (10hashar)
[14:10:19] <godog>	 MatmaRex: yup I've seen that but not a lot of bandwidth atm
[14:11:49] <MatmaRex>	 godog: are you really the only person who'd be able to look into that?
[14:12:24] <grrrit-wm>	 (03PS1) 10Subramanya Sastry: WIP: Update parsoid server.js path + removed stale parsoid file [puppet] - 10https://gerrit.wikimedia.org/r/249399 
[14:12:30] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: cassandra: unblacklist 'max' metric [puppet] - 10https://gerrit.wikimedia.org/r/249400 (https://phabricator.wikimedia.org/T116913) 
[14:12:42] <MatmaRex>	 (also, what has higher priority than data loss bugs? :/ i sure hope the whole site not going down is not depending on you propping it up all the time)
[14:13:53] <godog>	 MatmaRex: no I'm certainly not the only person
[14:18:19] <grrrit-wm>	 (03PS1) 10Matthias Mullie: Expire Flow caches after 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249402 (https://phabricator.wikimedia.org/T94029) 
[14:19:15] <grrrit-wm>	 (03CR) 10Matthias Mullie: [C: 04-2] "Do not merge before https://gerrit.wikimedia.org/r/#/c/247575/ has been in production for a while & having checked its impact." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249402 (https://phabricator.wikimedia.org/T94029) (owner: 10Matthias Mullie)
[14:20:28] <wikibugs>	 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1761834 (10saper) Question: wouldn't that be possible to ship the certificate as a parameter to `$wgForeignXXXRepos` and not...
[14:21:15] <grrrit-wm>	 (03PS3) 10Ottomata: Abstract rsync classes into a define, correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal)
[14:21:31] <moritzm>	 !log repooled restbase1006
[14:21:34] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:21:49] <icinga-wm>	 RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:21:49] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Abstract rsync classes into a define, correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal)
[14:22:37] <kart_>	 !log T112626 Finished running fix-stats.php for CX (from rwwiki to zuwiki)
[14:22:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:23:10] <grrrit-wm>	 (03PS4) 10Ottomata: Abstract rsync classes into a define, correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal)
[14:24:00] <grrrit-wm>	 (03CR) 10Mobrovac: [C: 04-1] "files/misc/parsoid seems to be used by File['/usr/bin/parsoid'] in the role::parsoid::common class (manifests/role/parsoid.pp:22)" [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry)
[15:38:38] <icinga-wm>	 PROBLEM - Host labstore1002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:38:39] <grrrit-wm>	 (03CR) 10Hashar: [C: 031] "That is needed to setup zuul-merger on scandium. Else the patch it merges will be reported as being on zuul.eqiad.wmnet which is gallium " [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[15:38:42] <icinga-wm>	 PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:42] <grrrit-wm>	 (03Draft1) 10Hashar: (WIP) contint: scandium configuration (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/249380 
[15:38:43] <icinga-wm>	 RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:38:43] <icinga-wm>	 PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:38:43] <Coren>	 Hm.  If we needed confirmation that NFS issues would alert noisily...
[15:38:44] <grrrit-wm>	 (03CR) 10Hashar: "This will get us access to scandium for the preliminary setup. It grants me root access like on gallium, will probably want to remove it o" [puppet] - 10https://gerrit.wikimedia.org/r/249380 (owner: 10Hashar)
[15:38:45] <icinga-wm>	 PROBLEM - Cassandra CQL query interface on restbase-test2003 is CRITICAL: Connection refused
[15:38:45] <icinga-wm>	 PROBLEM - Restbase root url on restbase-test2003 is CRITICAL: Connection refused
[15:38:45] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase-test2003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[15:38:46] <icinga-wm>	 ACKNOWLEDGEMENT - Host labstore1002 is DOWN: PING CRITICAL - Packet loss = 100% Coren Switch in progress
[15:38:46] <grrrit-wm>	 (03CR) 10Andrew Bogott: "This presumes that the merger is always running on the same host as Zuul -- is that a safe assumption? Otherwise we could use a hiera set" [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[15:38:46] <grrrit-wm>	 (03PS2) 10Subramanya Sastry: WIP: Update parsoid server.js path [puppet] - 10https://gerrit.wikimedia.org/r/249399 
[15:38:46] <grrrit-wm>	 (03CR) 10Subramanya Sastry: "That looks like a stale reference. But, I'll deal with that stale file and references in a separate patch." [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry)
[15:38:46] <grrrit-wm>	 (03PS1) 10coren: Labs: switch active labstore [puppet] - 10https://gerrit.wikimedia.org/r/249408 (https://phabricator.wikimedia.org/T107038) 
[15:38:46] <moritzm>	 ebernhardson, legoktm: I plan to enable ferm/firewall on tin before morning swat (for which you two are listed ATM). I don't expect any interference from the change, but if what you're planning to deploy is too critical I can also defer to another day
[15:38:46] <grrrit-wm>	 (03CR) 10coren: [C: 04-1] "Merge only after successful switch." [puppet] - 10https://gerrit.wikimedia.org/r/249408 (https://phabricator.wikimedia.org/T107038) (owner: 10coren)
[15:38:47] <grrrit-wm>	 (03CR) 10Hashar: "That is the other side. This patch stops assuming the merger runs on the same host as Zuul scheduler." [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[15:38:48] <grrrit-wm>	 (03PS16) 10Ottomata: burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy)
[15:38:48] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy)
[15:38:48] <ebernhardson>	 moritzm: worst case mine can go out 8 hours later, it's annoying but not the end of the world
[15:38:48] <moritzm>	 ebernhardson: ok, thanks. if anyone should cause problems, I'll be able to spot it quickly in the logs anyway
[15:38:48] <moritzm>	 anything, not anyone...
[15:38:49] <icinga-wm>	 PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures
[15:38:49] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Enable ferm on tin [puppet] - 10https://gerrit.wikimedia.org/r/249398 
[15:38:50] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on tin [puppet] - 10https://gerrit.wikimedia.org/r/249398 (owner: 10Muehlenhoff)
[15:38:53] <legoktm>	 moritzm: nope, nothing too critical
[15:38:55] * aude panics to put things in swat :)
[15:38:55] <moritzm>	 ebernhardson, legoktm: I've enabled it and added logging, ping me when you started and I'll monitor the logs if anything gets dropped
[15:38:55] <moritzm>	 no need to panic :-)
[15:38:58] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[15:38:58] <ebernhardson>	 legoktm: you wanna deploy this time around? :)
[15:38:58] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[15:38:58] <legoktm>	 ha, sure
[15:38:58] <legoktm>	 ebernhardson: is there any order your patches need to be deployed in?
[15:38:58] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[15:38:58] <ebernhardson>	 legoktm: my two patches are ordered (but not in a way that breaks things). Basically the first one has to go out and settle for a bit such that web pages stop requesting schema.Search from RL immediately on page load
[15:38:58] <ebernhardson>	 legoktm: by loading that so early in the process basically resource loader is serving up the schema without having received the code for some % of the time
[15:38:58] <ebernhardson>	 (because there's a reasonable chance of it being loaded from a machine that hasn't gotten the sync yet)
[15:38:58] <ebernhardson>	 so, if you could do my first one, then yours, then my last one, should be enough time
[15:38:58] <grrrit-wm>	 (03PS1) 10Aude: Add pageterms mobile api params for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249412 
[15:38:58] <aude>	 ebernhardson: are you doing swat?
[15:38:59] <ebernhardson>	 aude: lego is
[15:38:59] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add pageterms mobile api params for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249412 (owner: 10Aude)
[15:38:59] <aude>	 ah
[15:38:59] <ebernhardson>	 also, working out the query that needs to be added to fatalmonitor is basically just curl + jq
[15:38:59] <grrrit-wm>	 (03PS2) 10Aude: Add pageterms mobile api params for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249412 
[15:39:02] <grrrit-wm>	 (03CR) 10Andrew Bogott: [V: 032] contint: set Zuul URL based on server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[15:39:02] <grrrit-wm>	 (03PS2) 10Andrew Bogott: contint: set Zuul URL based on server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[15:39:02] <grrrit-wm>	 (03PS2) 10Hashar: contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 
[15:39:02] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] contint: set Zuul URL based on server fqdn [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[15:39:02] <grrrit-wm>	 (03PS3) 10Hashar: contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) 
[15:39:03] <grrrit-wm>	 (03CR) 10Hashar: "I have added reference to Bug: T95046 and filled T116921 to remember to remove the root access." [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[15:39:05] <logmsgbot>	 !log legoktm@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/modules/ext.wikimediaEvents.search.js: https://gerrit.wikimedia.org/r/#/c/249405/ (duration: 00m 18s)
[15:39:05] <legoktm>	 ebernhardson: ^ doing wmf4 now
[15:39:05] <icinga-wm>	 PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: puppet fail
[15:39:05] <aude>	 legoktm: i'd like https://gerrit.wikimedia.org/r/#/c/249412/ in swat
[15:39:05] <aude>	 only for test.wikidata for now
[15:39:05] <legoktm>	 ok, I'll do that next while the core patch merges
[15:39:05] <aude>	 ok, thanks
[15:39:05] <grrrit-wm>	 (03CR) 10Hashar: "/etc/zuul/zuul-merger.conf" [puppet] - 10https://gerrit.wikimedia.org/r/249389 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[15:39:05] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: cassandra: fix HeapDumpPath and ErrorFile settings [puppet] - 10https://gerrit.wikimedia.org/r/249419 (https://phabricator.wikimedia.org/T116814) 
[15:39:05] <logmsgbot>	 !log legoktm@tin Synchronized php-1.27.0-wmf.4/extensions/WikimediaEvents/modules/ext.wikimediaEvents.search.js: https://gerrit.wikimedia.org/r/#/c/249404/ (duration: 00m 18s)
[15:39:05] <legoktm>	 ebernhardson: ^
[15:39:05] <grrrit-wm>	 (03CR) 10Legoktm: [C: 032] Add pageterms mobile api params for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249412 (owner: 10Aude)
[15:39:05] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add pageterms mobile api params for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249412 (owner: 10Aude)
[15:39:05] <moritzm>	 ebernhardson, legoktm: no problems from the firewall rules, BTW. no traffic not covered by the existing rules gets dropped
[15:39:06] <ebernhardson>	 legoktm: looks sane, but i basically have to wait for RL to start sending that out (a few minutes)
[15:39:06] <logmsgbot>	 !log legoktm@tin Synchronized wmf-config/Wikibase.php: https://gerrit.wikimedia.org/r/#/c/249412/ (duration: 00m 17s)
[15:39:06] <aude>	 looking
[15:39:06] <legoktm>	 aude: ^
[15:39:06] <ebernhardson>	 bd808: is there some proxy i can talk to from fluorine to query logstash elasticsearch?
[15:39:06] <aude>	 hmm, might not be perfect but not horribly broken or such
[15:39:06] * aude investigates and probably has follow up later
[15:39:06] <aude>	 ok, wikidata is broken
[15:39:06] <aude>	 can't be related
[15:39:06] <aude>	 https://www.wikidata.org/wiki/Special:Random
[15:39:06] <aude>	 legoktm: can we revert?
[15:39:06] <aude>	 in case it's related
[15:39:06] <legoktm>	 wait what
[15:39:06] <legoktm>	 ok
[15:39:06] <aude>	 (Cannot access the database: Unknown database 'wikidatawiki' (10.64.0.19))
[15:39:06] <aude>	 which i can't imagine how my change would cause that
[15:39:06] <aude>	 and test.wikidata is ok
[15:39:06] <grrrit-wm>	 (03PS1) 10Legoktm: Revert "Add pageterms mobile api params for test.wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249420 
[15:39:06] <aude>	 thanks
[15:39:06] <grrrit-wm>	 (03CR) 10Legoktm: [C: 032 V: 032] Revert "Add pageterms mobile api params for test.wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249420 (owner: 10Legoktm)
[15:39:06] <aude>	 oh
[15:39:06] <aude>	 yes, my mistake
[15:39:06] <aude>	 only affects wikidata
[15:39:06] <bd808>	 ebernhardson: You should be able to talk to the apache that runs logstash.wm.o from anywhere and run queries that would be possible through Kibana, but I haven't tried to actually do it.
[15:39:06] <bd808>	 ebernhardson: I cheat and ssh into logstash hosts
[15:39:06] <logmsgbot>	 !log legoktm@tin Synchronized wmf-config/Wikibase.php: revert (duration: 00m 17s)
[15:39:06] <legoktm>	 aude: random is working now
[15:39:06] <aude>	 thanks
[15:39:06] <ebernhardson>	 bd808: yea that's where i tested, but i'm trying to add the count of errors to fatalmonitor on fluorine. I have the query + jq command worked out, just have to figure out how to run it now
[15:39:06] <ebernhardson>	 bd808: i'll try the public endpoint, thanks
[15:39:06] <bd808>	 it needs http basic auth :?
[15:39:06] <aude>	 totally my fault, but thanks
[15:39:06] <ebernhardson>	 bd808: oh the frontend would, yea
[15:39:06] <legoktm>	 ok, I'm going to do my core/Echo patches now
[15:39:06] <ebernhardson>	 bd808: maybe i'll just add a ferm exception
[15:39:06] <bd808>	 The cluster was open to $INTERNAL until this week but we closed it down to protect sensitive log data
[15:39:06] <grrrit-wm>	 (03CR) 10Eevans: [C: 031] cassandra: fix HeapDumpPath and ErrorFile settings [puppet] - 10https://gerrit.wikimedia.org/r/249419 (https://phabricator.wikimedia.org/T116814) (owner: 10Filippo Giunchedi)
[15:39:07] <ebernhardson>	 bd808: fluorine is rather limited (Deployers, plus a group that can read logs but not run anything), but if that's not limited enough i can figure something else out
[15:39:07] <ebernhardson>	 since the logs are already there, imo it's not a big deal, but willing to work around
[15:39:08] <bd808>	 ebernhardson: *nod* I think csteipp was mostly concerned about any internal host having access
[15:39:08] <logmsgbot>	 !log legoktm@tin Synchronized php-1.27.0-wmf.4/includes/: LinksUpdate: Keep track of the triggering User - https://gerrit.wikimedia.org/r/#/c/249350/ (duration: 00m 22s)
[15:39:08] <bd808>	 if it is a reasonably restricted host he will probably be ok with it
[15:39:08] <ebernhardson>	 bd808: ok thanks ill check
[15:39:08] <gwicke>	 !log restarting cassandra on restbase1007 with 20G heap
[15:39:08] <bd808>	 ebernhardson: I actually have a toy written in go just for querying the ELK cluster.
[15:39:08] <bd808>	 https://github.com/bd808/ggml
[15:39:09] <logmsgbot>	 !log legoktm@tin Synchronized php-1.27.0-wmf.4/extensions/Echo/: https://gerrit.wikimedia.org/r/#/c/249351/ and https://gerrit.wikimedia.org/r/#/c/249411/ (duration: 00m 18s)
[15:39:09] <legoktm>	 ebernhardson: ready for your second set of patches?
[15:39:10] <ebernhardson>	 legoktm: should be ready by now yes, thanks
[15:39:11] <wikibugs>	 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1761845 (10fgiunchedi) 5Resolved>3Open this isn't resolved as mobileapps still seems slow, related {T116770} is resolved though
[15:39:11] <wikibugs>	 6operations, 10RESTBase, 6Services, 7RESTBase-architecture: Update restbase100[1-6] to the 3.19 kernel - https://phabricator.wikimedia.org/T102234#1761851 (10MoritzMuehlenhoff) 5Open>3Resolved restbase100[1-6] have been updated to the 3.19 kernel.
[15:39:11] <wikibugs>	 6operations, 5Patch-For-Review: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1761853 (10MoritzMuehlenhoff)
[15:39:12] <logmsgbot>	 !log legoktm@tin Synchronized php-1.27.0-wmf.4/extensions/WikimediaEvents/WikimediaEvents.php: https://gerrit.wikimedia.org/r/249317 (duration: 00m 17s)
[15:39:13] <icinga-wm>	 RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 918684 bytes in 2.966 second response time
[15:39:33] <grrrit-wm>	 (03PS12) 10Alex Monk: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man)
[15:40:04] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 031] Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man)
[15:40:19] <grrrit-wm>	 (03PS13) 10Alex Monk: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man)
[15:40:28] <icinga-wm>	 RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.048 second response time
[15:40:31] <wikibugs>	 6operations, 10Beta-Cluster-Infrastructure, 10Traffic, 5Patch-For-Review: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1762095 (10hashar) 5Open>3Resolved {T103660} has finally been solved. That was the last Varnish cache still using Trusty.
[15:40:40] <logmsgbot>	 !log legoktm@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php: https://gerrit.wikimedia.org/r/249316 (duration: 00m 18s)
[15:40:43] <legoktm>	 ebernhardson: all done ^^
[15:40:57] <hashar>	 bblack: for information all Varnishes on beta cluster are now using Jessie :-}
[15:41:12] <grrrit-wm>	 (03CR) 10coren: [C: 031] "This needs merging now so that some ancillary labstore services move to 1001 (backups, mostly)" [puppet] - 10https://gerrit.wikimedia.org/r/249408 (https://phabricator.wikimedia.org/T107038) (owner: 10coren)
[15:41:30] <grrrit-wm>	 (03CR) 10Muehlenhoff: "PS13 includes base::firewall twice?" [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man)
[15:41:37] <ebernhardson>	 legoktm: ok checking
[15:41:47] <icinga-wm>	 RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[15:42:01] <Coren>	 YuviPanda: andrewbogott: When either of you have a minute, https://gerrit.wikimedia.org/r/#/c/249408/
[15:42:32] <ebernhardson>	 legoktm: everything looks reasonable
[15:42:32] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 04-1] Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man)
[15:42:34] <ebernhardson>	 legoktm: thanks
[15:42:47] <legoktm>	 awesome
[15:42:48] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Labs: switch active labstore [puppet] - 10https://gerrit.wikimedia.org/r/249408 (https://phabricator.wikimedia.org/T107038) (owner: 10coren)
[15:42:51] <legoktm>	 SWAT is done!
[15:42:57] <andrewbogott>	 Coren: I’ll merge as soon as jenkins catches up
[15:43:46] <Coren>	 And in today's good news - even the unplanned issue didn't detract from the successful switch, and instances recovered wondering why they had a 1h blackout.  :-)
[15:44:01] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Labs: switch active labstore [puppet] - 10https://gerrit.wikimedia.org/r/249408 (https://phabricator.wikimedia.org/T107038) (owner: 10coren)
[15:45:45] <andrewbogott>	 morebots, how’s it going?
[15:45:45] <morebots>	 I am a logbot running on tools-exec-1203.
[15:45:45] <morebots>	 Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[15:45:45] <morebots>	 To log a message, type !log <msg>.
[15:46:04] <andrewbogott>	 !log testing the bot after the nfs move
[15:46:08] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:49:44] <grrrit-wm>	 (03PS1) 10Chad: dsh: remove scap-test group, unused [puppet] - 10https://gerrit.wikimedia.org/r/249428 
[15:50:10] <ostriches>	 mutante: hehe ^ :)
[15:51:30] <grrrit-wm>	 (03Abandoned) 10Chad: beta: Start using parsoid cache 04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247332 (owner: 10Chad)
[15:52:04] <grrrit-wm>	 (03PS9) 10Chad: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis)
[15:52:11] <grrrit-wm>	 (03CR) 10BryanDavis: [C: 031] "Ori made this for testing HHVM restarts during scap. We tested, found out it was a horrible idea since we couldn't properly signal pybal t" [puppet] - 10https://gerrit.wikimedia.org/r/249428 (owner: 10Chad)
[15:52:39] <grrrit-wm>	 (03PS1) 10Aude: Add settings for displaying labels on test.wikidata in mobile search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249431 
[15:54:33] <grrrit-wm>	 (03CR) 10Andrew Bogott: "This is great! I'm going to add a few more clarifying comments, while we're at it..." [puppet] - 10https://gerrit.wikimedia.org/r/249342 (owner: 10Dzahn)
[15:55:07] <ottomata>	 moritzm:  any insights on this burrow package?
[15:56:49] <grrrit-wm>	 (03PS2) 10Aude: Add settings for displaying labels on test.wikidata in mobile search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249431 
[15:57:12] <grrrit-wm>	 (03CR) 10Aude: "have manually tested this locally and it works properly, regardless of how $wgMFUseWikibaseDescription is set (and we are going to enable " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249431 (owner: 10Aude)
[15:58:10] <aude>	 greg-g: legoktm i'd like to try deploying https://gerrit.wikimedia.org/r/#/c/249431/
[15:58:36] <aude>	 tested it more carefully locally, and it works ok
[15:58:41] <aude>	 as intended :)
[15:59:01] <legoktm>	 aude: I think greg is out today :( 
[15:59:05] <aude>	 ok
[15:59:10] <legoktm>	 do you want me to deploy it or were you?
[15:59:17] <aude>	 if you want to , or i can
[15:59:33] <legoktm>	 uhh, could you? :)
[15:59:34] <aude>	 would be nice to have this stuff live on test.wikidata a bit before we enable on wikidata
[15:59:40] <aude>	 sure :)
[15:59:42] <legoktm>	 might be a good idea to try it on mw1017 first
[16:00:19] <aude>	 i think it's ok, but sure :)
[16:00:19] * aude stuck the config inside an existing if/else block for testwikidata
[16:00:48] * aude tries on mw1017
[16:01:06] <wikibugs>	 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1762160 (10coren)
[16:01:10] <wikibugs>	 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 5 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1762157 (10coren) 5Open>3Resolved This was confirmed in trial by fire with the switch of labs NFS back to labstore1001.
[16:03:49] <wikibugs>	 6operations, 10ops-eqiad, 10Incident-20150401-LabsNFS-Overload: Inspect and diagnose labstore1001's H800 controler - https://phabricator.wikimedia.org/T95293#1762168 (10coren) 5Open>3Resolved a:3coren Resolved by the switchover test to end all switchover tests: labstore1001 is now back to being the pri...
[16:07:45] <grrrit-wm>	 (03CR) 10Aude: [C: 032] Add settings for displaying labels on test.wikidata in mobile search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249431 (owner: 10Aude)
[16:07:52] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add settings for displaying labels on test.wikidata in mobile search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249431 (owner: 10Aude)
[16:11:21] <ostriches>	 legoktm, aude: Yeah, greg's out today, pulled his shoulder.
[16:13:18] <aude>	 :(
[16:13:33] <aude>	 i think greg knows what we are doing generally and it's ok :)
[16:13:50] * aude will be enabling more geodata stuff on wikidata tomorrow and will put on the deployment calendar
[16:15:00] <aude>	 all good on mw1017 :)
[16:15:52] <logmsgbot>	 !log aude@tin Synchronized wmf-config/InitialiseSettings.php: enable wikibase descriptions on test.wikidata (duration: 00m 17s)
[16:15:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:18:14] <grrrit-wm>	 (03CR) 10Eevans: [C: 031] cassandra: unblacklist 'max' metric [puppet] - 10https://gerrit.wikimedia.org/r/249400 (https://phabricator.wikimedia.org/T116913) (owner: 10Filippo Giunchedi)
[16:18:18] <logmsgbot>	 !log aude@tin Synchronized wmf-config/Wikibase.php: add settings for displaying labels on test.wikidata in mobile (duration: 00m 18s)
[16:18:21] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:18:44] <aude>	 wikidata is ok :)
[16:19:32] <aude>	 done
[16:21:13] <icinga-wm>	 RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[16:22:32] <wikibugs>	 6operations, 10netops: drain ULSFO of all traffic on 2015-11-02 @ 0900 PST - https://phabricator.wikimedia.org/T116928#1762244 (10RobH) 3NEW a:3faidon
[16:22:49] <grrrit-wm>	 (03CR) 10Andrew Bogott: "I have a new version of this patch prepared but can't submit for network reasons... will upload in a couple of hours." [puppet] - 10https://gerrit.wikimedia.org/r/249342 (owner: 10Dzahn)
[16:23:14] <wikibugs>	 6operations, 10netops: drain ULSFO of all traffic on 2015-11-02 @ 0900 PST - https://phabricator.wikimedia.org/T116928#1762244 (10RobH) @Faidon,  Please advise if this is not an appropriate time to do this.  Previous discussion (linked off T116924) denotes that 'anytime' works.
[16:25:53] <wikibugs>	 6operations, 10netops: drain ULSFO of all traffic on 2015-11-02 before 0900 PST - https://phabricator.wikimedia.org/T116928#1762269 (10RobH)
[16:26:18] <grrrit-wm>	 (03PS1) 10Ottomata: Open up port 8000 for burrow (on krypton) [puppet] - 10https://gerrit.wikimedia.org/r/249433 (https://phabricator.wikimedia.org/T115669) 
[16:28:12] <grrrit-wm>	 (03CR) 10Muehlenhoff: Open up port 8000 for burrow (on krypton) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249433 (https://phabricator.wikimedia.org/T115669) (owner: 10Ottomata)
[16:28:33] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase-test2002 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:
[16:28:54] <grrrit-wm>	 (03PS2) 10Ottomata: Open up port 8000 for burrow (on krypton) [puppet] - 10https://gerrit.wikimedia.org/r/249433 (https://phabricator.wikimedia.org/T115669) 
[16:29:27] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 031] Open up port 8000 for burrow (on krypton) [puppet] - 10https://gerrit.wikimedia.org/r/249433 (https://phabricator.wikimedia.org/T115669) (owner: 10Ottomata)
[16:30:39] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032 V: 032] Open up port 8000 for burrow (on krypton) [puppet] - 10https://gerrit.wikimedia.org/r/249433 (https://phabricator.wikimedia.org/T115669) (owner: 10Ottomata)
[16:31:21] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: cassandra: fix HeapDumpPath and ErrorFile settings [puppet] - 10https://gerrit.wikimedia.org/r/249419 (https://phabricator.wikimedia.org/T116814) 
[16:31:29] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: fix HeapDumpPath and ErrorFile settings [puppet] - 10https://gerrit.wikimedia.org/r/249419 (https://phabricator.wikimedia.org/T116814) (owner: 10Filippo Giunchedi)
[16:36:35] <wikibugs>	 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1762316 (10ellery) I have never used data on landing page impressions (I only use banner impressions and clicks).
[16:37:22] <mutante>	 moritzm: yay @ ferm on tin
[16:37:26] <grrrit-wm>	 (03PS2) 10Ottomata: reqstats: test on cp1065 (eqiad text) as well [puppet] - 10https://gerrit.wikimedia.org/r/249237 (owner: 10BBlack)
[16:38:22] <grrrit-wm>	 (03PS2) 10Dzahn: dsh: remove scap-test group, unused [puppet] - 10https://gerrit.wikimedia.org/r/249428 (owner: 10Chad)
[16:38:51] <moritzm>	 mutante: indeed, if you want to you can merge the unification patch for tin/mira later
[16:39:03] <mutante>	 moritzm: but not the changed to unify mira and tin yet...looking why
[16:39:08] <mutante>	 change
[16:39:14] <mutante>	 yes, that:) ok
[16:40:21] <moritzm>	 the isolated enable fix was easier to revert in case of problems
[16:40:37] <mutante>	 makes sense. yep. i'm amending and doing that now
[16:40:50] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] reqstats: test on cp1065 (eqiad text) as well [puppet] - 10https://gerrit.wikimedia.org/r/249237 (owner: 10BBlack)
[16:41:12] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: cassandra: unblacklist 'max' metric [puppet] - 10https://gerrit.wikimedia.org/r/249400 (https://phabricator.wikimedia.org/T116913) 
[16:41:23] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: unblacklist 'max' metric [puppet] - 10https://gerrit.wikimedia.org/r/249400 (https://phabricator.wikimedia.org/T116913) (owner: 10Filippo Giunchedi)
[16:41:49] <grrrit-wm>	 (03PS14) 10Dzahn: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man)
[16:41:52] <icinga-wm>	 PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/en.wikipedia.org/v1/page/mobile-html/Main_Page: Generic connection error: (Received response with content-encoding: gzip, but failed to decode it., error(Error -3 while decompressing: incorrect header check,)): /page/mobile-html/{title} is CRITICAL: Could not fetch url http://127.0.0.1:7231/
[16:42:47] <grrrit-wm>	 (03PS15) 10Dzahn: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man)
[16:43:13] <ottomata>	 hmm, bblack
[16:43:13] <ottomata>	 Oct 28 16:41:47 cp1065 systemd[1]: [/lib/systemd/system/varnishreqstats-frontend.service:3] Failed to add dependency on varnish-frontend, ignoring: Invalid argument
[16:43:13] <ottomata>	 Oct 28 16:41:47 cp1065 systemd[1]: [/lib/systemd/system/varnishreqstats-frontend.service:4] Failed to add dependency on varnish-frontend, ignoring: Invalid argument
[16:44:38] <mutante>	 moritzm: i checked if mira and tin have the same number of iptables lines.. tin has a few more. one seems to be LOGGING. i assume that was to check all is ok
[16:44:52] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 031] "works for me, but I would like Chase to review" [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[16:45:54] <moritzm>	 mutante: I dropped some logging rules during the morning swat
[16:46:27] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "tin has firewalling now:) so they are both identical" [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: 10Hoo man)
[16:46:56] <mutante>	 moritzm: yep, the diff is only the logging. alright
[16:47:31] <mutante>	 @seen hoo
[16:47:31] <wm-bot>	 mutante: Last time I saw hoo they were quitting the network with reason: Quit: AndroIRC - Android IRC Client ( http://www.androirc.com ) N/A at 10/28/2015 1:14:05 PM (3h33m26s ago)
[16:49:54] <wikibugs>	 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1762373 (10Dzahn) tin got firewalling during this morning's swat deploy.  that meant tin and mira are now identical and we could merge hoo's change above to reflect...
[16:50:33] <wikibugs>	 6operations, 10Continuous-Integration-Config, 5Patch-For-Review: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1762381 (10Andrew) 5Open>3Resolved
[16:50:47] <mutante>	 moritzm: i fixed puppet compiler run for stuff on puppetmaster, so the palladium change +1 
[16:51:00] <mutante>	 (by adding fake new_install keys to labs/private)
[16:51:03] <moritzm>	 ok, I'll have a look later or tomorrow
[16:51:22] <mutante>	 mira/tin look all fine
[16:51:26] <mutante>	 great
[16:51:33] <wikibugs>	 6operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1762392 (10bd808) my $0.02:  Fatalmonitor is hack of tail + sed + awk + watch, and only sees things that are reported to hhvm.log.  Adding a curl + jq component to it that reports one number is not going to give...
[16:51:34] <ostriches>	 We need to land the co-master thing tho.
[16:52:04] <ostriches>	 https://gerrit.wikimedia.org/r/#/c/224829/
[16:52:43] <mutante>	 ok, just looked at the pending blockers for that tracking ticket
[16:52:55] <wikibugs>	 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1762411 (10GWicke) I think this was caused by a bug in the mobile app end point monitoring spec. See https://github.com/wikimedia/restbase/pull/389 for a fix.
[16:52:57] <mutante>	 for codfw deployment server
[16:53:13] <mutante>	 "[scap] Add support for syncing /srv/mediawiki-staging including fully working git data to warm spare deploy server"
[16:53:16] <ostriches>	 Yep :)
[16:53:24] <mutante>	 ok
[16:53:29] <ostriches>	 The scap bits have already been merged.
[16:53:33] <mutante>	 nice
[16:53:40] <ostriches>	 Just need the puppet config.
[16:53:53] <wikibugs>	 6operations: Opendj on Neptunium running java 6, on Nembus java 7 - https://phabricator.wikimedia.org/T107424#1762423 (10Andrew) 5Open>3declined I think this is moot -- It works, and we're hoping to kill off opendj anyway.
[16:54:06] <moritzm>	 !log installed openjdk security updates on zookeeper hosts
[16:54:09] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:54:27] <mutante>	 hm, i see a bit of sudo and wrapper script discussion there.. yea..
[16:54:48] <godog>	 !log unblacklist 'max' cassandra metrics and restart cassandra-metrics-collector
[16:54:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:55:20] <grrrit-wm>	 (03CR) 10Chad: "Although, the discussion now is about moving scap towards using the info directly from etcd (or having etcd write the dsh files) which wou" [puppet] - 10https://gerrit.wikimedia.org/r/247324 (owner: 10Chad)
[16:55:29] <grrrit-wm>	 (03PS3) 10Dzahn: dsh: remove scap-test group, unused [puppet] - 10https://gerrit.wikimedia.org/r/249428 (owner: 10Chad)
[16:55:44] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] dsh: remove scap-test group, unused [puppet] - 10https://gerrit.wikimedia.org/r/249428 (owner: 10Chad)
[16:58:08] <mutante>	 jouncebot: next parsoid
[16:58:08] <jouncebot>	 In 1 hour(s) and 1 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T1800)
[16:58:34] <mutante>	 how about parsoid. for https://gerrit.wikimedia.org/r/#/c/249321/
[16:59:19] <grrrit-wm>	 (03CR) 10Rush: [C: 04-1] contint: scandium configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[17:00:34] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.33% of data above the critical threshold [1000.0]
[17:01:36] <grrrit-wm>	 (03CR) 10Hashar: contint: scandium configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[17:01:38] <ostriches>	 twentyafterfour: Any objection to me updating production scap to master before the train?
[17:01:45] <ostriches>	 (it's already running master in beta)
[17:02:04] <hashar>	 chasemp: I am not sure how to set the $cluster and $nagios_contact_group via hiera:-/
[17:02:56] <mutante>	 hashar: contact_group: admins,parsoid
[17:03:02] <mutante>	 for example
[17:03:10] <twentyafterfour>	 ostriches: I don't see any problem with it
[17:03:11] <hashar>	 going to update gallium :)
[17:03:14] <mutante>	 or hosts/gadolinium.yaml:contactgroups: 'admins,analytics'
[17:03:26] <mutante>	 eh, wait
[17:03:37] <mutante>	 contact_group vs. contactgroups 
[17:03:45] <mutante>	 JohnFLewis: ^ :p
[17:04:09] <mutante>	 we have both, but i can confirm "contactgroups" works
[17:04:13] <mutante>	 like here:
[17:04:16] <JohnFLewis>	 mutante: contactgroups:
[17:04:19] <mutante>	 role/common/lists.yaml:contactgroups: admins,mailman-admins
[17:04:23] <mutante>	 hashar: ^ that
[17:04:26] <mutante>	 via role
[17:04:31] <wikibugs>	 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1762460 (10Mooeypoo) I couldn't reproduce the specific problems in this bug, but it also shows several issues with Echo and m...
[17:04:32] <icinga-wm>	 ACKNOWLEDGEMENT - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.33% of data above the critical threshold [1000.0] Filippo Giunchedi expected, cassandra metric unblacklisted
[17:04:36] <wikibugs>	 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1762462 (10GWicke)
[17:04:41] <JohnFLewis>	 it is contactgroups: the underscore felt, eh to me :)
[17:04:43] <grrrit-wm>	 (03PS1) 10Hashar: gallium: migrate cluster/contact to hiera [puppet] - 10https://gerrit.wikimedia.org/r/249441 
[17:04:47] <mutante>	 JohnFLewis: let's fix common/lvs/configuration.yaml:      contact_group: admins,parsoid
[17:05:08] <mutante>	 unless that is not the same 
[17:05:34] <grrrit-wm>	 (03CR) 10John F. Lewis: [C: 04-1] "contactgroups: not contact_group:" [puppet] - 10https://gerrit.wikimedia.org/r/249441 (owner: 10Hashar)
[17:05:46] <mutante>	 hashar: and $cluster is just "cluster: foo"
[17:05:47] <JohnFLewis>	 mutante: should be fixed yeah
[17:05:56] <hashar>	 ok ok :)
[17:06:10] <chasemp>	 hashar: in general for the functions scandium is taking over I would like to define them into role and use the role keyword
[17:06:33] <hashar>	 chasemp: yeah the idea is to use  role zuul::merger
[17:06:48] <hashar>	 but I don't want to get zuul merger enabled there before I get shell access to verify how the network flows behave
[17:06:57] <chasemp>	 ok I understand then thanks
[17:07:03] <hashar>	 sorry should have made it clear
[17:07:17] <wikibugs>	 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1762473 (10Mooeypoo) More to the point, I don't think this should be "blocked". Some aspects of this ticket are under develop...
[17:07:21] <grrrit-wm>	 (03PS2) 10Hashar: gallium: migrate cluster/contact to hiera [puppet] - 10https://gerrit.wikimedia.org/r/249441 
[17:09:15] <wikibugs>	 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1762476 (10Mattflaschen) Yes, my understanding is we took it because although we thought the fix would touch mobile code (and...
[17:10:33] <grrrit-wm>	 (03PS4) 10Hashar: contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) 
[17:12:33] <ostriches>	 bleh, stuck at 15 hosts not fetching.
[17:12:40] * ostriches stabs trebuchet a bit
[17:12:46] <grrrit-wm>	 (03CR) 10Hashar: "The $cluster and nagios contact group are now in hiera similar to how I did it for gallium: https://gerrit.wikimedia.org/r/249441" [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[17:13:39] <ostriches>	 tin and mira among the ones that won't fetch, which are 2 I want, heh
[17:14:40] <hashar>	 ostriches: will you be able to baby sit the patch to stop gerrit replication to gallium ?  https://gerrit.wikimedia.org/r/#/c/244498/     I asked a bit today but without luck :-}
[17:14:48] <ostriches>	 Yeah
[17:15:01] <hashar>	 ostriches: I don't mind handling the cleanup on gallium
[17:15:36] <chasemp>	 mutante: andrewbogott are we in agreement that because https://gerrit.wikimedia.org/r/#/c/249380/4 is access granted for the same service on a new box it doesn't require any kind of sudo perms approval?  this is lateral migration not escalation
[17:15:38] <_joe_>	 ostriches: let me know if you need help from me for working towards making mira able to deploy
[17:16:31] <andrewbogott>	 chasemp: yeah, I don’t think it counts as new access.
[17:16:46] <grrrit-wm>	 (03PS6) 10Rush: Drop cirrussearch write jobs after 3 hours of failures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson)
[17:16:52] <ostriches>	 !log deployed scap master@abe1973 to cluster
[17:16:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:16:59] <hashar>	 chasemp: mutante: andrewbogott: I could use temp root on scandium to do the service implementation. It might not be needed but having it is a nice convenience.  I filed a task to remember to remove the root access
[17:17:11] <ostriches>	 _joe_: It's just that one last patch for puppet :)
[17:17:16] <grrrit-wm>	 (03CR) 10Rush: "what's the story with this patch? We are publishing these updates now right so we need to roll on this to be more safe(-ish)?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson)
[17:17:41] <chasemp>	 hashar: you have root on the current zuul merger right?
[17:17:50] <bd808>	 ostriches: and https://gerrit.wikimedia.org/r/#/c/247965/
[17:17:54] <hashar>	 chasemp: I got root on gallium which hosts zuul server / zuul merger and jenkins
[17:18:05] <ostriches>	 bd808: Oh yeah I was gonna merge that.
[17:18:09] <hashar>	 chasemp: the only difference is scandium is in the labs network
[17:18:20] <chasemp>	 hashar: right ok yeah I'm going to roll on it then, de-escalation is another process entirely
[17:18:24] <hashar>	 \O/
[17:18:26] <chasemp>	 well it's in the support vlan which we said we are ok w/
[17:18:38] <_joe_>	 ostriches: throw that patch to me
[17:18:39] <_joe_>	 :P
[17:18:41] <hashar>	 chasemp: do we have any network map of labs / prod etc?
[17:18:43] <chasemp>	 we talked about this specific instance at the offsite yw! :)
[17:18:50] <ostriches>	 _joe_: https://gerrit.wikimedia.org/r/#/c/224829/
[17:18:53] * hashar loves offsites
[17:18:58] <ostriches>	 bd808: merged
[17:18:59] <chasemp>	 good question, I think andrewbogott made something for labs not sure where it is
[17:19:40] <andrewbogott>	 hashar: this isn’t quite what you asked for, but it’s what we have:  https://wikitech.wikimedia.org/wiki/Labs_infrastructure
[17:20:14] <grrrit-wm>	 (03CR) 10Rush: [C: 032] contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[17:20:19] <hashar>	 andrewbogott: nonetheless very interesting, thanks!
[17:20:26] <_joe_>	 ostriches: uhm that patch needs some work, but I don't have time to review it properly now
[17:20:37] <grrrit-wm>	 (03CR) 10Rush: [V: 032] contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[17:20:44] <ostriches>	 _joe_: What needs work?
[17:21:38] <grrrit-wm>	 (03CR) 10Dzahn: [C: 031] contint: scandium configuration [puppet] - 10https://gerrit.wikimedia.org/r/249380 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar)
[17:22:15] <_joe_>	 ostriches: permissions are granted too broadly, there is no reason not to add that permission to the scap masters only
[17:22:24] <_joe_>	 ostriches: I can fix that pretty easily
[17:22:28] <chasemp>	 mutante: so 'us	Submitted, Merge Pending' any idea why?
[17:22:42] <mutante>	 chasemp: yea, i'd agree that migration to a new host with the identical group and people doesn't need it. we should base the access on a role, then this would not be a question
[17:22:53] <mutante>	 chasemp: dependency on another change?
[17:23:08] <ostriches>	 !log deployed scap master@f823129 to cluster
[17:23:09] <mutante>	 yea, it needs https://gerrit.wikimedia.org/r/#/c/249441/2 first
[17:23:11] <bd808>	 _joe_: you are of course right, it's only needed on the masters
[17:23:11] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:23:31] <ostriches>	 bd808: Ok, master including both changes is now in prod.
[17:23:44] <bd808>	 fancy!
[17:23:49] <grrrit-wm>	 (03CR) 10Rush: [C: 032] gallium: migrate cluster/contact to hiera [puppet] - 10https://gerrit.wikimedia.org/r/249441 (owner: 10Hashar)
[17:24:32] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Looks good in general, but I'd like to have these sudo rights to be granted on the masters only. I'm unsure how we should do it though, wi" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis)
[17:25:35] <wikibugs>	 6operations, 10ops-eqiad: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1762540 (10hashar) 3NEW
[17:25:44] <wikibugs>	 6operations, 10ops-eqiad, 5Continuous-Integration-Scaling: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1762547 (10hashar)
[17:26:10] <mutante>	 hashar: there is a small diff between gallium and scandium:
[17:26:19] <mutante>	 < ssh::server::disable_nist_kex: false
[17:26:20] <mutante>	 < ssh::server::explicit_macs: false
[17:26:26] <hashar>	 chasemp: I got access to scandium thanks!
[17:26:33] <mutante>	 but that was just needed on gallium because different distro version, right
[17:26:52] <hashar>	 mutante: yeah that is to disable the new ssh keys on gallium because  Jenkins doesn't know those new algorithms
[17:27:01] <mutante>	 saying that because without that difference we could move the hiera stuff to role/common right away and forget about hostnames
[17:27:05] <mutante>	 incl. the admin group
[17:27:21] <hashar>	 mutante: I haven't checked but zuul-merger that is going to be on scandium should be fine.  Else will have to apply the same settings to scandium unfortunately :-(
[17:27:30] <mutante>	 ok
[17:28:01] <grrrit-wm>	 (03PS10) 10BryanDavis: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) 
[17:28:09] <bd808>	 _joe_: ^
[17:28:27] <wikibugs>	 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1762568 (10hashar) We got shell access thanks to ops reviews!  Will now look at the network flows. Once happy we can apply the zuul::merger role and d...
[17:28:34] <bd808>	 totally untested but ostriches can try it out in beta cluster
[17:28:45] <hashar>	 mutante: andrewbogott: chasemp: JohnFLewis: thank you for the reviews! works for me and that is the end of the day :-}
[17:29:02] <grrrit-wm>	 (03PS3) 10Dzahn: contint: install nodejs-legacy on Debian [puppet] - 10https://gerrit.wikimedia.org/r/249391 (owner: 10Hashar)
[17:29:08] <mutante>	 hashar: np, here's one more change then
[17:29:15] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] contint: install nodejs-legacy on Debian [puppet] - 10https://gerrit.wikimedia.org/r/249391 (owner: 10Hashar)
[17:29:22] <JohnFLewis>	 hashar: welcome ( I guess :P)
[17:31:07] <_joe_>	 bd808: I'll take a look after the SoS
[17:31:26] <ostriches>	 I'll test it on beta in the meantime
[17:31:30] <grrrit-wm>	 (03CR) 10Hashar: [C: 04-1] "Need rebase and have nodejs-legacy included (from https://gerrit.wikimedia.org/r/249391 )" [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar)
[17:31:50] <hashar>	 mutante: gallium is happy. ty!
[17:32:09] <mutante>	 hashar: ok,cool. good night!
[17:33:52] <icinga-wm>	 PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[17:40:51] <grrrit-wm>	 (03CR) 10EBernhardson: "we should send this out, it just slipped my mind. will put it in the afternoon swat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson)
[17:41:21] <grrrit-wm>	 (03CR) 10EBernhardson: "also we are still only sending writes to eqiad and codfw, so this wont have an immediate effect, the labs writes are only turned on for te" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson)
[17:43:15] <ebernhardson>	 :q
[17:56:07] <ostriches>	 bd808, _joe_: Tested beta with new sudo rules, co-master sync worked just fine
[17:56:48] <_joe_>	 ostriches: nice, but I need a pause after the SoS :P
[17:57:05] <ostriches>	 No worries, I'm gonna grab my 2nd coffee.
[17:58:27] <_joe_>	 after all, it's only 11 hours that I'm around :P
[17:58:35] <grrrit-wm>	 (03PS4) 10MaxSem: Switch www portals to be deployed from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) 
[17:58:48] <grrrit-wm>	 (03PS1) 10Aaron Schulz: Make mysql-multiwrite use getInstance() factory spec [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 
[17:58:59] <AaronSchulz>	 legoktm: ^
[17:59:29] <legoktm>	 AaronSchulz: will that work for the wikis that are still on wmf.3?
[17:59:31] <godog>	 !log rolling-restart cassandra-metrics-collector, staggered this time
[17:59:34] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:00:04] <jouncebot>	 twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T1800). Please do the needful.
[18:00:13] <AaronSchulz>	 legoktm: the spec change was older, but let me check if it's in 3
[18:02:53] <MaxSem>	 twentyafterfour, can I get your review on https://gerrit.wikimedia.org/r/248526 please?
[18:04:13] <icinga-wm>	 PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: puppet fail
[18:04:57] <grrrit-wm>	 (03CR) 10Mobrovac: "LGTM (won't +1 for now on account of the pending deploy)" [puppet] - 10https://gerrit.wikimedia.org/r/249399 (owner: 10Subramanya Sastry)
[18:06:23] <AaronSchulz>	 legoktm: looks like that didn't make it to wmf3
[18:08:32] <icinga-wm>	 RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[18:08:47] <grrrit-wm>	 (03CR) 10Aaron Schulz: [C: 04-2] "Blocked on needing wmf4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249457 (owner: 10Aaron Schulz)
[18:09:29] <legoktm>	 AaronSchulz: if we just want to stop the warnings, we could set 'timeout' in the array definition instead of the global
[18:12:28] <AaronSchulz>	 legoktm: I'll update mc.php
[18:12:54] <Krinkle>	 !log mwscript deleteEqualMessages.php --wiki fawiki
[18:12:57] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:13:46] <Krinkle>	 !log mwscript deleteEqualMessages.php --wiki hiwiki
[18:13:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:14:09] <grrrit-wm>	 (03PS1) 10Aaron Schulz: Set "timeout" field for "memcached-pecl" explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249463 
[18:14:37] <yurik>	 akosiaris, hi, how do i deploy tileratorui?
[18:14:43] <yurik>	 i don't see it on itn
[18:14:44] <yurik>	 tin
[18:15:47] <ostriches>	 bd808: It still won't set the mtime on ./
[18:15:53] <ostriches>	 Everything else succeeds.
[18:16:17] <ostriches>	 permissions are /identical/ on both dirs.
[18:16:24] <bd808>	 *grumble*
[18:16:41] <AaronSchulz>	 legoktm: ^
[18:17:18] <grrrit-wm>	 (03CR) 10Legoktm: [C: 031] Set "timeout" field for "memcached-pecl" explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249463 (owner: 10Aaron Schulz)
[18:18:10] <gwicke>	 !log canary deploy of restbase deploy 3b1f6488f2 to restbase1001
[18:18:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:18:29] <bd808>	 ostriches: https://serverfault.com/questions/337766/how-to-allow-multiple-people-to-change-mtime-timestamp-of-a-file-through-sftp/337810#337810
[18:19:03] <bd808>	 Apparently only the owner can set mtime
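Note on the mtime point above: on POSIX systems, setting a file's timestamps to an explicit value generally requires being the file's owner (or privileged); group write permission alone is not enough. The sketch below is only an illustration of that behaviour, not scap's actual code. The helper names set_mtime and demo, and the timestamp used, are hypothetical.

```python
import os
import tempfile


def set_mtime(path, mtime):
    """Try to set an explicit mtime on path, keeping its atime.

    Returns True on success, False if the OS refuses (typically
    because the caller does not own the path).
    """
    st = os.stat(path)
    try:
        os.utime(path, (st.st_atime, mtime))
    except PermissionError:
        return False
    return True


def demo():
    """Set an arbitrary mtime on a directory the caller owns."""
    with tempfile.TemporaryDirectory() as d:
        ok = set_mtime(d, 1446056340)  # arbitrary example timestamp
        return ok, int(os.stat(d).st_mtime)


if __name__ == "__main__":
    print(demo())
```

Run as the directory's owner this succeeds; the same call made by a different, unprivileged user with only group write access would be expected to return False, which matches the failure ostriches describes for "./".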
[18:19:27] <grrrit-wm>	 (03PS1) 10Mobrovac: RESTBase: Strip redundant headers from back-end services [puppet] - 10https://gerrit.wikimedia.org/r/249465 (https://phabricator.wikimedia.org/T116911) 
[18:19:28] <wikibugs>	 6operations, 6Commons, 10Wikimedia-Media-storage, 7Swift: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1762737 (10matmarex) I wouldn't expect the files back at all. Nobody who could do it seems to have time to investigate whether they still exist somewh...
[18:19:45] <grrrit-wm>	 (03PS1) 10coren: Labs: reenable ldap on labstore* [puppet] - 10https://gerrit.wikimedia.org/r/249466 (https://phabricator.wikimedia.org/T116927) 
[18:20:20] <Coren>	 paravoid: ^^
[18:20:56] <ostriches>	 bd808: why hasn't this come up before? We've never assumed deployer was owner
[18:20:57] <yurik>	 bd808, do you know by any chance how production is set up tileratorui?  It shares the same git repo as tilerator service, but gets deployed as a separate service (different configuration file).  I can't seem to find it on tin
[18:21:33] <bd808>	 yurik: I've never heard of it before, sorry
[18:21:49] * bd808 stays away from the nodejs world mostly
[18:22:02] <yurik>	 bd808, who sets up the tin deployment dirs?
[18:22:09] <bd808>	 puppet
[18:22:25] <ostriches>	 And whomever is deploying said service
[18:22:39] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: scap: Add co-master configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis)
[18:22:42] <yurik>	 ostriches, you mean i should manually set up that dir?
[18:22:42] <bd808>	 package with provider == trebuchet makes the initial clone
[18:22:57] <yurik>	 i guess i have to wait for akosiaris to sort it out ^
[18:23:19] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, modulo what _joe_ said re: masters" [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis)
[18:23:26] <wikibugs>	 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1762753 (10jcrespo) 5Resolved>3Open We have to drop the views on labs, too.
[18:23:39] <gwicke>	 !log rolling deploy of restbase-deploy 3b1f6488f2 to restbase cluster
[18:23:42] <ostriches>	 No puppet will set it up, I'm just saying the person who sets it up is puppet + whoever is deploying the service there. They'd be the person to ask :)
[18:23:42] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:24:35] <grrrit-wm>	 (03CR) 10BryanDavis: "The sudoer rules are only applied on masters starting in PS10" [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis)
[18:25:31] <AaronSchulz>	 legoktm: want to deploy that change?
[18:25:48] <AaronSchulz>	 or I could go to my other laptop that has my ssh key
[18:27:25] <ori>	 you need a bumper sticker
[18:27:30] <ori>	 "my other laptop has an ssh key"
[18:28:01] <YuviPanda>	 heh
[18:28:33] <bblack>	 "my other ssh key has 5-factor authentication"
[18:28:57] <wikibugs>	 6operations, 6Commons, 10Wikimedia-Media-storage, 7Swift: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1762759 (10aaron) It would either be at the old or the new name. If it's at either then it can be accessed via the right URL (the tricky part is the /...
[18:29:03] <ostriches>	 bblack: One of my factors involves my dog
[18:29:05] <grrrit-wm>	 (03PS1) 10Yuvipanda: admin: Prune unused key [puppet] - 10https://gerrit.wikimedia.org/r/249468 
[18:29:07] <andre__>	 Heh. Google once had "my other machine is a datacenter" stickers.
[18:30:32] <icinga-wm>	 RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[18:31:16] <legoktm>	 AaronSchulz: yeah I can do it
[18:32:06] <legoktm>	 twentyafterfour: have you started the train yet?
[18:32:52] <grrrit-wm>	 (03CR) 10Dzahn: "let me add Moritz here first. we are in the process of switching to debdeploy for upgrades and i think this will conflict with alternate m" [puppet] - 10https://gerrit.wikimedia.org/r/243925 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar)
[18:33:10] <yurik>	 legoktm, git deploying maps, almost done
[18:33:59] <yurik>	 !log updated kartotherian & tilerator services
[18:34:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:34:10] <legoktm>	 ok
[18:34:15] <yurik>	 legoktm, done
[18:34:24] <legoktm>	 I'll assume twentyafterfour hasn't started yet
[18:34:34] <grrrit-wm>	 (03CR) 10Legoktm: [C: 032] Set "timeout" field for "memcached-pecl" explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249463 (owner: 10Aaron Schulz)
[18:34:50] <gwicke>	 !log restbase deploy done
[18:34:53] <icinga-wm>	 RECOVERY - Host labstore1002 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms
[18:34:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:34:56] <grrrit-wm>	 (03Merged) 10jenkins-bot: Set "timeout" field for "memcached-pecl" explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249463 (owner: 10Aaron Schulz)
[18:36:08] <legoktm>	 IOError: [Errno 2] No such file or directory: 'scap-masters'
[18:36:08] <legoktm>	 18:36:01 sync-file failed: <IOError> [Errno 2] No such file or directory: 'scap-masters'
[18:36:12] <legoktm>	 ostriches: ^ ?
[18:36:54] <ostriches>	 Blech.
[18:37:53] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis)
[18:38:30] <legoktm>	 ostriches: is that a quick fix or should I revert the undeployed change?
[18:38:33] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Labs: reenable ldap on labstore* [puppet] - 10https://gerrit.wikimedia.org/r/249466 (https://phabricator.wikimedia.org/T116927) (owner: 10coren)
[18:38:53] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is inactive
[18:39:06] <ostriches>	 legoktm: Quick rollback, yes. But trebuchet stole ownership of some files and I can't!
[18:39:14] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is inactive
[18:39:18] <YuviPanda>	 hmm
[18:39:30] <YuviPanda>	 I assume those checks will go away once puppet runs
[18:39:32] <YuviPanda>	 on labstore1002
[18:39:35] <YuviPanda>	 since it's not the active host
[18:39:45] <YuviPanda>	 I'll look at puppet to make sure
[18:39:47] <Coren>	 Yep.  Been switched in hiera.
[18:39:48] <legoktm>	 ostriches: uh, so we need a root?
[18:40:07] <wikibugs>	 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1762788 (10jcrespo) Involving @Coren because of my last comment (this time, no pressure).
[18:40:19] <ostriches>	 legoktm: Worked around it.
[18:40:24] <ostriches>	 Yay git-fu!
[18:40:39] <legoktm>	 ok, should I try syncing now?
[18:40:49] <ostriches>	 1 sec....
[18:41:17] <ostriches>	 should be good
[18:41:44] <ostriches>	 !log rolled scap back to master@62a250a, needs puppet changes before new code goes live
[18:41:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:41:57] <grrrit-wm>	 (03PS3) 10Andrew Bogott: openstack: add links to docs for components, lint [puppet] - 10https://gerrit.wikimedia.org/r/249342 (owner: 10Dzahn)
[18:42:14] * legoktm tries again
[18:42:29] <logmsgbot>	 !log legoktm@tin Synchronized wmf-config/mc.php: https://gerrit.wikimedia.org/r/#/c/249463/ (duration: 00m 17s)
[18:42:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:42:37] <legoktm>	 ostriches: thanks
[18:42:40] <legoktm>	 AaronSchulz: log spam stopped
[18:43:16] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 031] openstack: add links to docs for components, lint [puppet] - 10https://gerrit.wikimedia.org/r/249342 (owner: 10Dzahn)
[18:47:42] <AaronSchulz>	 legoktm: thanks
[18:50:33] <Krinkle>	 !log mwscript deleteEqualMessages.php --wiki itwikiversity
[18:50:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:52:23] <grrrit-wm>	 (03PS1) 10Cmjohnson: Updating dhcp address for ms-be109 [puppet] - 10https://gerrit.wikimedia.org/r/249474 
[18:53:58] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032] Updating dhcp address for ms-be109 [puppet] - 10https://gerrit.wikimedia.org/r/249474 (owner: 10Cmjohnson)
[18:56:14] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407#1762833 (10chasemp) 5Open>3declined a:3chasemp >>! In T90407#1586541, @revi wrote: >>>! In T90407#1555920, @Dzahn wrote: >> Aren't all the public archives on gmane.org anyways an...
[18:58:16] <grrrit-wm>	 (03PS4) 10Dzahn: beta: point parsoid back to source code [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar)
[18:58:33] <icinga-wm>	 PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Puppet has 24 failures
[18:58:43] <YuviPanda>	 hmm
[18:58:53] <YuviPanda>	 so now codfw labstores also have LDAP
[18:59:01] <YuviPanda>	 which might or might not have been expected behavior
[18:59:14] * YuviPanda goes to look
[18:59:45] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1762849 (10chasemp) 5Open>3declined There aren't resources or clear list participant interest across the scope of lists for this. We also don't have the manpower to keep...
[18:59:48] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "only touches the beta class (yes, separate from prod here as well) and per hashar already "cherry picked on integration puppetmaster"" [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar)
[19:00:17] <mutante>	 YuviPanda: that seems to be what Coren said in -labs
[19:00:46] <YuviPanda>	 mutante: I think it might've been unintentional since it also has admin
[19:02:47] <paravoid>	 YuviPanda: duh
[19:02:58] <paravoid>	 yeah we can put that include under an if $::site guard
[19:03:05] <paravoid>	 and undo the changes manually
[19:03:07] <YuviPanda>	 yeah
[19:03:12] <YuviPanda>	 after this dies down I guess
[19:04:11] <twentyafterfour>	 legoktm: no train yet, ready to go now if nothing is blocking it
[19:04:19] <legoktm>	 twentyafterfour: yeah, I finished
[19:04:21] <twentyafterfour>	 MaxSem: review coming up
[19:06:18] <jynus>	 there was like a huge spike of db errors on testwikidatawiki at 15:20-15:25, is this known?
[19:06:28] <Krenair>	 aude, ^
[19:06:48] <ebernhardson>	 should be, aude deployed then undeployed a patch in swat this morning
[19:07:12] <jynus>	 ok, it is a test wiki because of that
[19:07:42] <jynus>	 so all good
[19:09:07] <jynus>	 I see it on SAL, yes
[19:09:42] <grrrit-wm>	 (03PS1) 10BBlack: cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 
[19:10:18] <grrrit-wm>	 (03PS5) 10Ottomata: Abstract rsync classes into a define, correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal)
[19:10:25] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 (owner: 10BBlack)
[19:11:02] <YuviPanda>	 bd808: thcipriani new 'deployment started' gif? https://shogofawafa.files.wordpress.com/2015/08/tumblr_npnpkttr9b1tqtfrjo1_500.gif?w=700&h=438
[19:11:21] <bd808>	 no cats
[19:11:31] <YuviPanda>	 we can make that a pig
[19:11:46] <bd808>	 that would be acceptable :)
[19:11:47] <thcipriani>	 ^ seems like a good compromise.
[19:11:50] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Abstract rsync classes into a define, correct pageview to dumps synchro, add projectview [puppet] - 10https://gerrit.wikimedia.org/r/249378 (owner: 10Joal)
[19:12:10] <grrrit-wm>	 (03PS6) 10Ottomata: Aggregate from projectviews-*, not projectcounts-* [puppet] - 10https://gerrit.wikimedia.org/r/247458 (https://phabricator.wikimedia.org/T114379) (owner: 10Milimetric)
[19:12:17] <grrrit-wm>	 (03PS2) 10BBlack: cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 
[19:12:19] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032 V: 032] Aggregate from projectviews-*, not projectcounts-* [puppet] - 10https://gerrit.wikimedia.org/r/247458 (https://phabricator.wikimedia.org/T114379) (owner: 10Milimetric)
[19:12:42] <grrrit-wm>	 (03PS2) 10Ottomata: Alert about the status of pageview and projectview [puppet] - 10https://gerrit.wikimedia.org/r/247608 (owner: 10Milimetric)
[19:12:53] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 (owner: 10BBlack)
[19:12:55] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032 V: 032] Alert about the status of pageview and projectview [puppet] - 10https://gerrit.wikimedia.org/r/247608 (owner: 10Milimetric)
[19:14:09] <wikibugs>	 10Ops-Access-Requests, 6operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1762943 (10chasemp) p:5Triage>3Normal
[19:14:49] <wikibugs>	 10Ops-Access-Requests, 6operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1760888 (10chasemp) is this the actual access request or an outline for greater work?
[19:17:05] <grrrit-wm>	 (03PS1) 10Ottomata: Enable varnishreqstats on all misc and mobile hosts [puppet] - 10https://gerrit.wikimedia.org/r/249479 (https://phabricator.wikimedia.org/T83580) 
[19:18:22] <grrrit-wm>	 (03PS1) 10Jcrespo: Enable performance_schema on db1065 [puppet] - 10https://gerrit.wikimedia.org/r/249480 
[19:18:51] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1762964 (10chasemp) So this only applies to:  Researchers  https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/admin/data/data.yaml;cfa2ef23a4429ab1...
[19:19:32] <wikibugs>	 6operations, 7Database, 5Patch-For-Review: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1762965 (10jcrespo)
[19:19:51] <icinga-wm>	 PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:20:14] <grrrit-wm>	 (03PS2) 10Ottomata: Enable varnishreqstats on all misc and mobile hosts [puppet] - 10https://gerrit.wikimedia.org/r/249479 (https://phabricator.wikimedia.org/T83580) 
[19:20:36] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1762978 (10chasemp) https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/admin/data/data.yaml;cfa2ef23a4429ab12df33b0cd0...
[19:20:50] <grrrit-wm>	 (03PS5) 1020after4: Switch www portals to be deployed from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem)
[19:20:51] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:20:55] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Nova.conf: Catch up with some changes to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/249482 (https://phabricator.wikimedia.org/T116935) 
[19:21:09] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:21:15] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Enable varnishreqstats on all misc and mobile hosts [puppet] - 10https://gerrit.wikimedia.org/r/249479 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata)
[19:21:31] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1762981 (10chasemp) >>! In T116487#1754582, @chasemp wrote: >>>! In T116487#1754144, @greg wrote: >> Jan: can you put in the description why you are requesting access? :) >  > Th...
[19:21:51] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Nova.conf: Catch up with some changes to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/249482 (https://phabricator.wikimedia.org/T116935) 
[19:21:55] <grrrit-wm>	 (03CR) 10Jcrespo: "This can be deployed at any time, as it requires mysql reboot to take effect, but let's wait for a better window for me." [puppet] - 10https://gerrit.wikimedia.org/r/249480 (owner: 10Jcrespo)
[19:22:01] <grrrit-wm>	 (03CR) 1020after4: [C: 032] "Straightforward enough." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) (owner: 10MaxSem)
[19:22:19] <icinga-wm>	 PROBLEM - RAID on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:31] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint, 5Patch-For-Review: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1762982 (10chasemp)
[19:22:39] <icinga-wm>	 PROBLEM - Check size of conntrack table on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:40] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:59] <icinga-wm>	 PROBLEM - configured eth on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:22:59] <icinga-wm>	 PROBLEM - DPKG on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:10] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Nova.conf: Catch up with some changes to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/249482 (https://phabricator.wikimedia.org/T116935) (owner: 10Andrew Bogott)
[19:23:10] <icinga-wm>	 PROBLEM - dhclient process on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:14] <wikibugs>	 10Ops-Access-Requests, 6operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1762984 (10Dzahn) it's the actual request, it's a follow-up to T115718 . that can be considered the parent task while this here is an addition that is needed and should have...
[19:23:20] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint, 5Patch-For-Review: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1473834 (10chasemp) There is no attached actionable access to grant here.  It seems we are in a discussion phase.  I am removing the tag for th...
[19:23:20] <icinga-wm>	 PROBLEM - Hadoop DataNode on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:30] <icinga-wm>	 PROBLEM - puppet last run on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:34] <ottomata>	 hm
[19:23:39] <icinga-wm>	 PROBLEM - SSH on analytics1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:23:39] <wikibugs>	 10Ops-Access-Requests, 6operations: let datacenter-ops sign puppet certs and accept salt keys - https://phabricator.wikimedia.org/T116884#1762988 (10Dzahn)
[19:23:40] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1730969 (10Dzahn)
[19:23:40] <icinga-wm>	 PROBLEM - salt-minion processes on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:40] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:42] <wikibugs>	 6operations, 7Easy, 7HTTPS: WMF-Last-Access cookies doesn't set Secure flag - https://phabricator.wikimedia.org/T105451#1762990 (10chasemp) p:5Triage>3Normal
[19:23:51] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:23:51] <wikibugs>	 6operations, 10Traffic, 7Easy, 7HTTPS: WMF-Last-Access cookies doesn't set Secure flag - https://phabricator.wikimedia.org/T105451#1444142 (10chasemp)
[19:24:15] <ottomata>	 woo wee cluster is busy backfilling
[19:24:27] <ottomata>	 joal:  look at that busy hadoop
[19:24:27] <ottomata>	 http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Analytics%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false
[19:24:40] <wikibugs>	 6operations, 7Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1762997 (10chasemp) p:5Triage>3Normal
[19:24:49] <icinga-wm>	 PROBLEM - DPKG on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:24:49] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING
[19:24:49] <icinga-wm>	 PROBLEM - configured eth on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:24:50] <icinga-wm>	 PROBLEM - Check size of conntrack table on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:24:58] <wikibugs>	 6operations, 10Deployment-Systems: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1763007 (10chasemp) p:5Triage>3Normal
[19:25:02] <joal>	 That's what we want :) machines to do stuff ottomata :)
[19:25:13] <wikibugs>	 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1759278 (10chasemp)
[19:25:31] <wikibugs>	 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1759278 (10chasemp) at #release-Engineering-Team please advise :)
[19:25:47] <wikibugs>	 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: PID not expanded in heap dumps - https://phabricator.wikimedia.org/T116814#1763014 (10chasemp) p:5Triage>3Normal
[19:25:57] <joal>	 ottomata: anything you want me to stop ?
[19:25:57] <wikibugs>	 6operations, 10Traffic, 5Patch-For-Review: Split HTCP multicast addresses - https://phabricator.wikimedia.org/T116752#1763016 (10chasemp) p:5Triage>3Normal
[19:26:03] <ottomata>	 joal:  naw, it's fine
[19:26:08] <wikibugs>	 6operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#1763020 (10chasemp) p:5Triage>3Normal
[19:26:17] <ottomata>	 we just get some (obviously) false alarms because icinga/nrpe times out the checks
[19:26:17] <joal>	 ok cool :)
[19:26:19] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[19:26:19] <ottomata>	 when the cluster is busy
[19:26:20] <wikibugs>	 6operations: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#1763022 (10chasemp) p:5Triage>3Normal
[19:26:21] <icinga-wm>	 RECOVERY - DPKG on analytics1032 is OK: All packages OK
[19:26:29] <icinga-wm>	 RECOVERY - configured eth on analytics1032 is OK: OK - interfaces up
[19:26:29] <icinga-wm>	 RECOVERY - Check size of conntrack table on analytics1032 is OK: OK: nf_conntrack is 0 % full
[19:26:37] <wikibugs>	 6operations: Track amount of package updates on systems - https://phabricator.wikimedia.org/T116742#1763024 (10chasemp) p:5Triage>3Normal
[19:26:50] <wikibugs>	 6operations, 6Analytics-Backlog, 10Datasets-General-or-Unknown, 10Wikidata: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1763027 (10chasemp) p:5Triage>3Normal
[19:26:50] <icinga-wm>	 PROBLEM - Disk space on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:27:45] <wikibugs>	 6operations, 10ops-eqiad, 5Continuous-Integration-Scaling: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1763049 (10chasemp) 5Open>3stalled Let's wait on this until we have a fully realized and migrated solution here just in case so we don't end up in a "oh wait...
[19:27:51] <wikibugs>	 6operations, 10ops-eqiad, 5Continuous-Integration-Scaling: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1763052 (10chasemp) p:5Triage>3Lowest
[19:29:38] <grrrit-wm>	 (03CR) 10Rush: "thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 (owner: 10EBernhardson)
[19:31:08] <grrrit-wm>	 (03PS1) 10Dzahn: admin: let dc-ops sign puppet certs, add salt keys [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) 
[19:31:21] <grrrit-wm>	 (03PS2) 10Yuvipanda: admin: Prune unused key [puppet] - 10https://gerrit.wikimedia.org/r/249468 
[19:31:23] <grrrit-wm>	 (03PS1) 10Yuvipanda: labstore: Include LDAP only in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/249484 
[19:32:00] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:33:49] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[19:34:09] <grrrit-wm>	 (03PS2) 10Dzahn: admin: let dc-ops sign puppet certs, add salt keys [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) 
[19:35:03] <wikibugs>	 10Ops-Access-Requests, 6operations: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1763098 (10Dzahn)
[19:35:33] <wikibugs>	 10Ops-Access-Requests, 6operations: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1730969 (10Dzahn) continued here: T116884: let datacenter-ops sign puppet certs and accept salt keys
[19:36:39] <twentyafterfour>	 Undefined index: timeout in /srv/mediawiki/php-1.27.0-wmf.4/includes/objectcache/MemcachedPeclBagOStuff.php on line 86
[19:36:55] <twentyafterfour>	 anyone know what might have changed that would cause the objectcache to get initialized without the timeout arg?
[19:38:33] <grrrit-wm>	 (03CR) 10Papaul: [C: 031] admin: let dc-ops sign puppet certs, add salt keys [puppet] - 10https://gerrit.wikimedia.org/r/249483 (https://phabricator.wikimedia.org/T116884) (owner: 10Dzahn)
[19:39:19] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] admin: Prune unused key [puppet] - 10https://gerrit.wikimedia.org/r/249468 (owner: 10Yuvipanda)
[19:39:30] <icinga-wm>	 RECOVERY - configured eth on analytics1038 is OK: OK - interfaces up
[19:39:30] <icinga-wm>	 RECOVERY - DPKG on analytics1038 is OK: All packages OK
[19:39:50] <icinga-wm>	 RECOVERY - Disk space on analytics1038 is OK: DISK OK
[19:39:50] <icinga-wm>	 RECOVERY - dhclient process on analytics1038 is OK: PROCS OK: 0 processes with command name dhclient
[19:40:11] <icinga-wm>	 RECOVERY - SSH on analytics1038 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[19:40:12] <icinga-wm>	 RECOVERY - salt-minion processes on analytics1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:40:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[19:40:30] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on analytics1038 is OK: DISK OK
[19:41:09] <icinga-wm>	 RECOVERY - Check size of conntrack table on analytics1038 is OK: OK: nf_conntrack is 0 % full
[19:41:10] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:41:50] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[19:41:59] <icinga-wm>	 RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 43 minutes ago with 0 failures
[19:42:30] <icinga-wm>	 RECOVERY - RAID on analytics1038 is OK: OK: optimal, 13 logical, 14 physical
[19:43:01] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[19:45:39] <icinga-wm>	 RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[19:48:29] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING
[19:49:28] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint, 5Patch-For-Review: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1763165 (10Yurik) @dzahn, I am very happy to get all the help I could get from Ops, but as our platform grows, each service will require ops a...
[19:50:14] <grrrit-wm>	 (03PS3) 10BBlack: cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 
[19:51:13] <twentyafterfour>	 ok train for group1 is about to go out
[19:51:20] <grrrit-wm>	 (03PS1) 1020after4: group1 wikis to 1.27.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249486 
[19:51:31] <grrrit-wm>	 (03PS1) 10John F. Lewis: mailman: reject subscriptions from disabled list [puppet] - 10https://gerrit.wikimedia.org/r/249487 
[19:51:32] <wikibugs>	 6operations, 10Traffic, 10netops: drain ULSFO of all traffic on 2015-11-02 before 0900 PST - https://phabricator.wikimedia.org/T116928#1763167 (10faidon)
[19:51:45] <grrrit-wm>	 (03CR) 1020after4: [C: 032] group1 wikis to 1.27.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249486 (owner: 1020after4)
[19:51:51] <grrrit-wm>	 (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249486 (owner: 1020after4)
[19:52:04] <grrrit-wm>	 (03PS2) 10John F. Lewis: mailman: reject subscriptions from disabled list [puppet] - 10https://gerrit.wikimedia.org/r/249487 
[19:52:07] <logmsgbot>	 !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.4
[19:52:15] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:53:19] <grrrit-wm>	 (03PS4) 10BBlack: cipher_sim utility script for manual use [puppet] - 10https://gerrit.wikimedia.org/r/249477 
[19:54:41] <grrrit-wm>	 (03PS3) 10John F. Lewis: mailman: reject subscriptions from disabled list [puppet] - 10https://gerrit.wikimedia.org/r/249487 (https://phabricator.wikimedia.org/T116560) 
[19:54:53] <grrrit-wm>	 (03PS4) 10John F. Lewis: mailman: reject subscriptions from disabled list [puppet] - 10https://gerrit.wikimedia.org/r/249487 (https://phabricator.wikimedia.org/T116560) 
[19:56:30] <wikibugs>	 6operations, 10Traffic, 10netops: drain ULSFO of all traffic on 2015-11-02 before 0900 PST - https://phabricator.wikimedia.org/T116928#1763183 (10faidon) Confirmed that 2015-11-02 17:00 UTC works.  I'll depool ulsfo a few hours earlier. Please confirm with me before the start of the window and beginning the...
[20:00:01] <grrrit-wm>	 (03PS5) 10John F. Lewis: mailman: reject subscriptions to disabled list [puppet] - 10https://gerrit.wikimedia.org/r/249487 (https://phabricator.wikimedia.org/T116560) 
[20:00:04] <jouncebot>	 gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T2000).
[20:00:28] <subbu>	 nothing to deploy today for parsoid
[20:00:40] <bearND>	 no mobileapps deploy today
[20:03:09] <grrrit-wm>	 (03CR) 10Rush: [C: 032] "seems sound" [puppet] - 10https://gerrit.wikimedia.org/r/249487 (https://phabricator.wikimedia.org/T116560) (owner: 10John F. Lewis)
[20:03:14] <grrrit-wm>	 (03PS1) 10Merlijn van Deen: toollabs: use 'fastapt' provider for packages [puppet] - 10https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) 
[20:03:17] <valhallasw`cloud>	 YuviPanda: ^ :-)
[20:03:35] <valhallasw`cloud>	 (will probably get a -1 from jenkins, but hey)
[20:04:18] <YuviPanda>	 :D
[20:05:04] <YuviPanda>	 valhallasw`cloud: we should probably not make it default for anything, and explicitly specify provider in both exec and dev environ to start with
[20:05:25] <valhallasw`cloud>	 Package { provider => 'fastapt' } ? sure.
[20:05:34] <YuviPanda>	 yeah
[20:05:46] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407#1763207 (10Dzahn) I would like to add this:  I went to the upstream mailman IRC channel and asked about a feature that just lets list admins download the .mbox files of their own list...
[20:05:54] <valhallasw`cloud>	 oh, that also makes it easier to test on toolsbeta
[20:05:56] <valhallasw`cloud>	 ok, good idea
[20:07:27] <wikibugs>	 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1763210 (10saper) As per https://lists.wikimedia.org/pipermail/wikitech-l/2015-October/083752.html I'd like to sort #mediawiki-installer bugs...
[20:08:00] <valhallasw`cloud>	 let me also add some docs
[20:09:20] <YuviPanda>	 valhallasw`cloud: \o/ cool
[20:09:23] <grrrit-wm>	 (03PS1) 10Hashar: cache: vary statsd_server with hiera [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) 
[20:10:49] <wikibugs>	 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1763228 (10chasemp) @saper  sprint projects specifically being short lived and hopefully archived after a set period have been a bit of a grey...
[20:10:51] <grrrit-wm>	 (03CR) 10Hashar: "I am not sure how many metrics it is going to add to labmon1001.eqiad.wmnet :-\" [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar)
[20:11:05] <wikibugs>	 6operations: Reprepro should bail if it can't read and sign using the root keys - https://phabricator.wikimedia.org/T116951#1763230 (10MoritzMuehlenhoff) An annoying side aspect is that the files get still copied to the pool, but the Packages file isn't updated along.
[20:11:09] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[20:11:39] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[20:12:51] <wikibugs>	 6operations, 10ops-eqiad, 5Continuous-Integration-Scaling: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#1763238 (10hashar) Good call. We never know :-)
[20:13:11] <grrrit-wm>	 (03PS1) 10coren: Labs: remove admin from labstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/249493 
[20:13:15] <Coren>	 YuviPanda: ^^
[20:13:46] <YuviPanda>	 chasemp: paravoid ^
[20:13:55] <YuviPanda>	 also I think that'll need manual cleanup too
[20:14:27] <valhallasw`cloud>	 YuviPanda: I suggest testing on toolsbeta first
[20:14:29] <Coren>	 Aw, you're probably right.  I'm pretty sure that there is no provision to => absent the passwd entries.
[20:14:39] <YuviPanda>	 Coren: no
[20:14:54] <chasemp>	 YuviPanda: Coren is this still the problem I proposed https://docs.puppetlabs.com/references/latest/type.html#user-attribute-forcelocal for previously?
[20:14:56] <paravoid>	 it's more complicated than that
[20:15:01] <paravoid>	 # FIXME: this is an intentional hard stop as before T84032
[20:15:02] <paravoid>	 if [[ `hostname -s` =~ ^labstore100 ]]; then exit 1
[20:15:02] <paravoid>	 fi
[20:15:25] <Coren>	 Ah.
[20:15:27] <paravoid>	 are we ever realistically going to use 2001 as an NFS server to labs-eqiad?
[20:15:39] <YuviPanda>	 I don't actually think so.
[20:15:43] <YuviPanda>	 it's just our backup dest
[20:15:46] <paravoid>	 right
[20:15:48] <YuviPanda>	 should probably not even be using the same roles
[20:15:54] <YuviPanda>	 or be called the same thing
[20:15:54] <chasemp>	 we should strive not to tbh
[20:15:55] <paravoid>	 so let's just remove all this ldap crap
[20:16:00] <YuviPanda>	 yeah
[20:16:02] <Coren>	 paravoid: I don't believe we will.  It's a backup target, and may become useful if we ever have labs-like in codfw
[20:16:17] <YuviPanda>	 so that's what https://gerrit.wikimedia.org/r/#/c/249484/ does but I'm not sure what all needs to be manually removed
[20:16:35] <Coren>	 YuviPanda: I've pulled out a list of packages in -operations not long ago.
[20:16:50] <grrrit-wm>	 (03PS2) 10Merlijn van Deen: toollabs: install 'fastapt' provider for packages [puppet] - 10https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) 
[20:16:56] <paravoid>	 better check 'grep puppet /var/log/syslog' for what exactly it did and undo it
[20:16:59] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Wikidata: Requesting access to researchers for Addshore - https://phabricator.wikimedia.org/T116784#1763240 (10JanZerebecki) Yes. (It is a superset of the group statistics-users.)
[20:16:59] <grrrit-wm>	 (03Abandoned) 10coren: Labs: remove admin from labstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/249493 (owner: 10coren)
[20:17:45] <wikibugs>	 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1763243 (10hashar) The original commit is from Nov 30, 2012. I think at one point the idea was to publish the state of the repos on the de...
[20:18:53] <YuviPanda>	 ah yeah I see diffs
[20:18:56] <YuviPanda>	 so we can undo those
[20:19:00] <YuviPanda>	 not sure what to do about the local accounts
[20:19:10] <paravoid>	 the local accounts were there before
[20:19:12] <paravoid>	 and should stay
[20:19:16] <chasemp>	 there is a last run report in /var/lib/puppet
[20:19:22] <chasemp>	 I think
[20:19:24] <grrrit-wm>	 (03CR) 10coren: [C: 031] "Yep." [puppet] - 10https://gerrit.wikimedia.org/r/249484 (owner: 10Yuvipanda)
[20:19:28] <YuviPanda>	 oh right ofc
[20:20:05] <wikibugs>	 6operations, 10Deployment-Systems, 6Release-Engineering-Team: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1763254 (10chasemp) 5Open>3Resolved a:3chasemp
[20:22:01] <wikibugs>	 6operations, 10Beta-Cluster-Infrastructure, 7Blocked-on-RelEng, 7HHVM, 5Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1763256 (10chasemp)
[20:23:25] <YuviPanda>	 chasemp: yea but there's been more runs since then I guess
[20:23:32] <YuviPanda>	 chasemp: I found diffs in /var/log/puppet.log
[20:23:46] <wikibugs>	 6operations: upgrade radium to jessie - https://phabricator.wikimedia.org/T116963#1763259 (10Dzahn) 3NEW a:3Dzahn
[20:23:54] <wikibugs>	 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1763266 (10saper) In general I find it difficult to grasp the project with more than > 50 tasks on the board. (Maybe others can:)  By grasping...
[20:24:15] <wikibugs>	 6operations: build newer tor packages - https://phabricator.wikimedia.org/T116964#1763267 (10Dzahn) 3NEW a:3Dzahn
[20:27:39] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[20:28:06] <mutante>	 hashar: "end of the day" extended?:)
[20:28:19] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[20:28:21] <grrrit-wm>	 (03PS2) 10Yuvipanda: labstore: Include LDAP only in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/249484 
[20:29:03] <YuviPanda>	 !log disable puppet on labstore1001
[20:29:06] <hashar>	 mutante: yeah now it is the start of the evening :-}
[20:29:09] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:29:30] <wikibugs>	 6operations, 10Architecture, 10Incident-20150423-Commons, 10MediaWiki-RfCs, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1763296 (10Krinkle)
[20:29:32] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] labstore: Include LDAP only in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/249484 (owner: 10Yuvipanda)
[20:29:51] <mutante>	 hashar: should we do https://gerrit.wikimedia.org/r/#/c/244498/ ?
[20:30:02] <mutante>	 because of the restart part etc
[20:30:14] <mutante>	 "reload of the replication plugin"
[20:31:40] <grrrit-wm>	 (03CR) 10Dereckson: "Currently, every key of this array is in CommonSettings.php, except one in proofreadpage.php." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson)
[20:39:18] <grrrit-wm>	 (03PS4) 10Dzahn: openstack: add links to docs for components, lint [puppet] - 10https://gerrit.wikimedia.org/r/249342 
[20:40:44] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "most of it is only comments, and the rest are harmless changes like whitespace and alignment. it does fix a whole bunch of warnings thoug" [puppet] - 10https://gerrit.wikimedia.org/r/249342 (owner: 10Dzahn)
[20:46:11] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:46:43] <grrrit-wm>	 (03CR) 10Krinkle: "Hm.. wouldn't the following work without the additional non-standard wmg logic?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson)
[20:47:59] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING
[20:48:24] <grrrit-wm>	 (03PS1) 10Yurik: Allow same perms to tileratorui as tilerator [puppet] - 10https://gerrit.wikimedia.org/r/249501 
[20:55:48] <YuviPanda>	 !log reverted changes to nsswitch.conf from puppet run manually on labstore2001
[20:55:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:57:00] <icinga-wm>	 PROBLEM - Check size of conntrack table on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:57:11] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:57:41] <legoktm>	 jouncebot: next
[20:57:41] <jouncebot>	 In 2 hour(s) and 2 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T2300)
[20:59:19] <icinga-wm>	 RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:01:02] <icinga-wm>	 PROBLEM - DPKG on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:01:02] <icinga-wm>	 PROBLEM - configured eth on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:01:21] <icinga-wm>	 PROBLEM - Disk space on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:01:21] <icinga-wm>	 PROBLEM - dhclient process on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:01:39] <icinga-wm>	 PROBLEM - Hadoop DataNode on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:01:50] <icinga-wm>	 PROBLEM - puppet last run on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:01:51] <icinga-wm>	 PROBLEM - salt-minion processes on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:01:51] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:01:59] <icinga-wm>	 PROBLEM - SSH on analytics1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:02:01] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:02:39] <icinga-wm>	 RECOVERY - Check size of conntrack table on analytics1038 is OK: OK: nf_conntrack is 0 % full
[21:02:50] <icinga-wm>	 RECOVERY - configured eth on analytics1038 is OK: OK - interfaces up
[21:02:50] <icinga-wm>	 RECOVERY - DPKG on analytics1038 is OK: All packages OK
[21:03:09] <icinga-wm>	 RECOVERY - Disk space on analytics1038 is OK: DISK OK
[21:03:10] <icinga-wm>	 RECOVERY - dhclient process on analytics1038 is OK: PROCS OK: 0 processes with command name dhclient
[21:03:20] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[21:03:30] <icinga-wm>	 RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 35 minutes ago with 0 failures
[21:03:39] <icinga-wm>	 RECOVERY - salt-minion processes on analytics1038 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:03:40] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[21:06:39] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING
[21:07:59] <icinga-wm>	 PROBLEM - RAID on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:08:21] <YuviPanda>	 !log enable puppet on labstore1001
[21:08:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:09:20] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on analytics1038 is OK: DISK OK
[21:09:40] <icinga-wm>	 RECOVERY - RAID on analytics1038 is OK: OK: optimal, 13 logical, 14 physical
[21:10:11] <wikibugs>	 operations, Labs: Cleanup / clarify labstore2001 - https://phabricator.wikimedia.org/T116972#1763469 (yuvipanda) NEW
[21:10:44] <grrrit-wm>	 (CR) JanZerebecki: [C: 1] admin: hoo and jzerebecki for wdqs admins [puppet] - https://gerrit.wikimedia.org/r/249027 (https://phabricator.wikimedia.org/T116702) (owner: Dzahn)
[21:11:01] <icinga-wm>	 RECOVERY - SSH on analytics1038 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[21:11:33] <logmsgbot>	 !log legoktm@tin Synchronized php-1.27.0-wmf.4/includes/changes/EnhancedChangesList.php: Fix diff/history links not showing up for ungrouped enhanced RC - https://gerrit.wikimedia.org/r/#/c/249556/ (duration: 00m 19s)
[21:11:39] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:13:39] <grrrit-wm>	 (CR) Dzahn: "i don't know how we can achieve these goals at the same time: a) kill global ./files/ b) move everything into a module structure c) not" [puppet] - https://gerrit.wikimedia.org/r/249056 (owner: Dzahn)
[21:16:25] <grrrit-wm>	 (Abandoned) Dzahn: (WIP) maps: move roles into autoloader layout [puppet] - https://gerrit.wikimedia.org/r/249059 (owner: Dzahn)
[21:17:41] <grrrit-wm>	 (Abandoned) Dzahn: logstash: move files from ./files to module [puppet] - https://gerrit.wikimedia.org/r/249062 (owner: Dzahn)
[21:20:23] <grrrit-wm>	 (CR) Dzahn: "ok, after reading the comments on the other similar change... move them all into the role module is the right answer i suppose" [puppet] - https://gerrit.wikimedia.org/r/249056 (owner: Dzahn)
[21:23:48] <wikibugs>	 operations, Analytics, Analytics-Cluster, Fundraising Tech Backlog, Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1763543 (MeganHernandez_WMF) We have used landing page impressions in the past and I think we do use this for e...
[21:30:03] <wikibugs>	 operations, Traffic, Patch-For-Review: Split HTCP multicast addresses - https://phabricator.wikimedia.org/T116752#1763580 (BBlack) Just status update on where this is at: maps is ready to take off using its own multicast address.  Upload is listening to both its old (legacy shared) and new multicast fo...
[21:33:20] <wikibugs>	 Blocked-on-Operations, operations, Discovery, Maps, and 3 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1763585 (BBlack) p:Triage>Normal
[21:33:54] <wikibugs>	 Blocked-on-Operations, operations, Discovery, Maps, and 3 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1763588 (BBlack) a:BBlack>Yurik re-assigning to Yuri to actually implement the HTCP-sending part so we can observe whether it works
[21:34:19] <wikibugs>	 operations, Traffic, Patch-For-Review: Split HTCP multicast addresses - https://phabricator.wikimedia.org/T116752#1763596 (BBlack)
[21:34:20] <wikibugs>	 Blocked-on-Operations, operations, Discovery, Maps, and 3 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1763595 (BBlack)
[21:34:23] <grrrit-wm>	 (CR) Tim Landscheidt: "This makes the setup more complicated for 50 seconds saved time. Puppet is run automatically in the background and its runtime not that i" [puppet] - https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) (owner: Merlijn van Deen)
[21:34:38] <wikibugs>	 operations, Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1763599 (Nemo_bis) declined>Open There is no need of ops manpower for this task.
[21:34:42] <bawolff>	 Would anyone be able to tell me what the version of the vips binary is on the image scalers?
[21:35:46] <legoktm>	 bawolff: which servers are those?
[21:36:42] <bawolff>	 mw1153, mw1154 I think
[21:37:21] <legoktm>	 mw1153 is role::mediawiki::imagescaler
[21:37:30] <legoktm>	 legoktm@mw1153:~$ vips --version
[21:37:30] <legoktm>	 vips-7.38.5-Sat Apr  5 11:17:49 UTC 2014
[21:38:31] <bawolff>	 Thanks
[21:38:45] <bawolff>	 oddly enough, that's the same version I have, but yet everything works on my computer
[21:38:55] <wikibugs>	 operations, Analytics, Discovery, EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1763618 (Ottomata) >> If we adopt a convention of always storing schema name and/or revision in the schemas themselves, then we can do like EventLo...
[21:39:41] <wikibugs>	 operations, Continuous-Integration-Scaling: Allow network flow between labs instance and scandium - https://phabricator.wikimedia.org/T116975#1763623 (hashar) NEW a:hashar
[21:41:54] <grrrit-wm>	 (PS5) Dzahn: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - https://gerrit.wikimedia.org/r/249056
[21:43:06] <grrrit-wm>	 (CR) Dzahn: "@Alex better like this?" [puppet] - https://gerrit.wikimedia.org/r/249056 (owner: Dzahn)
[21:50:09] <grrrit-wm>	 (PS7) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[21:51:04] <grrrit-wm>	 (PS8) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[21:52:35] <grrrit-wm>	 (PS9) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[21:54:15] <grrrit-wm>	 (PS1) JanZerebecki: webperf::navtiming additionaly store wikidata seperately [puppet] - https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568)
[21:55:01] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] webperf::navtiming additionaly store wikidata seperately [puppet] - https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) (owner: JanZerebecki)
[21:57:29] <grrrit-wm>	 (PS1) BBlack: cipher_sim: fix match ordering (server pref first) [puppet] - https://gerrit.wikimedia.org/r/249577
[21:57:51] <grrrit-wm>	 (CR) BBlack: [C: 2 V: 2] cipher_sim: fix match ordering (server pref first) [puppet] - https://gerrit.wikimedia.org/r/249577 (owner: BBlack)
[21:58:03] <grrrit-wm>	 (PS4) MaxSem: Beta: use final agreed upon deployment scheme [puppet] - https://gerrit.wikimedia.org/r/248374
[21:58:11] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:00:01] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[22:03:30] <grrrit-wm>	 (PS2) JanZerebecki: webperf::navtiming additionaly store wikidata seperately [puppet] - https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568)
[22:09:48] <wikibugs>	 operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1763832 (mmodell) #scap3 is going to provide a multi-paned terminal layout which monitors logstash in the same terminal window that is running scap. It will be a tremendous improvement over `fatalmonitor` comm...
[22:19:36] <grrrit-wm>	 (CR) Merlijn van Deen: "I agree it's not that important for noninteractive use. I'm more thinking of the test/deploy/revert cases, where reducing the time with 50" [puppet] - https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) (owner: Merlijn van Deen)
[22:21:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:21:30] <icinga-wm>	 PROBLEM - SSH on analytics1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:21:30] <icinga-wm>	 PROBLEM - RAID on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:22:00] <icinga-wm>	 PROBLEM - configured eth on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:22:01] <icinga-wm>	 PROBLEM - DPKG on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:22:10] <icinga-wm>	 PROBLEM - dhclient process on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:22:10] <icinga-wm>	 PROBLEM - Check size of conntrack table on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:22:10] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:22:25] <grrrit-wm>	 (PS1) John F. Lewis: mailman: fix syntax in disable_list ban_list echo [puppet] - https://gerrit.wikimedia.org/r/249584
[22:22:36] <grrrit-wm>	 (PS2) John F. Lewis: mailman: fix syntax in disable_list ban_list echo [puppet] - https://gerrit.wikimedia.org/r/249584
[22:22:36] <chasemp>	 ottomata: analytics1032?
[22:22:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[22:23:11] <icinga-wm>	 RECOVERY - SSH on analytics1032 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[22:23:11] <icinga-wm>	 RECOVERY - RAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical
[22:23:41] <icinga-wm>	 RECOVERY - configured eth on analytics1032 is OK: OK - interfaces up
[22:23:49] <icinga-wm>	 RECOVERY - DPKG on analytics1032 is OK: All packages OK
[22:23:50] <icinga-wm>	 RECOVERY - dhclient process on analytics1032 is OK: PROCS OK: 0 processes with command name dhclient
[22:23:51] <icinga-wm>	 RECOVERY - Check size of conntrack table on analytics1032 is OK: OK: nf_conntrack is 0 % full
[22:24:00] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[22:28:24] <wikibugs>	 operations: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1763907 (Krenair) NEW
[22:33:38] <grrrit-wm>	 (PS3) John F. Lewis: mailman: fix syntax in disable_list ban_list echo [puppet] - https://gerrit.wikimedia.org/r/249584 (https://phabricator.wikimedia.org/T116560)
[22:34:29] <grrrit-wm>	 (CR) Rush: [C: 2] mailman: fix syntax in disable_list ban_list echo [puppet] - https://gerrit.wikimedia.org/r/249584 (https://phabricator.wikimedia.org/T116560) (owner: John F. Lewis)
[22:35:10] <grrrit-wm>	 (PS3) JanZerebecki: webperf::navtiming additionaly store wikidata seperately [puppet] - https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568)
[22:36:37] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] webperf::navtiming additionaly store wikidata seperately [puppet] - https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) (owner: JanZerebecki)
[22:38:20] <urandom>	 !log Cassandra cleanup on restbase-test2001-a
[22:38:26] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:38:41] <wikibugs>	 operations, Analytics, Discovery, EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1763953 (GWicke) @ottomata, I think understanding the semantics of an event primarily requires knowledge of the topic. The topic in turn provides a...
[22:42:16] <wikibugs>	 operations: Track amount of package updates on systems - https://phabricator.wikimedia.org/T116742#1763985 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[22:50:23] <grrrit-wm>	 (PS4) JanZerebecki: webperf::navtiming additionaly store wikidata seperately [puppet] - https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568)
[22:51:49] <icinga-wm>	 PROBLEM - salt-minion processes on labvirt1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[22:57:25] <logmsgbot>	 !log ori@tin Synchronized php-1.27.0-wmf.4/extensions/CirrusSearch/includes/Hooks.php: I0e5f2d3b2 (duration: 00m 18s)
[22:57:31] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:59:44] <AaronSchulz>	 ori: https://gerrit.wikimedia.org/r/#/c/249197/
[23:00:04] <jouncebot>	 RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151028T2300).
[23:00:05] <jouncebot>	 ebernhardson jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:30] <Krenair>	 busy with uni stuff at the moment, can someone else take this?
[23:00:44] <ebernhardson>	 yea i can
[23:01:13] <jdlrobson>	 \o
[23:01:28] <ebernhardson>	 jdlrobson: your two patches, one is abandoned and the other has a -1
[23:01:43] <logmsgbot>	 !log ori@tin Synchronized php-1.27.0-wmf.4/extensions/WikimediaEvents/WikimediaEventsHooks.php: I0e5f2d3b2 (duration: 00m 18s)
[23:01:48] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:02:28] <ebernhardson>	 jdlrobson: it also makes sense to me to apply that change from mediawiki-config, its the standard way. but up to you
[23:02:45] <jdlrobson>	 ebernhardson: oops lemme find the merged one
[23:03:44] <ebernhardson>	 MaxSem: you -2'd your patch, presumably to prevent merging?
[23:03:46] <jdlrobson>	 ebernhardson: so the first one should be swatted and for the -1 yeh a config change would be better
[23:04:11] <MaxSem>	 ebernhardson, removed
[23:04:12] <ebernhardson>	 jdlrobson: so i should un-abandon the patch?
[23:04:20] <jdlrobson>	 ebernhardson: but it's out of date  https://gerrit.wikimedia.org/r/#/c/249585/1/tests/browser/features/support/pages/article_page.rb
[23:04:29] <jdlrobson>	 i swatted too early.. :-S
[23:04:35] <ebernhardson>	 jdlrobson: ok, that makes sense now :)
[23:04:44] <jdlrobson>	 ebernhardson: https://gerrit.wikimedia.org/r/#/c/249579/ is the one that needs to be swatted
[23:05:51] <grrrit-wm>	 (CR) EBernhardson: [C: 2] Switch www portals to be deployed from Git [mediawiki-config] - https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) (owner: MaxSem)
[23:05:59] <grrrit-wm>	 (Merged) jenkins-bot: Switch www portals to be deployed from Git [mediawiki-config] - https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) (owner: MaxSem)
[23:06:22] <ebernhardson>	 ori: done on tin?
[23:06:24] <jdlrobson>	 ebernhardson: want me to prepare the config change?
[23:06:29] <ori>	 yes
[23:06:29] <ebernhardson>	 jdlrobson: please
[23:08:19] <logmsgbot>	 !log ebernhardson@tin Synchronized portals: Switch www portals to be deployed from Git, but not being served from anywhere yet (duration: 00m 18s)
[23:08:21] <ebernhardson>	 MaxSem: ^^
[23:08:22] <grrrit-wm>	 (PS1) Jdlrobson: Enable Wikidata descriptions [mediawiki-config] - https://gerrit.wikimedia.org/r/249603 (https://phabricator.wikimedia.org/T101719)
[23:08:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:09:21] <ebernhardson>	 jdlrobson: which patch should be deployed first?
[23:10:04] <ebernhardson>	 jdlrobson: also, there might be something wrong with this commit message: https://gerrit.wikimedia.org/r/#/c/249603/
[23:10:44] <MaxSem>	 ebernhardson, lgtm, thanks
[23:11:13] <grrrit-wm>	 (CR) Florianschmidtwelzow: [C: -1] Enable Wikidata descriptions (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/249603 (https://phabricator.wikimedia.org/T101719) (owner: Jdlrobson)
[23:11:17] <jdlrobson>	 ebernhardson: MobileFrontend one
[23:11:32] <jdlrobson>	 ebernhardson: and wooaah on that commit msg wtf
[23:12:40] <grrrit-wm>	 (PS2) Jdlrobson: Enable Wikidata descriptions [mediawiki-config] - https://gerrit.wikimedia.org/r/249603 (https://phabricator.wikimedia.org/T101719)
[23:12:44] <ebernhardson>	 fwiw cirrus also typically just sets variables to specific values if its the same everywhere
[23:12:57] <ebernhardson>	 (in our case, things like list of endpoints to talk to, etc.)
[23:13:20] <jdlrobson>	 ebernhardson: not even sure how that happened
[23:14:09] <jdlrobson>	 ebernhardson: preferably config change can be deployed after i verify the MobileFrontend patch is working
[23:14:12] <jdlrobson>	 is that okay?
[23:14:20] <ebernhardson>	 jdlrobson: yea, just waiting on jenkins
[23:14:28] <ebernhardson>	 oh its done
[23:15:54] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.27.0-wmf.4/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/249585 (duration: 00m 19s)
[23:15:58] <ebernhardson>	 jdlrobson: ^^
[23:16:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:16:17] <jdlrobson>	 ebernhardson: sweet thx
[23:16:43] <jdlrobson>	 ebernhardson: mmm... not seeing it in production yet. guess i need to wait 5 mins
[23:16:50] <ebernhardson>	 for javascript, yea usually
[23:17:04] <jdlrobson>	 ebernhardson: works
[23:17:05] <jdlrobson>	 boom!
[23:17:07] <jdlrobson>	 thanx
[23:17:12] <jdlrobson>	 so the config change can now be applied :)
[23:17:32] <grrrit-wm>	 (CR) EBernhardson: [C: 2] Enable Wikidata descriptions [mediawiki-config] - https://gerrit.wikimedia.org/r/249603 (https://phabricator.wikimedia.org/T101719) (owner: Jdlrobson)
[23:17:39] <grrrit-wm>	 (Merged) jenkins-bot: Enable Wikidata descriptions [mediawiki-config] - https://gerrit.wikimedia.org/r/249603 (https://phabricator.wikimedia.org/T101719) (owner: Jdlrobson)
[23:18:45] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/249603 (duration: 00m 17s)
[23:18:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:19:14] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/mobile.php: https://gerrit.wikimedia.org/r/249603 (duration: 00m 18s)
[23:19:18] <grrrit-wm>	 (CR) Dzahn: [C: 1] "it looks reasonable since they are the same permissions as for the other service and these services belong together. i think we have to tr" [puppet] - https://gerrit.wikimedia.org/r/249501 (owner: Yurik)
[23:19:20] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:19:50] <AaronSchulz>	 ebernhardson: ori and I would like to do https://gerrit.wikimedia.org/r/#/c/249605/ soonish, let me know when that's ok
[23:19:51] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/249603 (duration: 00m 17s)
[23:19:52] <ebernhardson>	 jdlrobson: ok should be out now
[23:20:11] <grrrit-wm>	 (CR) EBernhardson: [C: 2] Drop cirrussearch write jobs after 3 hours of failures [mediawiki-config] - https://gerrit.wikimedia.org/r/243744 (owner: EBernhardson)
[23:20:17] <jdlrobson>	 ebernhardson: not seeing it right now..
[23:20:35] <grrrit-wm>	 (Merged) jenkins-bot: Drop cirrussearch write jobs after 3 hours of failures [mediawiki-config] - https://gerrit.wikimedia.org/r/243744 (owner: EBernhardson)
[23:21:18] <jdlrobson>	 ebernhardson: boom!
[23:21:18] <jdlrobson>	 working
[23:21:23] <jdlrobson>	 thanks a bunch. you're a superstar
[23:21:25] <jdlrobson>	 that's all from me! :)
[23:21:26] <grrrit-wm>	 (CR) Yurik: "Dzahn, the tilerator was one service and was simply broken apart to simplify management. Its the same code repo. When i was making the pat" [puppet] - https://gerrit.wikimedia.org/r/249501 (owner: Yurik)
[23:21:43] <grrrit-wm>	 (CR) Dzahn: "@chasemp: i think we should add this one to the next meeting too, with the access requests. do you think?" [puppet] - https://gerrit.wikimedia.org/r/249501 (owner: Yurik)
[23:21:52] <ebernhardson>	 AaronSchulz: sure done soon
[23:22:07] <grrrit-wm>	 (CR) Rush: "task association?" [puppet] - https://gerrit.wikimedia.org/r/249501 (owner: Yurik)
[23:22:44] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: (no message) (duration: 00m 17s)
[23:22:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:23:58] <grrrit-wm>	 (PS2) Yurik: Allow same perms to tileratorui as tilerator [puppet] - https://gerrit.wikimedia.org/r/249501 (https://phabricator.wikimedia.org/T112914)
[23:25:41] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: (no message) (duration: 00m 17s)
[23:25:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:59] <grrrit-wm>	 (PS10) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[23:29:19] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.27.0-wmf.4/extensions/WikimediaEvents/WikimediaEvents.php: https://gerrit.wikimedia.org/r/249642 (duration: 00m 17s)
[23:29:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:29:29] <ebernhardson>	 AaronSchulz: ok should be done now
[23:30:26] <grrrit-wm>	 (PS11) Dzahn: lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852
[23:30:48] <grrrit-wm>	 (CR) Dzahn: [C: 2] lint: double quoted strings pt.1 [puppet] - https://gerrit.wikimedia.org/r/243852 (owner: Dzahn)
[23:31:05] <ori>	 ebernhardson: thanks
[23:31:16] <logmsgbot>	 !log ori@tin Synchronized php-1.27.0-wmf.4/extensions/Translate/tag/PageTranslationHooks.php: I0e5f2d3b2 (duration: 00m 18s)
[23:31:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:34:37] <wikibugs>	 operations, Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1764315 (chasemp)
[23:34:43] <wikibugs>	 operations: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1764316 (Dzahn) I agree. I think we should:  - create a dedicated ganeti VM for people.wm and move it there - let all shell users have access to that ("comes with bastio...
[23:34:54] <wikibugs>	 operations, Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1433420 (chasemp) >>! In T104967#1761540, @fgiunchedi wrote: > afaict our puppet hooks for jessie does include `thirdparty` >  > ``` >     pac...
[23:37:06] <mutante>	 JohnFLewis: i don't even see the fix in the diff :)  ban_list?
[23:37:14] <mutante>	 cool though
[23:38:57] <AaronSchulz>	 ebernhardson: not seeing the move error manifest anymore \o/
[23:39:01] <mutante>	 oh trailing ' .. that took me a while
[23:39:33] <ebernhardson>	 AaronSchulz: excellent, it would be nice to someday remove all those php4 vestiges, some day :)
[23:40:09] <AaronSchulz>	 the problem with the user arg still needs investigating
[23:40:29] <AaronSchulz>	 some sort of esoteric bug going on in there
[23:43:25] <ebernhardson>	 AaronSchulz: should probably merge https://gerrit.wikimedia.org/r/#/c/249644/ back then?
[23:43:37] <grrrit-wm>	 (CR) Dzahn: "yea, but if you don't load the compat module (and if it disappears in a future release) then this would break and also working in 2.2 migh" [puppet] - https://gerrit.wikimedia.org/r/225552 (owner: Gergő Tisza)
[23:44:29] <ebernhardson>	 ah ok :)
[23:52:06] <grrrit-wm>	 (PS1) Dzahn: puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655
[23:52:57] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655 (owner: Dzahn)
[23:54:27] <grrrit-wm>	 (CR) Dzahn: "hmm. there's 'No file(s) found for import of '../../../manifests/nagios.pp'" again" [puppet] - https://gerrit.wikimedia.org/r/249655 (owner: Dzahn)
[23:54:49] <grrrit-wm>	 (PS2) Dzahn: puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655
[23:55:24] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655 (owner: Dzahn)
[23:58:58] <grrrit-wm>	 (PS1) Dzahn: etc,redis,dynamicproxy: fix some lint warnings [puppet] - https://gerrit.wikimedia.org/r/249658