[00:00:05] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T0000).
[00:54:19] (CR) Eevans: "> While I agree that no harm would come from increasing the map count" [puppet] - https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: Eevans)
[01:19:28] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:19:29] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:19:38] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:19:38] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:19:38] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:20:09] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:23:39] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:23:48] Operations, DBA, Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4159963 (Peachey88)
[01:25:29] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[01:25:38] RECOVERY - Disk space on stat1005 is OK: DISK OK
[01:25:38] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:25:38] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[01:26:09] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[01:26:28] RECOVERY - DPKG on stat1005 is OK: All packages OK
[01:28:39] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:33:06] (CR) MaxSem: [C: +1] Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: Dbarratt)
[01:35:00] " Unable to run wmf-auto-reimage-host: Failed to puppet_generate_certs
[01:35:03] :/
[02:02:09] PROBLEM - HHVM processes on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:02:10] PROBLEM - nutcracker port on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:03:49] PROBLEM - HHVM rendering on mw2163 is CRITICAL: connect to address 10.192.32.51 and port 80: Connection refused
[02:03:50] PROBLEM - nutcracker process on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:04:37] ^ scheduling downtime
[02:05:17] !log mw2163 through mw2166: since the wmf-auto-reimage failed after OS but before puppet run due to "Failed to puppet_generate_certs" i manually logged in with install-console and signed puppet certs (T174431)
[02:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:25] T174431: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431
[02:13:09] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524708783 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4778 keys, up 4 minutes 11 seconds - replication_delay is 1524708783
[02:14:09] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[02:14:19] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1524708856 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708436 keys, up 4 minutes 32 seconds - replication_delay is 1524708856
[02:15:40] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1524708936 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4707459 keys, up 4 minutes 46 seconds - replication_delay is 1524708936
[02:16:10] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524708967 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4778 keys, up 7 minutes 15 seconds - replication_delay is 1524708967
[02:16:49] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524708998 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4787 keys, up 4 minutes 45 seconds - replication_delay is 1524708998
[02:17:19] RECOVERY - HHVM processes on mw2163 is OK: PROCS OK: 1 process with command name hhvm
[02:17:20] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1524709036 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4710337 keys, up 4 minutes 31 seconds - replication_delay is 1524709036
[02:19:09] RECOVERY - HHVM rendering on mw2163 is OK: HTTP OK: HTTP/1.1 200 OK - 74979 bytes in 7.812 second response time
[02:19:10] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381
[02:20:10] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1524709206 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4606618 keys, up 4 minutes 20 seconds - replication_delay is 1524709206
[02:21:09] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1524709262 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 7845 keys, up 4 minutes 9 seconds - replication_delay is 1524709262
[02:21:19] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381
[02:24:09] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524709443 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5822 keys, up 2 minutes 4 seconds - replication_delay is 1524709443
[02:24:59] PROBLEM - Check systemd state on rdb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:28:23] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524709687 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5822 keys, up 6 minutes 8 seconds - replication_delay is 1524709687
[02:29:13] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524709750 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4704 keys, up 3 minutes 59 seconds - replication_delay is 1524709750
[02:32:22] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524709939 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 7346 keys, up 4 minutes 7 seconds - replication_delay is 1524709939
[02:32:23] PROBLEM - HHVM rendering on mw2164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:34:03] PROBLEM - Apache HTTP on mw2166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:40:32] ACKNOWLEDGEMENT - Apache HTTP on mw2166 is CRITICAL: connect to address 10.192.32.54 and port 80: Connection refused daniel_zahn reinstall
[02:40:32] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw2166 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.150 second response time daniel_zahn reinstall
[02:42:03] RECOVERY - Apache HTTP on mw2166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 5.683 second response time
[02:44:42] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4707866 keys, up 31 minutes 51 seconds - replication_delay is 12
[02:45:32] RECOVERY - HHVM rendering on mw2164 is OK: HTTP OK: HTTP/1.1 200 OK - 75654 bytes in 5.214 second response time
[02:47:02] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[02:54:42] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 612 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4707866 keys, up 41 minutes 51 seconds - replication_delay is 612
[02:57:33] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 5492 keys, up 48 minutes 37 seconds - replication_delay is 21
[03:00:54] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708123 keys, up 51 minutes 3 seconds - replication_delay is 6
[03:01:04] PROBLEM - Host mw2163 is DOWN: PING CRITICAL - Packet loss = 100%
[03:01:45] RECOVERY - nutcracker port on mw2163 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:01:49] ACKNOWLEDGEMENT - Host mw2163 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reboot
[03:01:54] RECOVERY - Host mw2163 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[03:02:34] RECOVERY - nutcracker process on mw2163 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[03:03:44] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8368 keys, up 46 minutes 44 seconds - replication_delay is 35
[03:05:45] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 5734 keys, up 50 minutes 14 seconds - replication_delay is 23
[03:07:44] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 631 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 5492 keys, up 58 minutes 47 seconds - replication_delay is 631
[03:08:54] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708205 keys, up 56 minutes 4 seconds - replication_delay is 48
[03:10:14] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:10:44] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381
[03:10:54] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 609 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708123 keys, up 1 hours 1 minutes - replication_delay is 609
[03:11:45] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 5633 keys, up 56 minutes 14 seconds - replication_delay is 45
[03:13:44] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 636 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8368 keys, up 56 minutes 45 seconds - replication_delay is 636
[03:13:54] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[03:14:54] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708175 keys, up 1 hours 2 minutes - replication_delay is 34
[03:16:34] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5366 keys, up 1 hours 4 minutes - replication_delay is 26
[03:21:54] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 649 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 5633 keys, up 1 hours 6 minutes - replication_delay is 649
[03:24:05] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708095 keys, up 1 hours 14 minutes - replication_delay is 41
[03:24:55] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 637 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708175 keys, up 1 hours 12 minutes - replication_delay is 637
[03:26:34] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 629 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5366 keys, up 1 hours 14 minutes - replication_delay is 629
[03:27:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 722.16 seconds
[03:30:04] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708079 keys, up 1 hours 17 minutes - replication_delay is 50
[03:34:05] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 639 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708095 keys, up 1 hours 24 minutes - replication_delay is 639
[03:38:05] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708131 keys, up 1 hours 28 minutes - replication_delay is 6
[03:40:04] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708079 keys, up 1 hours 27 minutes - replication_delay is 651
[03:44:05] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[03:44:34] RECOVERY - Check systemd state on rdb1004 is OK: OK - running: The system is fully operational
[03:45:05] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708099 keys, up 1 hours 35 minutes - replication_delay is 50
[03:51:44] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4606513 keys, up 1 hours 35 minutes - replication_delay is 29
[03:52:14] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708049 keys, up 1 hours 39 minutes - replication_delay is 0
[03:54:39] <_joe_> !log stopping redis replication from eqiad to codfw for the jobqueue cluster, we have an issue ongoing with CirrusSearch jobs and replication is broken
[03:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:55:14] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 654 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708099 keys, up 1 hours 45 minutes - replication_delay is 654
[03:59:05] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8021 keys, up 1 hours 42 minutes - replication_delay is 34
[04:01:44] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 631 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4606513 keys, up 1 hours 45 minutes - replication_delay is 631
[04:02:14] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 602 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708049 keys, up 1 hours 49 minutes - replication_delay is 602
[04:02:44] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4606513 keys, up 1 hours 46 minutes - replication_delay is 4
[04:04:05] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381
[04:05:05] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 7976 keys, up 1 hours 48 minutes - replication_delay is 35
[04:05:55] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5366 keys, up 44 seconds
[04:06:04] PROBLEM - confd service on rdb2005 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[04:06:14] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 5492 keys, up 1 minutes 1 seconds
[04:06:14] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 5633 keys, up 1 minutes 1 seconds
[04:06:15] PROBLEM - confd service on rdb2001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[04:06:15] <_joe_> expected ^^ I just stopped confd there
[04:06:33] <_joe_> in order to be able to disable replication
[04:06:44] PROBLEM - confd service on rdb2003 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[04:08:24] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708049 keys, up 1 minutes 6 seconds
[04:08:24] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708099 keys, up 1 minutes 7 seconds
[04:09:15] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481
[04:09:24] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[04:09:24] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6480
[04:09:34] <_joe_> wat
[04:11:15] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5822 keys, up 6 minutes 6 seconds
[04:12:24] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6480
[04:12:24] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481
[04:13:24] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4704 keys, up 8 minutes 9 seconds
[04:13:24] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 7138 keys, up 8 minutes 9 seconds
[04:13:24] RECOVERY - Check health of redis instance on 6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5162 keys, up 1 hours 59 minutes
[04:13:54] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4707459 keys, up 2 hours 3 minutes
[04:24:30] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4160039 (Smalyshev) Open→Resolved
[04:29:54] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 187.18 seconds
[04:31:08] <_joe_> SMalyshev: around?
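The redis health checks above report `replication_delay` either as a small number of seconds (e.g. 612, compared against the 600 s threshold) or as a raw epoch timestamp (e.g. 1524708783) right after an instance restarts and has not yet synced from its master. A minimal sketch of how such a check could derive the delay from a slave's `INFO replication` fields and apply the threshold; the timestamp-fallback behavior is an assumption inferred from the alert text, not the actual WMF plugin:

```python
import time

CRIT_THRESHOLD = 600  # seconds, the second number printed in the alerts above

def replication_delay(info, now=None):
    """Estimate slave lag from a parsed Redis INFO dict.

    Uses master_last_io_seconds_ago while the replication link is up; if
    the slave has never completed a sync, falls back to the current epoch
    time, which is why freshly restarted instances above alert with a
    delay that looks like a unix timestamp (assumption, for illustration).
    """
    now = time.time() if now is None else now
    if info.get("master_link_status") == "up":
        return info.get("master_last_io_seconds_ago", 0)
    return int(now)  # never synced: delay degenerates to a timestamp

def check(info, now=None):
    delay = replication_delay(info, now)
    state = "CRITICAL" if delay > CRIT_THRESHOLD else "OK"
    return state, delay

# A healthy slave 12 seconds behind its master:
print(check({"master_link_status": "up", "master_last_io_seconds_ago": 12}))
# → ('OK', 12)
```

This explains the pattern in the log: a just-restarted instance alerts with a huge "delay" until its first sync completes, then recovers with a single-digit delay.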
[04:31:33] <_joe_> if so, can you ping ebernhardson please? we have what I think are huge issues with the elasticsearch cluster in codfw
[04:31:43] yes here
[04:32:03] <_joe_> It's very early here and getting on the phone would wake up everyone in the house
[04:32:13] <_joe_> I'm opening a UBN! ticket now
[04:32:14] unfortunately I don't know any way to get to ebernhardson except IRC/email...
[04:32:27] <_joe_> the office wiki contact list I guess
[04:32:29] maybe gehel is online or will be soon?
[04:32:35] ah, I'll try
[04:33:03] <_joe_> heh I guess guillame will be around in a few hours
[04:33:10] <_joe_> it's 6.33 AM here
[04:36:04] _joe_: pinged, he's on his way
[04:36:11] <_joe_> thanks
[04:36:31] Operations, CirrusSearch, Discovery, Discovery-Search, Search-Platform-Programs: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4160062 (Joe)
[04:38:13] ACKNOWLEDGEMENT - confd service on rdb2001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto T193112
[04:38:13] ACKNOWLEDGEMENT - confd service on rdb2003 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto T193112
[04:38:13] ACKNOWLEDGEMENT - confd service on rdb2005 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto T193112
[04:39:49] !log unfreeze writes to elasticsearch codfw cluster
[04:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:41] Operations, CirrusSearch, Discovery, Discovery-Search, Search-Platform-Programs: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4160076 (EBernhardson) It looks like writes were frozen to the codfw clust...
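The "unfreeze writes" log entry above is the resolution of the incident: while writes to the codfw Elasticsearch cluster were frozen, write jobs could not complete and the job queue backed up. A toy model of that gating behavior (all names hypothetical; this is an illustration of the freeze/requeue interaction, not CirrusSearch's actual code):

```python
# Toy sketch: write jobs consult a shared "frozen" flag and requeue
# themselves instead of writing, so the backlog grows until the thaw.

class Cluster:
    def __init__(self):
        self.frozen = False
        self.docs = []

def run_write_job(cluster, doc, retry_queue):
    if cluster.frozen:
        retry_queue.append(doc)  # back off: queue grows while frozen
        return "requeued"
    cluster.docs.append(doc)
    return "written"

codfw = Cluster()
codfw.frozen = True              # writes frozen, jobs pile up
backlog = []
run_write_job(codfw, "page-1", backlog)

codfw.frozen = False             # "!log unfreeze writes ..."
for doc in list(backlog):        # drain the backlog after the thaw
    backlog.remove(doc)
    run_write_job(codfw, doc, backlog)
# codfw.docs is now ["page-1"] and backlog is empty
```

The sketch shows why the queue "gets back to normal sizes" only after the unfreeze: nothing in the retry loop can make progress while the flag is set.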
[04:49:27] Operations, CirrusSearch, Discovery, Discovery-Search, Search-Platform-Programs: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4160077 (EBernhardson) I suppose we should lower the drop timeout, in `$wg...
[05:03:14] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[05:03:44] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[05:18:03] !log Deploy schema change on dbstore1002:s2 - T191519 T188299 T190148
[05:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:11] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:18:11] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:18:11] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:19:24] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[05:19:55] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[05:24:00] zeljkof: You've posted on sev MMV patches, that jenkins failure comes from (now) merged task. Jenkins still fails, f.e. on https://gerrit.wikimedia.org/r/364175
[05:32:06] Warning: Alert for device cr1-ulsfo.wikimedia.org - Inbound interface errors
[05:32:58] Operations, CirrusSearch, Discovery, Discovery-Search, Search-Platform-Programs: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4160124 (Joe) The queue is getting back to normal sizes, and the job produ...
[05:33:05] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[05:33:34] oh wow that yellow is bright
[05:33:35] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[05:39:39] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4121650 (Legoktm) >>! In T191921#4122327, @Joe wrote: > What are the blockers for the use of PHP7? > > All I see on the ticket mentioned is the memc...
[05:42:06] Warning cleared: Device cr1-ulsfo.wikimedia.org recovered from Inbound interface errors
[06:02:15] (PS1) Marostegui: s1.hosts: Add db1116:3311 [software] - https://gerrit.wikimedia.org/r/429135 (https://phabricator.wikimedia.org/T190704)
[06:04:08] Volker_E: I think that problem is resolved
[06:04:44] Well, is there a problem, or are my comments confusing?
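The `Router interfaces` checks above condense per-interface SNMP state into `up/down/dormant/excluded/unused` counts and go CRITICAL as soon as any interface is down. A small sketch of that tallying logic (the state names and the "any down is CRITICAL" rule are read off the alert text; the real Icinga plugin may differ):

```python
from collections import Counter

def summarize(interfaces):
    """Tally per-interface operational states into the counts seen above."""
    counts = Counter({"up": 0, "down": 0, "dormant": 0, "excluded": 0, "unused": 0})
    counts.update(interfaces.values())
    return counts

def check(host, interfaces):
    c = summarize(interfaces)
    state = "CRITICAL" if c["down"] > 0 else "OK"
    detail = ", ".join(f"{k}: {v}" for k, v in c.items())
    return f"{state}: host {host}, interfaces {detail}"

# Hypothetical interface names; 63 up, 1 down, matching the cr1-ulsfo alert:
ifaces = {f"xe-0/0/{i}": "up" for i in range(63)}
ifaces["xe-0/0/63"] = "down"
print(check("198.35.26.192", ifaces))
```

A single flapping port is enough to toggle the whole check between CRITICAL and OK, which is exactly the pattern cr1-ulsfo and cr1-eqord show through the morning.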
[06:07:51] nice, we have librenms reporting here now
[06:09:00] zeljkof: I hoped for your help hunting down, what's currently stopping jenkins…
[06:11:54] Volker_E: on the phone now, will check in an hour or so
[06:12:26] zeljkof: great, not pressing
[06:19:17] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=elasticsearch
[06:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:44] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:51:19] Operations, Discovery, Discovery-Analysis, Product-Analytics, Discovery-Search (Current work): Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967#4160213 (Gehel) a: Gehel→None
[07:02:17] (PS1) Muehlenhoff: Remove Madhu from Icinga config [puppet] - https://gerrit.wikimedia.org/r/429142
[07:04:05] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[07:04:14] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:04:51] (PS1) Jcrespo: mariadb: Depool db1090, repool db1122 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/429143 (https://phabricator.wikimedia.org/T192979)
[07:05:42] (CR) Muehlenhoff: [C: +2] Remove Madhu from Icinga config [puppet] - https://gerrit.wikimedia.org/r/429142 (owner: Muehlenhoff)
[07:07:45] (PS1) Muehlenhoff: Remove access credentials for Madhu [puppet] - https://gerrit.wikimedia.org/r/429144
[07:10:58] (CR) Muehlenhoff: [C: +2] Remove access credentials for Madhu [puppet] - https://gerrit.wikimedia.org/r/429144 (owner: Muehlenhoff)
[07:15:20] (CR) Jcrespo: [C: +2] mariadb: Depool db1090, repool db1122 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/429143 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[07:16:43] (Merged) jenkins-bot: mariadb: Depool db1090, repool db1122 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/429143 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[07:16:52] !log re-enabling puppet on rdb2* - T193112
[07:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:58] T193112: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112
[07:19:42] (CR) jenkins-bot: mariadb: Depool db1090, repool db1122 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/429143 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[07:20:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1090, pool db1122 with full weight (duration: 01m 23s)
[07:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:38] !log restarting redis masters in codfw - T193112
[07:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:50] RECOVERY - confd service on rdb2001 is OK: OK - confd is active
[07:23:19] RECOVERY - confd service on rdb2003 is OK: OK - confd is active
[07:24:39] RECOVERY - confd service on rdb2005 is OK: OK - confd is active
[07:25:10] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[07:25:35] ^ that's my restart, should be back up already
[07:26:10] RECOVERY - Check health of redis instance on 6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3868 keys, up 1 minutes 37 seconds - replication_delay is 0
[07:27:29] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1524727642 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4606514 keys, up 33 seconds - replication_delay is 1524727642
[07:27:59] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524727674 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 5633 keys, up 24 seconds - replication_delay is 1524727674
[07:28:09] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5822 keys, up 3 hours 22 minutes
[07:28:09] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4704 keys, up 3 hours 22 minutes
[07:28:10] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 6556 keys, up 3 hours 22 minutes
[07:28:19] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1524727687 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708049 keys, up 1 minutes 21 seconds - replication_delay is 1524727687
[07:28:19] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[07:28:39] PROBLEM - Check health of redis instance on 6478 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6478 has 1 databases (db0) with 4 keys, up 3 hours 23 minutes
[07:28:39] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524727715 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5366 keys, up 31 seconds - replication_delay is 1524727715
[07:28:59] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524727734 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 5492 keys, up 53 seconds - replication_delay is 1524727734
[07:29:09] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4706278 keys, up 2 minutes 19 seconds - replication_delay is 0
[07:29:29] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4706329 keys, up 2 minutes 38 seconds - replication_delay is 0
[07:29:29] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4604947 keys, up 2 minutes 35 seconds - replication_delay is 0
[07:29:39] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3450 keys, up 1 minutes 31 seconds - replication_delay is 0
[07:29:59] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3791 keys, up 1 minutes 47 seconds - replication_delay is 0
[07:30:00] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3582 keys, up 1 minutes 53 seconds - replication_delay is 0
[07:36:19] RECOVERY - Check health of redis instance on 6478 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6478 has 1 databases (db0) with 4 keys, up 7 seconds - replication_delay is 6
[07:37:40] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3726 keys, up 1 minutes 22 seconds - replication_delay is 0
[07:38:40] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3371 keys, up 2 minutes 19 seconds - replication_delay is 0
[07:38:40] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 5133 keys, up 2 minutes 13 seconds - replication_delay is 0
[07:41:58] (PS1) Jcrespo: mariadb: Convert db1090 into a core multiinstance host [puppet] - https://gerrit.wikimedia.org/r/429148 (https://phabricator.wikimedia.org/T192979)
[07:43:18] (PS1) Jcrespo: dbhosts: Convert db1090 into a core multiinstance host [software] - https://gerrit.wikimedia.org/r/429149
[07:45:02] !log stopping db1090 mariadb instance to move its path, port and socket
[07:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:32] (PS2) Jcrespo: mariadb: Convert db1090 into a core multiinstance host [puppet] - https://gerrit.wikimedia.org/r/429148 (https://phabricator.wikimedia.org/T192979)
[07:47:35] (CR) Jcrespo: [C: +2] mariadb: Convert db1090 into a core multiinstance host [puppet] - https://gerrit.wikimedia.org/r/429148 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[07:49:24] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) (owner: MaxSem)
[07:50:34] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) (owner: MaxSem)
[07:51:32] (CR) Marostegui: [C: +2] s1.hosts: Add db1116:3311 [software] - https://gerrit.wikimedia.org/r/429135 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:52:22] (Merged) jenkins-bot: s1.hosts: Add db1116:3311 [software] - https://gerrit.wikimedia.org/r/429135 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:53:27] (CR) Jcrespo: [C: +2] dbhosts: Convert db1090 into a core multiinstance host [software] - https://gerrit.wikimedia.org/r/429149 (owner: Jcrespo)
[07:58:00] !log Deploy schema change on db1090 - T191519 T188299 T190148
[07:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:07] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[07:58:07] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[07:58:07] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[07:58:34] Operations, Wikispeech, Wikispeech-WMSE: TTS server deployment strategy - https://phabricator.wikimedia.org/T193072#4160271 (akosiaris) > So, Operations can you tell @Lokal_Profil whether docker-compose is a valid deployment strategy? Or if they need to do so something else... A valid deployment str...
[08:04:31] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[08:05:02] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[08:05:39] (PS1) Marostegui: db-eqiad,db-codfw.php: Include db1116:3311 in config [mediawiki-config] - https://gerrit.wikimedia.org/r/429150 (https://phabricator.wikimedia.org/T190704)
[08:07:12] (PS1) Muehlenhoff: Stop using mw-no-tmp.cfg partman recipe and remove it [puppet] - https://gerrit.wikimedia.org/r/429151 (https://phabricator.wikimedia.org/T156955)
[08:08:05] (CR) Jcrespo: [C: -1] "not a mediawiki host, shouldn't be on config" [mediawiki-config] - https://gerrit.wikimedia.org/r/429150 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[08:11:08] (Abandoned) Marostegui: db-eqiad,db-codfw.php: Include db1116:3311 in config [mediawiki-config] - https://gerrit.wikimedia.org/r/429150 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[08:19:03] (CR) Filippo Giunchedi: [C: +1] Stop using mw-no-tmp.cfg partman recipe and remove it [puppet] - https://gerrit.wikimedia.org/r/429151 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[08:19:19] (PS1)
10Muehlenhoff: Remove graphite-dmcache.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/429152 (https://phabricator.wikimedia.org/T156955) [08:24:33] !log stop and upgrade db1109 [08:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:34] 10Operations, 10OTRS: Upgrade OTRS to 5.0.27 - https://phabricator.wikimedia.org/T193118#4160296 (10akosiaris) [08:25:56] 10Operations, 10OTRS: Upgrade OTRS to 5.0.27 - https://phabricator.wikimedia.org/T193118#4160310 (10akosiaris) p:05Triage>03Normal [08:26:34] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#4160313 (10akosiaris) [08:26:47] (03CR) 10Filippo Giunchedi: [C: 031] Remove graphite-dmcache.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/429152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:26:53] 10Operations, 10OTRS: Upgrade OTRS to 5.0.27 - https://phabricator.wikimedia.org/T193118#4160296 (10akosiaris) [08:29:51] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [08:30:22] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:30:42] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#4160318 (10akosiaris) p:05Triage>03Low Since I 'll probably be the one to work on this, and I need to estimate the amount of work (both operationally as well as agent testing)... 
[08:32:23] !log re-attempt reimage of mw1246 (failed yesterday with an error on the puppetmaster, testing whether this can be reproduced) [08:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:39] (03PS1) 10Jcrespo: mariadb: Depool db1069 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429153 (https://phabricator.wikimedia.org/T192979) [08:44:21] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:44:42] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:45:22] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:46:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:47:06] 08Warning Alert for device cr1-ulsfo.wikimedia.org - Inbound interface errors [08:47:46] (03PS1) 10Jcrespo: mariadb: Depool db1086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429154 (https://phabricator.wikimedia.org/T192979) [08:47:48] 10Operations, 10OTRS: Upgrade OTRS to 5.0.27 - https://phabricator.wikimedia.org/T193118#4160326 (10akosiaris) I 've checked the OTRS templates we maintain (https://gerrit.wikimedia.org/g/operations/software/otrs/+/refs/heads/master) and there is no 
change in them for 5.0.27. So we stay with version `1.0.11` o... [08:49:42] RECOVERY - HTTP availability for Varnish on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:50:11] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [08:50:21] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:50:50] (03PS1) 10Jcrespo: install_server: Reimage db1109 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/429155 [08:51:13] !log reimaging mw1320, mw1321, mw1322 (app servers) to stretch [08:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:45] (03CR) 10Jcrespo: [C: 032] install_server: Reimage db1109 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/429155 (owner: 10Jcrespo) [08:52:41] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [08:53:42] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:54:21] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [08:55:31] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:56:06] 10Operations, 10OTRS: Upgrade OTRS to 5.0.27 - https://phabricator.wikimedia.org/T193118#4160328 (10akosiaris) 05Open>03Resolved The upgrade was easy enough so I 
just completed it with minimal downtime (a few secs). [08:56:36] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429154 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [08:57:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:57:50] (03Merged) 10jenkins-bot: mariadb: Depool db1086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429154 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [08:59:27] (03CR) 10jenkins-bot: mariadb: Depool db1086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429154 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [09:01:17] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1086 (duration: 01m 16s) [09:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:04] (03PS2) 10Muehlenhoff: Stop including mediawiki::packages::multimedia for contint [puppet] - 10https://gerrit.wikimedia.org/r/428314 [09:02:24] !log Drop prefswitch_survey on s5 and s6 - T173439 [09:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:07] 08̶W̶a̶r̶n̶i̶n̶g Device cr1-ulsfo.wikimedia.org recovered from Inbound interface errors [09:09:05] (03PS2) 10Muehlenhoff: Remove graphite-dmcache.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/429152 (https://phabricator.wikimedia.org/T156955) [09:11:03] (03CR) 10Muehlenhoff: [C: 032] Remove graphite-dmcache.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/429152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:13:48] !log Drop prefswitch_survey on s4 - T173439 [09:13:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:31] (03CR) 10ArielGlenn: "This looks fine as far as it goes, but what I had in mind was that these files for each batch be kept around for failed shards and re-used" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [09:15:32] !log Temp disabling cr1-ulsfo:xe-1/2/0 (Chicago transport) due to stability issues [09:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:05] !log Drop prefswitch_survey on s2 - T173439 [09:16:07] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [09:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:07] PROBLEM - Apache HTTP on mw1320 is CRITICAL: connect to address 10.64.32.41 and port 80: Connection refused [09:27:07] PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: connect to address 10.64.32.43 and port 443: Connection refused [09:27:07] PROBLEM - Check size of conntrack table on mw1321 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:27:08] PROBLEM - MD RAID on mw1320 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:27:08] PROBLEM - Check systemd state on mw1322 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[09:27:26] ^silencing [09:30:14] !log Drop prefswitch_survey on s7 - T173439 [09:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:07] (03PS1) 10Jcrespo: mariadb: Repool db1109 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429156 [09:35:08] 10Operations: Upgrade ganeti hosts to stretch - https://phabricator.wikimedia.org/T193121#4160393 (10akosiaris) [09:35:17] 10Operations: Upgrade ganeti hosts to stretch - https://phabricator.wikimedia.org/T193121#4160406 (10akosiaris) p:05Triage>03Normal [09:38:03] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1109 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429156 (owner: 10Jcrespo) [09:39:22] (03Merged) 10jenkins-bot: mariadb: Repool db1109 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429156 (owner: 10Jcrespo) [09:39:49] (03CR) 10jenkins-bot: mariadb: Repool db1109 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429156 (owner: 10Jcrespo) [09:40:22] (03CR) 10Gehel: [C: 031] "LGTM" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [09:41:43] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1109 with low load (duration: 01m 16s) [09:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:08] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [09:43:10] (03CR) 10Hoo man: "> This looks fine as far as it goes, but what I had in mind was that these files for each batch be kept around for failed shards and re-us" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [09:43:50] (03CR) 10Hoo man: Wikidata JSON dump: Only dump batches of ~400,000 pages at once (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [09:45:36] !log Drop prefswitch_survey on s3 - T173439 [09:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:30] !log eqiad-prod: more weight to ms-be104[0-3] for container/account - T190081 [09:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:38] T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081 [09:51:01] 10Operations, 10User-fgiunchedi: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4160427 (10fgiunchedi) [09:53:27] (03PS1) 10Muehlenhoff: Use mw-raid1.cfg partman recipe for mw1221-mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/429158 (https://phabricator.wikimedia.org/T106381) [09:57:49] !log Drop prefswitch_survey on s1 - T173439 [09:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:18] RECOVERY - Check size of conntrack table on mw1321 is OK: OK: nf_conntrack is 0 % full [09:59:27] RECOVERY - MD RAID on mw1320 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [10:00:23] (03PS2) 10Muehlenhoff: Stop using mw-no-tmp.cfg partman recipe and remove it [puppet] - 10https://gerrit.wikimedia.org/r/429151 (https://phabricator.wikimedia.org/T156955) [10:00:45] (03CR) 10Filippo Giunchedi: [C: 031] Use mw-raid1.cfg partman recipe for mw1221-mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/429158 (https://phabricator.wikimedia.org/T106381) (owner: 10Muehlenhoff) [10:02:27] RECOVERY - Nginx local proxy to apache on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 5.895 second response time [10:04:17] (03CR) 10Hashar: [C: 031] Stop including mediawiki::packages::multimedia for contint [puppet] - 10https://gerrit.wikimedia.org/r/428314 (owner: 10Muehlenhoff) [10:07:29] RECOVERY - Check systemd state on mw1322 is OK: OK - running: The system 
is fully operational [10:09:50] (03CR) 10Muehlenhoff: [C: 032] Stop using mw-no-tmp.cfg partman recipe and remove it [puppet] - 10https://gerrit.wikimedia.org/r/429151 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:29:18] !log reimaging mw1269, mw1323, mw1324 (app servers) to stretch [10:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:32] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4160536 (10MarcoAurelio) [10:35:39] !log reimaging mw1312 mw1317, mw1339 (API servers) to stretch [10:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:11] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4160592 (10MarcoAurelio) p:05Triage>03High [10:53:34] (03CR) 10Mobrovac: [C: 031] Enable eventbus Kafka producer snappy compression [puppet] - 10https://gerrit.wikimedia.org/r/429007 (https://phabricator.wikimedia.org/T193080) (owner: 10Ottomata) [10:59:02] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4160845 (10MarcoAurelio) ``` maurelio@deployment-cpjobqueue:/$ sudo du -sh / du: cannot access ‘/proc/32319/task/32319/fd/3’: No such file or directory du: cannot access ‘/proc/32319/task... 
[11:04:30] (03CR) 10Mobrovac: "IMHO it would be simpler to set the default in cassandra::sysctl to what we deem to be a good value for cassandra clusters in general and " [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [11:05:06] (03PS3) 10Muehlenhoff: Stop including mediawiki::packages::multimedia for contint [puppet] - 10https://gerrit.wikimedia.org/r/428314 [11:05:44] (03CR) 10Muehlenhoff: [C: 032] Stop including mediawiki::packages::multimedia for contint [puppet] - 10https://gerrit.wikimedia.org/r/428314 (owner: 10Muehlenhoff) [11:10:55] RECOVERY - Kafka Broker Replica Max Lag on kafka1001 is OK: (C)5e+05 ge (W)1e+05 ge 8.509e+04 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [11:14:45] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4160890 (10MarcoAurelio) @Joe Any idea how to fix this? Delete `/var/vda3`? Maybe some work is also needed on `tmpfs`? [11:16:36] RECOVERY - Kafka Broker Replica Max Lag on kafka1003 is OK: (C)5e+05 ge (W)1e+05 ge 9.862e+04 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [11:19:01] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4160899 (10fdans) OK, so to determine the periodicity of the cron job, I ran a city query over ~17,000 IP addresses with: - The most current GeoIP d... 
[11:22:05] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4160932 (10faidon) As far as periodicity goes, note that MaxMind [[ https://support.maxmind.com/geoip-faq/geoip2-and-geoip-legacy-databases/how-often-a... [11:22:12] 10Operations: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155#4160919 (10Volans) [11:22:21] 10Operations: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155#4160936 (10Volans) p:05Triage>03Normal [11:27:56] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4160939 (10MarcoAurelio) File's plently of: ``` {"name":"cpjobqueue","hostname":"deployment-cpjobqueue","pid":136,"level":50,"err":{"message":"KafkaConsumer is not connected","name":"cpj... [11:27:57] (03PS3) 10Arturo Borrero Gonzalez: labs_bootstrapvz: address labtest issues [puppet] - 10https://gerrit.wikimedia.org/r/428694 (https://phabricator.wikimedia.org/T181523) [11:31:03] jouncebot, next [11:31:08] In 1 hour(s) and 28 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1300) [11:45:11] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:45:20] (03CR) 10Muehlenhoff: [C: 031] "Looks good, just couple of small comments." 
(036 comments) [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 (https://phabricator.wikimedia.org/T192763) (owner: 10Filippo Giunchedi) [11:45:41] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:45:51] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:46:01] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:46:02] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:46:23] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:47:09] ^ looking [11:49:03] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [11:49:13] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [11:49:32] RECOVERY - DPKG on stat1005 is OK: All packages OK [11:49:52] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [11:51:14] some number crunching job on stat1005 consumed all memory and nrpe failed to fork/service failed, puppet run restarted it correctly [11:51:22] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:03:01] !log mobrovac@tin Started deploy [cpjobqueue/deploy@7fbb152]: Support the exclude_topics config stanza [12:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:13] !log mobrovac@tin Finished deploy [cpjobqueue/deploy@7fbb152]: Support the exclude_topics config stanza (duration: 01m 12s) [12:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:04] (03PS4) 10Arturo Borrero Gonzalez: labs_bootstrapvz: address labtest issues [puppet] - 10https://gerrit.wikimedia.org/r/428694 (https://phabricator.wikimedia.org/T181523) [12:14:21] (03CR) 10Arturo Borrero Gonzalez: [C: 032] labs_bootstrapvz: address labtest issues 
[puppet] - 10https://gerrit.wikimedia.org/r/428694 (https://phabricator.wikimedia.org/T181523) (owner: 10Arturo Borrero Gonzalez) [12:18:48] (03PS4) 10Muehlenhoff: Don't include mediawiki::multimedia on labweb* [puppet] - 10https://gerrit.wikimedia.org/r/428298 [12:25:08] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/11041/" [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [12:31:56] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.133 second response time [12:38:28] 10Operations, 10monitoring: Monitor the BIOS boot order and parameters - https://phabricator.wikimedia.org/T193160#4161046 (10Volans) p:05Triage>03Normal [12:39:56] (03PS2) 10Muehlenhoff: Use mw-raid1.cfg partman recipe for mw1221-mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/429158 (https://phabricator.wikimedia.org/T106381) [12:42:45] (03CR) 10Muehlenhoff: [C: 032] Use mw-raid1.cfg partman recipe for mw1221-mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/429158 (https://phabricator.wikimedia.org/T106381) (owner: 10Muehlenhoff) [12:42:48] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4161078 (10Milimetric) thanks @faidon, we were just seeing if maybe the accuracy of the old databases is really high, we can schedule the jobs less oft... 
[12:48:24] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429174 (https://phabricator.wikimedia.org/T190148) [12:49:32] jouncebot: next [12:49:32] In 0 hour(s) and 10 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1300) [12:50:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429174 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [12:51:09] !log reindexing lost updates on elasticsearch - T193112 [12:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:15] T193112: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112 [12:51:23] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429174 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [12:51:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429174 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [12:52:35] no more jerkins [12:53:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1113 for alter table (duration: 01m 33s) [12:53:20] !log Deploy schema change on db1113:3312 - T191519 T188299 T190148 [12:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:30] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [12:53:30] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [12:53:30] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 
[12:53:56] (03PS1) 10Muehlenhoff: Remove partman fallback for mediawiki hosts to single disk partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/429175 (https://phabricator.wikimedia.org/T106381) [12:59:17] I'm mostly afk for 2-3 hours, need to get the kids from day care today/look after them throughout the afternoon, I'll be around again later [12:59:36] wrong channel :-) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] here [13:00:24] I can SWAT today [13:00:44] Urbanecm: I'll ping you when a patch is ready at mwdebug, anything special today? :) [13:01:07] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#2345955 (10JAllemandou) +1 for weekly on wednesday. Thanks @fdans and @faidon :) [13:01:49] zeljkof, can you please push both patches directly to production? First one is just replacement of static resources, the second one is testable only to those who have "import right" [13:01:56] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1972 bytes in 0.133 second response time [13:02:09] (BTW I'll have a third patch, if possible...) [13:02:44] Urbanecm: sure, the first patch needs cache purges, right?
[13:03:00] Urbanecm: third patch should not be a problem [13:04:18] zeljkof, yes [13:04:21] (03PS1) 10Urbanecm: Change chapcomwiki's logo, add HD logo for chapcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429178 (https://phabricator.wikimedia.org/T193024) [13:04:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] "Any special reason why CirrusSearch uses the full /bigdata/namespace/… URL while WikibaseQualityConstraints uses the /sparql shortcut? :)" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [13:05:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428953 (https://phabricator.wikimedia.org/T193028) (owner: 10Urbanecm) [13:05:49] zeljkof, added to the calendar. [13:06:04] Urbanecm: ok, I see it [13:06:20] ok [13:06:53] (03Merged) 10jenkins-bot: Fix pixelization of new wiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428953 (https://phabricator.wikimedia.org/T193028) (owner: 10Urbanecm) [13:09:07] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:428953|Fix pixelization of new wiki logos (T193028)]] (duration: 01m 17s) [13:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:14] T193028: Logos for new wikis are pixelized - https://phabricator.wikimedia.org/T193028 [13:10:43] Urbanecm: 428953 is deployed, you should see the updates at mwdebug, purging cache [13:11:03] (03CR) 10jenkins-bot: Fix pixelization of new wiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428953 (https://phabricator.wikimedia.org/T193028) (owner: 10Urbanecm) [13:11:10] ok [13:13:03] Urbanecm: cache purged, please check [13:13:21] Working, thanks [13:14:04] (03PS1) 10Jcrespo: mariadb: Switch db1086 row format to statement [puppet] - 10https://gerrit.wikimedia.org/r/429182 (https://phabricator.wikimedia.org/T192979) [13:14:10] (03PS3) 10Zfilipin: Add all 
Hindi projects plus meta as import sources for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428952 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [13:14:59] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428952 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [13:16:27] (03Merged) 10jenkins-bot: Add all Hindi projects plus meta as import sources for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428952 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [13:17:14] (03CR) 10jenkins-bot: Add all Hindi projects plus meta as import sources for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428952 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [13:17:21] (03CR) 10Jcrespo: [C: 032] mariadb: Switch db1086 row format to statement [puppet] - 10https://gerrit.wikimedia.org/r/429182 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [13:18:02] (03PS2) 10Zfilipin: Change chapcomwiki's logo, add HD logo for chapcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429178 (https://phabricator.wikimedia.org/T193024) (owner: 10Urbanecm) [13:19:09] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:428952|Add all Hindi projects plus meta as import sources for hiwikimedia (T188366)]] (duration: 01m 17s) [13:19:12] Urbanecm: 428952 is deployed, but nothing you can do, right? [13:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:16] T188366: Create Hindi Wikimedian User Group Site - https://phabricator.wikimedia.org/T188366 [13:19:25] Yes, I'm not an sysop/steward. [13:19:47] Urbanecm: 429178 is testable at mwdebug, or deploying directly? 
[13:20:16] Please deploy directly as well [13:20:29] Urbanecm: ok, will ping you when deployed and cache purged [13:20:33] ack [13:22:40] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429178 (https://phabricator.wikimedia.org/T193024) (owner: 10Urbanecm) [13:22:54] Urbanecm: argh, [13:22:57] forgot [13:23:00] what happened? [13:23:04] argh, new keyboard :D [13:23:16] space and enter very near :D [13:23:29] anyway, I forgot about the new rule for deployments, and merged 429178 [13:23:44] it should not be deployed by new rules, it would have to be split into two commits [13:23:45] I'm not aware about a new rule [13:23:48] let me find the diff [13:24:16] Urbanecm: https://wikitech.wikimedia.org/w/index.php?title=SWAT_deploys&type=revision&diff=1789212&oldid=1777024 [13:24:18] (03PS1) 10Jcrespo: mariadb: Switch db1086 row format to statement, this time for real [puppet] - 10https://gerrit.wikimedia.org/r/429184 [13:24:38] no problem for this patch, and it's my mistake for merging it, but for future reference [13:24:41] (03Merged) 10jenkins-bot: Change chapcomwiki's logo, add HD logo for chapcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429178 (https://phabricator.wikimedia.org/T193024) (owner: 10Urbanecm) [13:24:42] Sure [13:24:54] (03PS2) 10Jcrespo: mariadb: Switch db1086 row format to statement, this time for real [puppet] - 10https://gerrit.wikimedia.org/r/429184 [13:24:58] (03CR) 10jenkins-bot: Change chapcomwiki's logo, add HD logo for chapcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429178 (https://phabricator.wikimedia.org/T193024) (owner: 10Urbanecm) [13:25:01] Urbanecm: see T187761 for info [13:25:01] T187761: Proposal: Effective immediately, disallow multi-sync patch deployment - https://phabricator.wikimedia.org/T187761 [13:25:13] Well...it can be single-sync [13:25:15] scap sync :D [13:25:27] Will have a look [13:25:32] well, for this kind of patch, it's an overkill [13:25:44] It 
is, but it will give you just one sync :) [13:25:47] and I am aware of dependencies, so no problem [13:26:12] What should be the patch-size in the future? Upload static files in one patch and change IS.php in another patch (depending on the first one)? [13:26:22] Urbanecm: yes [13:26:51] Well...at least now I really cannot see the rationale, but I'll read the task and ask later if I'll have questions [13:27:01] the point is one sync per patch, since that is how our CI tests changes, we had some problems with patches that require multiple syncs [13:27:33] (03CR) 10Jcrespo: [C: 032] mariadb: Switch db1086 row format to statement, this time for real [puppet] - 10https://gerrit.wikimedia.org/r/429184 (owner: 10Jcrespo) [13:27:38] comment on the task if you have any questions, I think one of your patches (similar to this one) is an example there [13:29:04] saw it :) [13:30:10] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:429178|Change chapcomwikis logo, add HD logo for chapcomwiki (T193024)]] (duration: 01m 16s) [13:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:16] T193024: Change AffCom Wiki logo - https://phabricator.wikimedia.org/T193024 [13:31:37] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:429178|Change chapcomwikis logo, add HD logo for chapcomwiki (T193024)]] (duration: 01m 16s) [13:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:06] Urbanecm: 429178 deployed, cache purged, please check and thanks for deploying with #releng again ;) [13:34:20] One of weeks when I'm using every EU SWAT :D [13:34:22] will do [13:34:44] that is worth a t-shirt :D [13:35:05] I did not break wikipedia, but I have tried [13:35:07] ;) [13:35:16] !log EU SWAT finished [13:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:32] Working, thanks for your deployment [13:35:57] In fact, the SWAT process 
didn't do what is required from it [13:36:18] "SWAT (...) is responsible for breaking the site on a regular basis" [13:36:22] :D [13:36:34] we do our best, but the tooling these days... [13:36:42] does not let you shoot yourself in the foot [13:37:44] Well...delete /srv/mediawiki-staging/wmf-config/InitialiseSettings.php and I'm sure something will be broken :D [13:38:24] (of course with a sync :)) [13:41:56] can I work with mediawiki config, right? [13:42:01] *I can [13:46:34] (03PS1) 10Jcrespo: mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) [13:47:43] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [13:48:07] (03PS2) 10Jcrespo: mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) [13:51:31] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [13:51:33] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/429175 (https://phabricator.wikimedia.org/T106381) (owner: 10Muehlenhoff) [13:53:45] (03Merged) 10jenkins-bot: mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [13:55:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1069, repool db1086 (duration: 01m 16s) [13:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:17] !log Compress enwiki on db1116:3311 - T190704 [13:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:23] T190704: Convert all sanitarium hosts to
multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [13:58:26] (03PS4) 10Fdans: Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - 10https://gerrit.wikimedia.org/r/428390 [13:59:48] (03CR) 10jenkins-bot: mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [14:05:13] (03PS1) 10ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) [14:05:53] (03CR) 10jerkins-bot: [V: 04-1] pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: 10ArielGlenn) [14:06:37] (03CR) 10Ottomata: "I also notice that throughout the script and puppet class, you refer to file paths as "DIR" "LOCATION" and "ROUTE". Let's be consistent! 
" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428390 (owner: 10Fdans) [14:20:00] fdans: for fun you can check out this old test someone made of the maxmind database and its accuracy: https://meta.wikimedia.org/wiki/MaxMindCityTesting [14:20:08] if you want to *really* have fun, you can update the results [14:22:21] milimetric: wow that really does sound like fun [14:23:40] :) [14:25:02] !log stop db1069 for cloning it away [14:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:30] milimetric: with our world wide coverage, maybe one day we will end up building our own geoip database :] [14:26:07] hashar: while that'd be cool, I'm not sure we have quite enough coverage :) [14:26:07] !log Running populateRevisionLength.php on group 1 for T192189 [14:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:14] T192189: RevisionArchiveRecord incorrectly changes null ar_len to 0 - https://phabricator.wikimedia.org/T192189 [14:27:12] anomie: how long was ar_len set incorrectly for? And is this script updating it for all history when you're done? [14:30:38] milimetric: rev_len was set incorrectly for undeletions that happened between 1.31.0-wmf.23 and 1.31.0-wmf.30 (fixed in 1.32.0-wmf.1). Subsequent deletions may have copied the error to ar_len, and moves or other things that copy the existing revision row might have copied the incorrect value. This is updating the whole history, yes. [14:31:13] great, thanks anomie [14:31:59] milimetric: ... Clarification: that's only undeletions of old revisions where ar_len was NULL, not all undeletions. [14:32:35] And this run will also be populating all those old revisions where rev_len or ar_len is null. 
[14:33:43] that's ok, I was checking to see how it would affect stats, and it's a relatively short period of time, and relatively small set of articles impacted, so while numbers will change I don't think it'll shift the overall metrics too much [14:37:51] (03CR) 10Eevans: "> IMHO it would be simpler to set the default in cassandra::sysctl to" [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [14:51:47] !log ppchelko@tin Started deploy [changeprop/deploy@f2f7a84]: Commit offsets for non matched messages from time to time. [14:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:14] !log ppchelko@tin Finished deploy [changeprop/deploy@f2f7a84]: Commit offsets for non matched messages from time to time. (duration: 01m 26s) [14:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:38] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:02:48] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:02:59] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:03:09] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:03:18] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:03:28] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:03:49] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:09:24] (03PS1) 10Arturo Borrero Gonzalez: labs_bootstrapvz: firstboot.sh: bring back some resolv.conf magic [puppet] - 10https://gerrit.wikimedia.org/r/429211 (https://phabricator.wikimedia.org/T181523) [15:09:33] (03PS1) 10Jcrespo: mariadb: Add db1090:s7 to configuration [puppet] - 10https://gerrit.wikimedia.org/r/429212 (https://phabricator.wikimedia.org/T192979) 
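(The recurring stat1005 alerts in this log — "Return code of 255 is out of bounds" — reflect the Nagios/NRPE plugin convention: only exit codes 0–3 are valid service states, and 255 usually means the check could not run at all, e.g. a failed remote execution, rather than a real disk/RAID problem. That is why every check on the host flips CRITICAL and later recovers together. A sketch of the mapping, with the out-of-bounds message mirroring the log text:)

```python
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def interpret(exit_code):
    # Plugin exit codes 0-3 map to service states; anything else is not a
    # real check result and is reported as out of bounds (treated as
    # CRITICAL), as seen for stat1005 above.
    if exit_code in NAGIOS_STATES:
        return NAGIOS_STATES[exit_code]
    return "CRITICAL: Return code of %d is out of bounds" % exit_code

print(interpret(255))  # CRITICAL: Return code of 255 is out of bounds
```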
[15:10:49] (03PS1) 10Jcrespo: dbhosts: Add db1090:s7 to configuration [software] - 10https://gerrit.wikimedia.org/r/429213 [15:13:10] (03CR) 10Marostegui: [C: 031] mariadb: Add db1090:s7 to configuration [puppet] - 10https://gerrit.wikimedia.org/r/429212 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:13:22] (03CR) 10Marostegui: [C: 031] dbhosts: Add db1090:s7 to configuration [software] - 10https://gerrit.wikimedia.org/r/429213 (owner: 10Jcrespo) [15:14:05] (03PS1) 10Jcrespo: mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) [15:16:21] (03CR) 10Marostegui: mariadb: Change db1090 to be a multiinstance host for s2 and s7 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:17:36] (03CR) 10Jcrespo: "I had no plans to do the repooling yet, but I agree better to add it now (even depooled) to avoid accidents." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:18:37] !log added LDAP user tschumann to "nda" group (T192549) [15:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:44] T192549: LDAP access for group 'nda' for Tobias Schumann (WMDE) - https://phabricator.wikimedia.org/T192549 [15:21:45] (03PS2) 10Jcrespo: mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) [15:23:06] (03CR) 10Jcrespo: "^ping" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:23:19] (03CR) 10Marostegui: mariadb: Change db1090 to be a multiinstance host for s2 and s7 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:25:28] RECOVERY - Disk space on stat1005 is OK: DISK OK [15:25:28] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [15:25:30] (03PS3) 10Jcrespo: mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) [15:25:39] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [15:25:48] (03CR) 10Marostegui: [C: 031] mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:25:49] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [15:25:55] (03CR) 10Jcrespo: "Not a pain, I think it is very useful. I may have asked you something similar in the past." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:26:09] RECOVERY - DPKG on stat1005 is OK: All packages OK [15:26:18] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [15:27:33] (03CR) 10Jcrespo: [C: 032] mariadb: Add db1090:s7 to configuration [puppet] - 10https://gerrit.wikimedia.org/r/429212 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:28:07] (03CR) 10Jcrespo: [C: 032] dbhosts: Add db1090:s7 to configuration [software] - 10https://gerrit.wikimedia.org/r/429213 (owner: 10Jcrespo) [15:28:28] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:29:07] (03PS1) 10Chad: group2 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429216 [15:29:28] (03CR) 10Chad: [C: 04-2] "for later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429216 (owner: 10Chad) [15:30:02] (03CR) 10Jcrespo: [C: 032] mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:30:11] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4161601 (10Lea_WMDE) [15:31:37] (03Merged) 10jenkins-bot: mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:31:53] (03CR) 10jenkins-bot: mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:33:24] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF 
production cluster - https://phabricator.wikimedia.org/T190717#4161611 (10Lea_WMDE) @MoritzMuehlenhoff as discussed I'm checking in at the end of April :) Is there any news about the wikidiff2 update sche... [15:33:33] jynus: marostegui: can i steal tin for a mwconfig deploy from you for 10-15 mins? [15:33:49] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Thu 2018-04-26 15:33:47 UTC. [15:33:50] can you wait 3 minutes? [15:33:55] I just merged a change [15:34:31] sure sure [15:34:39] just ping me when you are done [15:35:19] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:19] !log jynus@tin Synchronized wmf-config/db-codfw.php: Add db1090 as multiinstance (duration: 01m 17s) [15:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:50] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Add db1090 as multiinstance (duration: 01m 16s) [15:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:06] (03PS7) 10Mobrovac: Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:40:06] there could be issues? [15:40:23] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:41:28] but it is not the deploy [15:41:43] there are issues with refreshcount deadlocks [15:41:55] (03CR) 10Mobrovac: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:42:31] mobrovac: you can continue [15:42:46] kk thnx jynus [15:46:31] (03CR) 10Mobrovac: [C: 032] Disable Redis queue for most of jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:48:19] (03PS1) 10Ottomata: Temporarly remove partman recipe for kafka main hosts [puppet] - 10https://gerrit.wikimedia.org/r/429218 (https://phabricator.wikimedia.org/T192832) [15:49:19] (03CR) 10Ottomata: [C: 032] Temporarly remove partman recipe for kafka main hosts [puppet] - 10https://gerrit.wikimedia.org/r/429218 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [15:49:50] 10Operations, 10fundraising-tech-ops, 10netops: NAT for new fundraising bastion - https://phabricator.wikimedia.org/T193177#4161644 (10cwdent) [15:51:17] 10Operations, 10fundraising-tech-ops, 10netops: NAT for new fundraising bastion - https://phabricator.wikimedia.org/T193177#4161671 (10cwdent) [15:51:30] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:54:22] (03CR) 10Mobrovac: [C: 032] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:55:39] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:55:59] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:56:40] (03PS3) 10Filippo Giunchedi: profile: install SMART checks after 'raid' fact is available. 
[puppet] - 10https://gerrit.wikimedia.org/r/428947 (https://phabricator.wikimedia.org/T132324) [15:56:42] (03PS1) 10Filippo Giunchedi: memcached: deprecate Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) [15:56:47] (03PS1) 10Filippo Giunchedi: elasticsearch: deprecate Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/429222 (https://phabricator.wikimedia.org/T183454) [15:56:49] (03PS1) 10Filippo Giunchedi: ores: deprecate Diamond redis collector [puppet] - 10https://gerrit.wikimedia.org/r/429223 (https://phabricator.wikimedia.org/T183454) [15:56:54] (03PS1) 10Filippo Giunchedi: Deprecate Diamond pdns collectors [puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) [15:56:56] let's see how many -1s [15:56:58] (03PS1) 10Filippo Giunchedi: Deprecate Diamond tcpconnstate and nfconntrackcount [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) [15:57:24] (03CR) 10jerkins-bot: [V: 04-1] memcached: deprecate Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:57:36] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: deprecate Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/429222 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:58:06] (03CR) 10jerkins-bot: [V: 04-1] Deprecate Diamond pdns collectors [puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:58:11] (03CR) 10Awight: [C: 031] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/429223 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:58:40] hashar: what's wrong with jenkins? 
- https://integration.wikimedia.org/ci/job/operations-mw-config-typos/19085/console [15:58:53] i keep getting that for different tests [15:59:26] (03CR) 10Mobrovac: [V: 032 C: 032] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [16:00:04] godog, moritzm, and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1600). [16:00:04] thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:51] (03CR) 10jenkins-bot: Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [16:01:42] * thcipriani waves [16:01:48] I am around [16:02:04] is it as trivial as it looks? [16:02:10] !log ppchelko@tin Started deploy [cpjobqueue/deploy@bf34e00]: Enable all jobs for test, test2, testwikidata and mediawiki. T190327 [16:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:16] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [16:02:23] jynus: please wait a sec, still syncing [16:02:33] 30 seconds, to be precise [16:02:50] jynus: indeed it should be :) it's already on beta installs a program and modifies a config value that isn't used in prod yet. [16:03:02] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@bf34e00]: Enable all jobs for test, test2, testwikidata and mediawiki. 
T190327 (duration: 00m 51s) [16:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:10] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: JobQueue: Use EventBus for most jobs for test wikis - T190327 (duration: 01m 15s) [16:03:12] we will wait for mobrovac to finish using scap, ok? [16:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:27] sounds good to me :) [16:03:31] (03PS2) 10Filippo Giunchedi: memcached: deprecate Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) [16:03:33] (03PS2) 10Filippo Giunchedi: elasticsearch: deprecate Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/429222 (https://phabricator.wikimedia.org/T183454) [16:03:35] (03PS2) 10Filippo Giunchedi: ores: deprecate Diamond redis collector [puppet] - 10https://gerrit.wikimedia.org/r/429223 (https://phabricator.wikimedia.org/T183454) [16:03:37] (03PS2) 10Filippo Giunchedi: Deprecate Diamond pdns collectors [puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) [16:03:39] (03PS2) 10Filippo Giunchedi: Deprecate Diamond tcpconnstate and nfconntrackcount [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) [16:03:40] thcipriani: I guess it will touch mainly terbium and the other passive deployment hosts, right? [16:03:47] ok i'm done, jynus you are good to go [16:03:56] (03PS2) 10ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) [16:04:40] (03PS2) 10Jcrespo: Scap: MediaWiki Canary: setup swagger checks [puppet] - 10https://gerrit.wikimedia.org/r/428721 (https://phabricator.wikimedia.org/T136839) (owner: 10Thcipriani) [16:05:37] jynus: it will update the scap.cfg on many hosts, but that should mostly be a no-op. 
The only place it installs new software should be on tin, naos, and the new deployment machine, deployment1001 [16:05:44] s/mostly// [16:05:45] (03CR) 10jerkins-bot: [V: 04-1] pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: 10ArielGlenn) [16:06:03] sorry, I actually meant tin [16:06:14] yes [16:06:20] (03CR) 10Mobrovac: "Ok, there are two things here, so I'll try to address them separately." [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [16:06:28] my mind is distracted [16:06:57] (03CR) 10Jcrespo: [C: 032] Scap: MediaWiki Canary: setup swagger checks [puppet] - 10https://gerrit.wikimedia.org/r/428721 (https://phabricator.wikimedia.org/T136839) (owner: 10Thcipriani) [16:08:05] running puppet on tin [16:08:09] 10Operations, 10fundraising-tech-ops, 10netops: NAT for new fundraising bastion - https://phabricator.wikimedia.org/T193177#4161775 (10ayounsi) a:03ayounsi ```lang=diff [edit security nat static rule-set static-nat] + rule frbast1001 { + match { + destination-address 208.80.155.8... [16:08:20] 10Operations, 10fundraising-tech-ops, 10netops: NAT for new fundraising bastion - https://phabricator.wikimedia.org/T193177#4161777 (10ayounsi) 05Open>03Resolved [16:08:51] thcipriani: for testing, do we deploy something? [16:10:48] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4161795 (10BBlack) Note this will involve a planned ulsfo site outage, with its traffic falling back to codfw. If things go well the outage should be brief, the 5h estimate above is... 
[16:10:51] (03PS1) 10Volans: wmf-auto-reimage: verify BIOS boot parameters [puppet] - 10https://gerrit.wikimedia.org/r/429229 [16:10:53] (03PS1) 10Volans: wmf-auto-reimage: allow to mask systemd services [puppet] - 10https://gerrit.wikimedia.org/r/429230 [16:11:07] (03PS1) 10Jcrespo: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 [16:11:40] (03CR) 10Gehel: [C: 031] "All good, we're not using those metrics directly anymore." [puppet] - 10https://gerrit.wikimedia.org/r/429222 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [16:12:34] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 (owner: 10Jcrespo) [16:12:38] PROBLEM - mediawiki-installation DSH group on mw1229 is CRITICAL: Host mw1229 is not in mediawiki-installation dsh group [16:12:45] jynus: currently scap doesn't use this, so just making sure that service-checker-swagger was installed is mostly the test :) [16:12:51] ah, ok [16:12:59] (03CR) 10Chad: Add gerrit.wmfusercontent.org DNS entry (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [16:13:19] let me do the dummy deploy anyway to test there was no regression :-) [16:13:43] sure thing, never a bad plan :) [16:14:19] (03PS3) 10Eevans: cassandra: increase `vm.max_map_count` to 1048575 [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) [16:15:00] has anyone ever seen the CI error "stderr: error: unable to write file wmf-config/wikitech.php" [16:18:25] (03PS2) 10Jcrespo: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 [16:19:38] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 (owner: 10Jcrespo) [16:21:40] (03PS3) 10Jcrespo: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 
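(The "MediaWiki Canary: setup swagger checks" patch being merged and smoke-tested above wires endpoint checks into the canary step. In outline, such a checker walks the paths declared in a service's swagger spec and fails the deploy if any endpoint misbehaves. The sketch below illustrates that idea with a stubbed fetcher; it is not service-checker-swagger's actual code, and the spec/paths are invented.)

```python
def check_endpoints(spec, fetch):
    # Walk every path declared in the swagger spec, hit it via `fetch`
    # (a callable path -> HTTP status), and collect non-2xx results.
    # A canary deploy would abort if this list is non-empty.
    failures = []
    for path in spec.get("paths", {}):
        status = fetch(path)
        if not 200 <= status < 300:
            failures.append((path, status))
    return failures

spec = {"paths": {"/healthz": {}, "/spec": {}}}   # invented minimal spec
responses = {"/healthz": 200, "/spec": 503}       # stubbed canary answers
bad = check_endpoints(spec, responses.get)        # [("/spec", 503)]
```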
[16:22:56] (03CR) 10Dzahn: "hmm.. i tend to say let's use misc varnish because phab.wmfusercontent.org does as well" [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [16:23:05] (03PS4) 10Jcrespo: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 [16:23:16] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4161855 (10Krinkle) The currently known run-time issues with MediaWiki on PHP7 and/or HHVM have been fixed (mainly T184854)... [16:23:43] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4161858 (10Krinkle) I've put a straw-man up at T176370#4161855. [16:24:19] showing https://integration.wikimedia.org/ci/job/operations-mw-config-typos/19088/console for "line is too long" [16:24:25] is a bit misleading [16:25:05] (03CR) 10Jcrespo: [C: 032] mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 (owner: 10Jcrespo) [16:26:27] (03Merged) 10jenkins-bot: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 (owner: 10Jcrespo) [16:26:53] (03PS5) 10Andrew Bogott: Don't include mediawiki::multimedia on labweb* [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [16:28:52] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Fix comment, test scap (duration: 01m 12s) [16:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:55] (03CR) 10jenkins-bot: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 (owner: 10Jcrespo) [16:31:26] (03PS1) 10EBernhardson: Lower CirrusSearch delayed job drop to 2 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429233 [16:33:12] 10Operations, 10netops:
ulsfo<->eqord BGP down - https://phabricator.wikimedia.org/T192114#4161917 (10ayounsi) 05Open>03Resolved TTL fixed. Sessions up. [16:35:18] (03CR) 10Andrew Bogott: [C: 032] Don't include mediawiki::multimedia on labweb* [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [16:36:09] (03CR) 10Andrew Bogott: [C: 032] "Thanks for your patience with this special case" [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [16:38:47] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [16:39:22] (03CR) 10Chad: "My only concern is that we'd also be exposing Gerrit itself, not just the non-proxied webroot stuff." [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [16:40:57] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:41:08] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:42:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:42:57] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [16:43:57] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:45:08] RECOVERY - HTTP availability for Varnish on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:46:35] (03PS2) 10Muehlenhoff: Remove obsolete mediawiki::packages::fonts from mediawiki::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/428886 [16:50:33] (03PS2) 10Herron: scap::target: List allowed service commands, instead of wildcard [puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [16:51:07] (03CR) 10jerkins-bot: [V: 04-1] scap::target: List allowed service commands, instead of wildcard [puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [16:51:10] (03PS3) 10Herron: scap::target: List allowed service commands, instead of wildcard [puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [16:51:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:51:57] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [16:57:34] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete mediawiki::packages::fonts from mediawiki::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/428886 (owner: 10Muehlenhoff) [16:57:40] (03CR) 10Herron: [C: 031] "Added a few more commands. Seems ok to me but would like feedback from RelEng before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1700). [17:00:14] no parsoid deploy today [17:00:21] ORES has a slightly exciting bunch of work to deploy. [17:00:49] !log installing systemd SUA update for stretch [17:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:20] (03CR) 10DCausse: [C: 031] Lower CirrusSearch delayed job drop to 2 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429233 (owner: 10EBernhardson) [17:04:12] (03PS1) 10Andrew Bogott: bootstrap_vz: re-order the ldap phases [puppet] - 10https://gerrit.wikimedia.org/r/429239 [17:04:49] (03PS2) 10Ottomata: Enable eventbus Kafka producer snappy compression [puppet] - 10https://gerrit.wikimedia.org/r/429007 (https://phabricator.wikimedia.org/T193080) [17:05:54] (03CR) 10Muehlenhoff: "When it's all standardised to a common recipe, we can simply apply this to mw[12]*" [puppet] - 10https://gerrit.wikimedia.org/r/429175 (https://phabricator.wikimedia.org/T106381) (owner: 10Muehlenhoff) [17:05:57] !log awight@tin Started deploy [ores/deploy@5b27205]: ORES: update to revscoring 2.2.2, T192917 [17:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:03] T192917: Rebuild all models for revscoring 2.2.2 - https://phabricator.wikimedia.org/T192917 [17:06:12] (03PS1) 10Herron: rsyslog: send auth,authpriv.* to central log hosts [puppet] - 10https://gerrit.wikimedia.org/r/429240 [17:06:18] (03CR) 10Ottomata: [C: 032] Enable eventbus Kafka producer snappy compression [puppet] - 10https://gerrit.wikimedia.org/r/429007 (https://phabricator.wikimedia.org/T193080) (owner: 10Ottomata) [17:06:22] 10Operations, 10CirrusSearch, 10Discovery, 10Search-Platform-Programs, 
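The patch merged above ("Enable eventbus Kafka producer snappy compression", gerrit 429007) and the later `!log` entry amount to a single client-side Kafka producer setting. A rough illustration of what such a producer config looks like — the option names follow common Kafka producer config conventions, and the broker hostname is a placeholder, not the actual eventbus configuration:

```python
# Illustrative only: the real change lives in puppet (gerrit change 429007).
# Option names mirror standard Kafka producer configs; the broker host below
# is a placeholder.
def eventbus_producer_config(compress=True):
    config = {
        "bootstrap.servers": "kafka1001.eqiad.wmnet:9092",  # placeholder broker
        "acks": "all",  # assumed durability setting
    }
    if compress:
        # The setting enabled by the patch above: message batches are
        # snappy-compressed client-side before being sent to the brokers.
        config["compression.type"] = "snappy"
    return config

print(eventbus_producer_config()["compression.type"])  # → snappy
```

Because compression is negotiated per producer, enabling it needs no broker-side change — only the service restart noted in the log.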
10Discovery-Search (Current work): Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4162020 (10EBjune) [17:06:35] (03PS2) 10Andrew Bogott: bootstrap_vz: re-order the ldap phases [puppet] - 10https://gerrit.wikimedia.org/r/429239 [17:07:10] (03CR) 10Andrew Bogott: [C: 032] bootstrap_vz: re-order the ldap phases [puppet] - 10https://gerrit.wikimedia.org/r/429239 (owner: 10Andrew Bogott) [17:07:44] 10Operations, 10CirrusSearch, 10Discovery, 10Search-Platform-Programs, 10Discovery-Search (Current work): Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4162026 (10debt) p:05Unbreak!>03High [17:07:48] (03PS3) 10Muehlenhoff: Remove obsolete mediawiki multimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/428634 [17:08:52] (03CR) 10Herron: [C: 04-2] "Needs discussion before merging" [puppet] - 10https://gerrit.wikimedia.org/r/429240 (owner: 10Herron) [17:09:09] !log applying compression_type=snappy to eventbus service kafka producer [17:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:09] PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [17:10:32] mutante: Heads-up that the ORES venv patch is currently landing on production. When the deployment is finished, it would be great if you could rm -rf the old directory, if you have the time. [17:11:40] awight: is it time-sensitive? i am happy to do that but i have to go afk for like.. maybe 45min [17:13:45] mutante: No, there’s no rush. I wanted you to be aware of the change in general, but the cleanup can happen any time, it’s just for sanity and not anything functional. [17:14:04] Thanks for the help on beta! 
[17:15:01] (03CR) 10Mobrovac: [C: 031] cassandra: increase `vm.max_map_count` to 1048575 [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [17:15:27] awight: will do! thanks [17:15:29] bbl [17:18:49] 10Operations, 10fundraising-tech-ops, 10netops: New PFW policy - https://phabricator.wikimedia.org/T193189#4162058 (10cwdent) [17:24:01] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967#4162092 (10Gehel) Removing discovery / search from this ticket, since it is really not related to search. [17:27:17] !log awight@tin Finished deploy [ores/deploy@5b27205]: ORES: update to revscoring 2.2.2, T192917 (duration: 21m 20s) [17:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:25] T192917: Rebuild all models for revscoring 2.2.2 - https://phabricator.wikimedia.org/T192917 [17:31:46] Finished deploying ORES and it looks healthy :-) [17:31:46] (03CR) 10Gehel: "minor comment inline" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429233 (owner: 10EBernhardson) [17:32:15] 10Operations, 10Puppet, 10Analytics, 10Cassandra, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10Eevans) The restbase cluster has been upgraded package-wise, but a rolling restart still needs to be scheduled. [17:32:32] (03CR) 10Muehlenhoff: [C: 031] "Looks good, thanks. I'll smoketest that with a job runner reimage next week." [puppet] - 10https://gerrit.wikimedia.org/r/429230 (owner: 10Volans) [17:34:22] 10Operations, 10Puppet, 10Analytics, 10Cassandra, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10MoritzMuehlenhoff) >>! In T192948#4162125, @Eevans wrote: > The restbase cluster has been upgraded package-wise, but a rolling rest... 
[17:35:09] RECOVERY - puppet last run on mw2286 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:36:29] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4162142 (10MarcoAurelio) Logs are also quite heavy: ``` maurelio@deployment-cpjobqueue:/srv/log/cpjobqueue$ sudo ls -lash * 11G -rw-r--r-- 1 cpjobqueue cpjobqueue 11G Apr 25 00:57 main... [17:37:53] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4162148 (10awight) [17:40:16] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [17:40:40] <_joe_> uhm [17:41:06] checking [17:41:09] what's up? [17:41:23] I have the same question [17:41:29] just got the page, the hdfs namenode went down apparently [17:41:36] it should be failed over to an1002 in theory [17:41:40] going to check and report back [17:45:59] we are discussing in #analytics what to do, but basically it seems that there was a problem with the journal nodes and the hdfs namenode on an1001 decided to shutdown [17:47:51] (03Abandoned) 10Chad: Move wiktionary and foundationwiki docroots to standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/402090 (https://phabricator.wikimedia.org/T126306) (owner: 10Chad) [17:48:26] (03PS3) 10Chad: Swap mediawiki.org to use standard docroot naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/421949 [17:50:13] (03PS4) 10Chad: Gerrit: Move all logging to /var/log [puppet] - 10https://gerrit.wikimedia.org/r/423794 [17:50:23] (03CR) 10Paladox: "I wonder would this still work it being behind varnish?" 
[dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [17:51:36] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4162202 (10MarcoAurelio) 05Open>03Resolved a:03mobrovac Fixed by @mobrovac. Thanks. [17:51:45] Ack thanks for the update elukey [17:53:52] (03CR) 10Smalyshev: "> Patch Set 2: Code-Review+1" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [17:54:19] (03CR) 10Paladox: "> My only concern is that we'd also be exposing Gerrit itself, not" [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [17:54:38] (03CR) 10Muehlenhoff: [C: 031] "Looks good, we can give it a smoke test with mw1221 next week." [puppet] - 10https://gerrit.wikimedia.org/r/429229 (owner: 10Volans) [17:56:41] (03PS1) 10Herron: install_server: reinstall mx1001 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/429241 (https://phabricator.wikimedia.org/T175361) [17:57:44] (03PS1) 10Imarlier: webperfX001: start using the webperf role [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) [17:58:47] (03CR) 10Imarlier: "Exactly the same change as 392030, with the exception of adding to the dsh target list for the webperf group. Now safe due to no longer d" [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [17:59:15] (03PS3) 10Chad: Gerrit: Run directly from deployment location [puppet] - 10https://gerrit.wikimedia.org/r/423801 [17:59:33] 10Puppet, 10Analytics, 10Beta-Cluster-Infrastructure, 10ChangeProp, and 3 others: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4162216 (10mobrovac) [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. 
Time for a Morning SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:05:47] (03PS1) 10Muehlenhoff: debdeploy-deploy: Sort modified packages [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/429244 [18:07:48] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [18:09:11] ^ argon is fine, systemd update logged above [18:09:26] RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [18:10:31] 10Operations, 10fundraising-tech-ops, 10netops: New PFW policy - https://phabricator.wikimedia.org/T193189#4162232 (10ayounsi) 05Open>03Resolved a:03ayounsi Pushed. ``` $ nc -zv 208.80.155.8 22 Connection to 208.80.155.8 22 port [tcp/ssh] succeeded! ``` [18:12:02] !log reimaging (some?) 
kafka200* codfw main kafka nodes to stretch T192832 [18:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:09] T192832: Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832 [18:17:43] (03PS2) 10Muehlenhoff: Sort results of debdeploy-deploy, debdeploy-restarts and debdeploy-pkgversion [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/429244 [18:17:48] (03PS1) 10ArielGlenn: generate checksums on a per job basis, updating the hash as needed [dumps] - 10https://gerrit.wikimedia.org/r/429245 [18:18:29] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete mediawiki multimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/428634 (owner: 10Muehlenhoff) [18:23:47] so analytics1001 is back on track [18:23:59] we are not super sure of what happened to the journal nodes [18:24:04] but we are going to investigate it [18:24:14] also the page-all for those hosts is probably not great [18:24:20] so we'll remove the critical [18:27:56] (03PS1) 10Elukey: profile::hadoop::master: avoid paging all for process down [puppet] - 10https://gerrit.wikimedia.org/r/429251 [18:28:22] ottomata: --^ [18:29:14] (03CR) 10Elukey: [C: 032] profile::hadoop::master: avoid paging all for process down [puppet] - 10https://gerrit.wikimedia.org/r/429251 (owner: 10Elukey) [18:29:49] (03PS1) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 [18:30:32] (03CR) 10jerkins-bot: [V: 04-1] Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (owner: 10Imarlier) [18:36:18] (03PS2) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 [18:36:38] (03PS1) 10Cmjohnson: adding dhcpd and netboot.cfg for lvs1016 [puppet] - 10https://gerrit.wikimedia.org/r/429254 (https://phabricator.wikimedia.org/T184293) [18:37:04] (03CR) 10jerkins-bot: [V: 04-1] Make webperf role install coal things [puppet] - 
10https://gerrit.wikimedia.org/r/429252 (owner: 10Imarlier) [18:37:12] (03PS3) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 [18:37:47] (03CR) 10jerkins-bot: [V: 04-1] Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (owner: 10Imarlier) [18:37:52] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:38:39] (03PS4) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 [18:39:13] (03CR) 10jerkins-bot: [V: 04-1] Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (owner: 10Imarlier) [18:39:52] (03PS5) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [18:41:59] (03PS2) 10Cmjohnson: adding dhcpd and netboot.cfg for lvs1016 [puppet] - 10https://gerrit.wikimedia.org/r/429254 (https://phabricator.wikimedia.org/T184293) [18:42:40] (03CR) 10Cmjohnson: [C: 032] adding dhcpd and netboot.cfg for lvs1016 [puppet] - 10https://gerrit.wikimedia.org/r/429254 (https://phabricator.wikimedia.org/T184293) (owner: 10Cmjohnson) [18:44:56] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4162375 (10Cmjohnson) @ayounsi Can you create a subnet for LVS for row D please. 
[18:45:28] (03PS6) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [18:51:24] (03PS7) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [18:56:41] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 extra NIC connections - https://phabricator.wikimedia.org/T193196#4162403 (10chasemp) [18:56:56] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 extra NIC connections - https://phabricator.wikimedia.org/T193196#4162415 (10chasemp) p:05Triage>03Normal [18:58:59] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 extra NIC connections - https://phabricator.wikimedia.org/T193196#4162417 (10chasemp) [18:59:04] (03PS8) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [19:00:04] no_justification: Time to snap out of that daydream and deploy MediaWiki train. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1900). [19:03:37] (03PS9) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [19:05:27] ^ misread that as cool things [19:06:54] (03CR) 10Dzahn: "if the concern is exposing /r/, wouldn't that be the same whether we serve it directly or cache it?" 
[dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [19:07:56] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4162432 (10chasemp) [19:08:02] (03CR) 10Chad: [C: 032] group2 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429216 (owner: 10Chad) [19:08:09] (03PS1) 10Muehlenhoff: Switch scap proxy in C6 to mw1320 [puppet] - 10https://gerrit.wikimedia.org/r/429260 [19:08:25] no_justification i think for that ^^ we could resolve the security concern by not using an alias, instead defining a new virtual host for gerrit.wmfusercontent.org and getting it to look into the avatars folder we want it to. [19:09:20] (03Merged) 10jenkins-bot: group2 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429216 (owner: 10Chad) [19:09:27] I wasn't going to use an alias, but yes, you're right. [19:09:42] (03CR) 10jenkins-bot: group2 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429216 (owner: 10Chad) [19:09:46] I wasn't thinking right [19:10:04] let's lookup the ticket that set it up for phab.wmfusercontent [19:10:09] no_justification: uh. If you're gonna do .1... [19:10:11] maybe there are comments from traffic [19:10:15] https://gerrit.wikimedia.org/r/#/c/429250/ https://phabricator.wikimedia.org/T193191 [19:10:50] Which nobody tagged as blocker :( [19:11:06] Bad James_F :P [19:11:19] (he was multi tasking) [19:11:45] paladox: RT !:p [19:11:52] heh [19:11:57] paladox: on https://phabricator.wikimedia.org/rOPUP351f9c354beca351bde5436abb67e880b696e2f3 [19:12:03] the new certificate requested in RT: 8212 [19:12:14] lol [19:12:25] We could use letsencrypt for this? [19:12:44] probably, yea [19:13:01] well, it depends [19:13:06] if we serve it directly, yes [19:13:19] if we want to do it like phab, no [19:13:26] oh [19:13:27] but we already have star.wmfusercontent.org no matter what [19:13:32] so .. 
we just use that [19:13:38] *. [19:15:46] in 2014 "and i would request *.wmfusercontent.org right away, i" heh [19:15:50] it paid off i guess [19:18:45] paladox: so RT 8212 isn't in Phab because it's a procurement ticket. but that was the one to buy this cert [19:18:53] oh [19:19:35] but let's try these: RT 7483, RT 8345 [19:19:38] mutante: Can we differentiate by vhost or does it have to be by port? [19:19:47] (03PS1) 10Ottomata: Use /etc/prometheus as config_dir for kafka broker jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/429262 (https://phabricator.wikimedia.org/T192832) [19:19:59] I s'pose varnish sets the Host header. [19:21:02] (03CR) 10Ottomata: [C: 032] Use /etc/prometheus as config_dir for kafka broker jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/429262 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [19:21:09] Sorry, yeah, should have tagged it. [19:21:24] cmjohnson1: merging your lvs1016 change [19:21:33] (03PS1) 10Rush: openstack: fixup keystone logrotate [puppet] - 10https://gerrit.wikimedia.org/r/429264 (https://phabricator.wikimedia.org/T193048) [19:21:39] ottomata: thx...sorry got sidetracked [19:22:46] no_justification: by host, but there is varnish VCL code for that [19:22:56] no_justification: we should just try to copy the phab.wmfusercontent setup [19:23:00] but there is this: [19:23:07] // Block WP Zero users from accessing Phabricator uploads to prevent abuse [19:23:11] if (req.http.Host == "phab.wmfusercontent.org") { [19:23:12] .. etc [19:23:18] We won't need that on gerrit [19:23:22] Nobody can upload that here [19:23:23] (03PS2) 10Rush: openstack: fixup keystone logrotate [puppet] - 10https://gerrit.wikimedia.org/r/429264 (https://phabricator.wikimedia.org/T193048) [19:23:24] WP zero is going away this year [19:24:54] and also per no_justification [19:25:04] no_justification: if we put it behind varnish we can just use the *.wmfusercontent.org cert and done. 
if we serve it directly then that would require copying the star unified cert to the gerrit machine which we probably dont want to do [19:25:42] we could use Letsencrypt and serve it directly with some other name.. but the first option is what we got the wildcard cert for [19:26:55] we could also copy other parts, like $altdom = hiera('phabricator_altdomain', 'phab.wmfusercontent.org'), and related [19:27:30] (03PS1) 10Muehlenhoff: Stop installing oggvideotools [puppet] - 10https://gerrit.wikimedia.org/r/429265 [19:29:55] (03PS3) 10Rush: openstack: fixup keystone logrotate [puppet] - 10https://gerrit.wikimedia.org/r/429264 (https://phabricator.wikimedia.org/T193048) [19:29:56] the varnish setup in hieradata/role/common/cache/misc.yaml has a single director called 'phabricator' and 2 host names are pointing to the same directory. the backend Apache then has virtual hosts [19:30:07] s/directory/director [19:31:05] https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/cache/misc.yaml#L101 [19:37:23] mutante: Yeah so the dns bit will be fine [19:38:01] (03CR) 10Paladox: [C: 031] Add gerrit.wmfusercontent.org DNS entry [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [19:38:17] (03PS2) 10Dzahn: Add gerrit.wmfusercontent.org DNS entry [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [19:38:22] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4162491 (10MoritzMuehlenhoff) @Lea_WMDE : We're making good progress with the stretch migration, we should be good to start the wikidiff roll... [19:38:41] (03CR) 10Dzahn: [C: 032] Add gerrit.wmfusercontent.org DNS entry [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [19:40:19] we should also have a redirect like http://phab.wmfusercontent.org/ [19:41:47] but in Apache while Phab is doing it in PHP(?)
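The scheme discussed above — one varnish director for both hostnames, the backend differentiating by Host header, plus a redirect for anything outside the exposed webroot on the usercontent domain — can be sketched as a toy routing function. The hostnames come from the discussion; the `/avatars/` path and the redirect-everything-else behavior are assumptions for illustration, not the actual varnish VCL or Apache config:

```python
# Toy illustration of Host-header routing as discussed above; not actual
# varnish VCL or Apache vhost config. The /avatars/ path is an assumption.
GERRIT_CANONICAL = "https://gerrit.wikimedia.org/"

def route(host, path):
    """Return ("serve", path) or ("redirect", url) for an incoming request."""
    if host == "gerrit.wmfusercontent.org":
        # Only the non-proxied user-content webroot is meant to be exposed on
        # this domain; everything else (e.g. Gerrit itself under /r/) bounces
        # to the canonical host, addressing the exposure concern raised above.
        if path.startswith("/avatars/"):
            return ("serve", path)
        return ("redirect", GERRIT_CANONICAL)
    return ("serve", path)

print(route("gerrit.wmfusercontent.org", "/r/changes/"))
# → ('redirect', 'https://gerrit.wikimedia.org/')
```

This mirrors why the alias approach was rejected: routing must be decided per-path on the backend, not just per-hostname at the edge.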
[19:43:03] yep [19:45:55] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#4162505 (10chasemp) 05Open>03Resolved These are now debian jessie shoutout to @robh for helping me work through some install issues :) [19:46:10] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): tools-k8s-master-01 has two floating IPs - https://phabricator.wikimedia.org/T164123#4162508 (10chasemp) a:05chasemp>03None [19:46:12] could be: [19:46:17] RewriteCond %{HTTP_HOST} gerrit.wmfusercontent.org$ [19:46:18] RewriteRule (.*) https://gerrit.wikimedia.org/ [P] [19:46:22] no_justification mutante ^^ [19:47:54] No.... [19:48:02] I'll do it later [19:48:33] ok [19:56:30] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4162517 (10bd808) a:05madhuvishy>03None [20:09:28] !log demon@tin rebuilt and synchronized wikiversions files: group2 to wmf.1 [20:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:57] (03CR) 10Brion VIBBER: [C: 031] "No longer in use since thumbnailing moved to thumbor (and falls back to ffmpeg anyway)" [puppet] - 10https://gerrit.wikimedia.org/r/429265 (owner: 10Muehlenhoff) [20:29:31] !log contint1001: cleaned up old Docker images produced by docker-pkg [20:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:40] (03PS1) 10Andrew Bogott: bootstrap-vz: rearrange nscd/nslcd refreshes [puppet] - 10https://gerrit.wikimedia.org/r/429338 [20:34:29] (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz: rearrange nscd/nslcd refreshes [puppet] - 10https://gerrit.wikimedia.org/r/429338 (owner: 10Andrew Bogott) [20:38:44] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: 1.662e+05 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:38:54] PROBLEM - 
kubelet operational latencies on kubernetes1001 is CRITICAL: 7.425e+04 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:39:04] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: 6.633e+04 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:39:44] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: (C)1.5e+04 ge (W)1e+04 ge 4821 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:39:54] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: (C)1.5e+04 ge (W)1e+04 ge 4872 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:40:05] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: (C)1.5e+04 ge (W)1e+04 ge 4205 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:45:32] (03Draft1) 10MarcoAurelio: idwikimedia: register on DNS [dns] - 10https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) [20:45:35] (03PS2) 10MarcoAurelio: idwikimedia: register on DNS [dns] - 10https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) [20:45:44] (03Draft1) 10MarcoAurelio: idwikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/429342 (https://phabricator.wikimedia.org/T192726) [20:45:49] (03PS2) 10MarcoAurelio: idwikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/429342 (https://phabricator.wikimedia.org/T192726) [20:46:43] (03CR) 10Rush: [C: 032] openstack: fixup keystone logrotate [puppet] - 10https://gerrit.wikimedia.org/r/429264 (https://phabricator.wikimedia.org/T193048) (owner: 10Rush) [20:46:48] (03PS4) 10Rush: openstack: fixup keystone logrotate [puppet] - 10https://gerrit.wikimedia.org/r/429264 (https://phabricator.wikimedia.org/T193048) [20:51:13] (03PS1) 10Herron: WIP: icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 
(https://phabricator.wikimedia.org/T82937) [20:51:39] (03CR) 10jerkins-bot: [V: 04-1] WIP: icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 (https://phabricator.wikimedia.org/T82937) (owner: 10Herron) [20:52:56] (03PS2) 10Herron: WIP: icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 (https://phabricator.wikimedia.org/T82937) [20:54:48] (03PS2) 10Imarlier: coal: remove files that aren't needed any longer [puppet] - 10https://gerrit.wikimedia.org/r/428980 (https://phabricator.wikimedia.org/T191994) [20:56:11] (03CR) 10Imarlier: "Tagging a bunch of people who have merge access to puppet -- sorry about the review-spam." [puppet] - 10https://gerrit.wikimedia.org/r/428980 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [21:00:04] MaxSem and kaldari: How many deployers does it take to do Redeploy ArticleCreationWorkflow deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T2100). [21:00:06] (03PS3) 10MarcoAurelio: idwikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/429342 (https://phabricator.wikimedia.org/T192726) [21:00:24] no_justification: are we clear to proceed? [21:00:36] Proceed....? [21:00:49] just 1 I imagine [21:01:01] in other words, is your deployment done, no_justification? [21:01:07] Been done [21:01:12] wee [21:01:54] MaxSem: I have a 1:1 meeting with Danny right now. Do you need me for testing or do you have it covered? 
[21:02:14] I guess I can handle it kaldari
[21:05:59] !log maxsem@tin Synchronized php-1.32.0-wmf.1/extensions/ArticleCreationWorkflow/: https://gerrit.wikimedia.org/r/#/c/429111/ (duration: 01m 00s)
[21:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:33] (03PS3) 10MaxSem: Redeploy ArticleCreationWorkflow, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455)
[21:07:38] (03CR) 10MaxSem: [C: 032] Redeploy ArticleCreationWorkflow, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[21:09:01] (03Merged) 10jenkins-bot: Redeploy ArticleCreationWorkflow, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[21:09:54] (03CR) 10jenkins-bot: Redeploy ArticleCreationWorkflow, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[21:13:18] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/429017 (duration: 00m 59s)
[21:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:03] !log maxsem@tin Started scap: Deploy ACW to test wikis, https://gerrit.wikimedia.org/r/429017 / T192455
[21:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:09] T192455: Permanently implement autoconfirmed-account-requirement for new article creation on en.wiki - https://phabricator.wikimedia.org/T192455
[21:20:57] (03PS1) 10Bstorm: wiki replicas: add GRANT statement to $wiki_p database creation [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490)
[21:21:36] (03PS1) 10Andrew Bogott: bootstrapvz: one more attempt to properly order nscd and nslcd restarts [puppet] - 10https://gerrit.wikimedia.org/r/429350
[21:22:19] (03CR) 10Andrew Bogott: [C: 032] bootstrapvz: one more attempt to properly order nscd and nslcd restarts [puppet] - 10https://gerrit.wikimedia.org/r/429350 (owner: 10Andrew Bogott)
[21:29:45] (03PS2) 10MaxSem: Redeploy ArticleCreationWorkflow, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455)
[21:38:34] 10Operations, 10Traffic, 10Wikimedia-Shop, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#2078914 (10Pcoombe) The store HSTS header now has `max-age=31557600`, but still no `includeSubDomains` or `preload`.
[21:43:35] 10Puppet, 10Beta-Cluster-Infrastructure, 10MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), 10Patch-For-Review: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4162818 (10MarcoAurelio) @EddieGP deployment-cpjobqueue puppet was broken due to disk full; this was f...
[21:44:28] hmm, "Updating LocalisationCache for 1.32.0-wmf.1 using 10 thread(s)"'s been running for 30 minutes now :O
[21:47:19] is that still running slow due to hhvm?
[21:48:06] yup
[21:48:26] booo
[21:48:43] didn't realise that the 40 minute long run reported is for a single step
[21:51:42] 10Puppet, 10Beta-Cluster-Infrastructure, 10MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), 10Patch-For-Review: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4162850 (10MarcoAurelio) Also, there are stalled jobs in the `job` table: ``` wikiadmin@deployment-db...
[22:11:10] !log maxsem@tin Finished scap: Deploy ACW to test wikis, https://gerrit.wikimedia.org/r/429017 / T192455 (duration: 57m 06s)
[22:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:11:17] T192455: Permanently implement autoconfirmed-account-requirement for new article creation on en.wiki - https://phabricator.wikimedia.org/T192455
[22:12:11] (03CR) 10Niharika29: [C: 031] "You'd have to SWAT this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: 10Dbarratt)
[22:13:50] (03CR) 10MaxSem: [C: 032] Redeploy ArticleCreationWorkflow, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[22:13:52] (03CR) 10Niharika29: "This will go out in SWAT?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[22:15:06] (03Merged) 10jenkins-bot: Redeploy ArticleCreationWorkflow, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[22:19:42] (03CR) 10jenkins-bot: Redeploy ArticleCreationWorkflow, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[22:21:43] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/429100/ (duration: 01m 00s)
[22:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:27] kaldari: we're live ^
[22:22:37] cool
[22:23:37] (03Abandoned) 10Jcrespo: Add ferm service for mariadb_dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/316341 (owner: 10Muehlenhoff)
[22:24:34] MaxSem: I confirmed that it's working
[22:30:28] (03PS1) 10Jgreen: A/PTR for frbast1001.wikimedia.org and service cnames for frack bastions [dns] - 10https://gerrit.wikimedia.org/r/429354 (https://phabricator.wikimedia.org/T193178)
[22:34:11] (03CR) 10Jgreen: [C: 032] A/PTR for frbast1001.wikimedia.org and service cnames for frack bastions [dns] - 10https://gerrit.wikimedia.org/r/429354 (https://phabricator.wikimedia.org/T193178) (owner: 10Jgreen)
[22:38:35] !log deployed DNS update for frbast1001.wikimedia.org
[22:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:05] hey, anybody know if we changed poppler-utils package recently?
[22:40:33] getting reports of new PDFs failing to render as '0x0' which symptom matches up with having a newer pdfinfo command
[22:40:48] i have a fix in the works for PdfHandler to call pdfinfo correctly for both old and new versions
[22:41:10] but it'd be nice to confirm if the package changed recently
[22:44:40] !log start test measuring elasticsearch master mutation latency in codfw
[22:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:20] Jesus
[22:45:25] That page is well over a meg
[22:48:10] you get a meg
[22:48:12] and YOU get a meg
[22:57:00] (03PS1) 10Subramanya Sastry: Enable RemexHtml on wikis with <100 issues in high-priority linter cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429357 (https://phabricator.wikimedia.org/T192299)
[23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T2300).
[23:00:04] Niharika and brion: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:18] \o/
[23:01:19] oh can i add one more real quick?
[23:01:59] or should i wait to test it more :D
[23:02:23] testing? do it in production >.> *runs away*
[23:03:25] agh, i can't log in to wikitech on this laptop. yay keys
[23:03:40] anyway, https://gerrit.wikimedia.org/r/#/c/429356/ is a hotfix for PdfHandler
[23:03:45] but it can wait if necessary
[23:04:47] * Reedy looks at the swat queue
[23:05:16] (03PS3) 10Reedy: Enable CodeMirror on RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) (owner: 10Niharika29)
[23:05:20] o/
[23:05:25] I'm here.
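The '0x0' PdfHandler symptom above comes down to parsing pdfinfo output whose format shifted between poppler-utils versions. A minimal sketch of version-tolerant parsing, assuming older pdfinfo prints a line like `Page size: 612 x 792 pts` while newer builds can print per-page lines like `Page    1 size: 612 x 792 pts`; the function name is made up and the real fix is the PdfHandler change linked above:

```python
import re

def pdf_dimensions(pdfinfo_output):
    """Extract (width, height) in points from pdfinfo output.

    Matches both the old 'Page size:' label and the newer
    per-page 'Page N size:' label, so a label change doesn't
    silently produce the 0-by-0 fallback.
    """
    m = re.search(r"Page(?:\s+\d+)? size:\s*([\d.]+) x ([\d.]+)", pdfinfo_output)
    if not m:
        return (0, 0)  # the '0x0' symptom: no recognizable size line
    return (float(m.group(1)), float(m.group(2)))

# Both label variants parse to the same dimensions:
print(pdf_dimensions("Page size: 612 x 792 pts"))       # → (612.0, 792.0)
print(pdf_dimensions("Page    1 size: 612 x 792 pts"))  # → (612.0, 792.0)
```

A parser that only matched the old label would hit the `(0, 0)` branch on newer output, which is exactly the rendering failure being reported.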
[23:05:41] (03CR) 10Reedy: [C: 032] Enable CodeMirror on RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) (owner: 10Niharika29)
[23:07:05] (03Merged) 10jenkins-bot: Enable CodeMirror on RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) (owner: 10Niharika29)
[23:07:46] Niharika: Do you care about it going on mwdebug?
[23:08:47] Reedy: No preference.
[23:09:02] You'll bear the blame if things break. :P
[23:09:41] (03CR) 10jenkins-bot: Enable CodeMirror on RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) (owner: 10Niharika29)
[23:10:04] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 01m 00s)
[23:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:15] stupid paste fail
[23:10:40] codemirror is on arwiki now
[23:11:57] Reedy: All RTL wikis, right?
[23:12:47] yup
[23:13:38] Thanks!
[23:14:23] brion: Just want the UW patch pushing everywhere too?
[23:14:50] Reedy: yep, it's a fix for a previous patch so should go out all-wheres
[23:15:04] Sorry, I mean, do you want it mwdebug first?
[23:15:11] ah :D
[23:15:19] nah just put it out
[23:16:57] !log reedy@tin Synchronized php-1.32.0-wmf.1/extensions/UploadWizard/: (no justification provided) (duration: 01m 00s)
[23:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:35] heh
[23:22:18] Booo
[23:22:32] your pdfhandler patch won't trivially cherry pick to 1.30 or earlier
[23:22:51] yeah it changed from a large string to small strings i think
[23:22:56] will have to manually backport
[23:23:17] i don't think we need that for prod though, so i'll do tomorrow
[23:23:38] parameter splitting into arrays and stuff
[23:23:57] yep
[23:24:43] I'm surprised though
[23:24:48] cherry pick on cli
[23:24:52] it finds the right file etc
[23:24:55] (cause it's renamed)
[23:26:04] rename detection ftw
[23:27:09] (03PS2) 10Dzahn: webperfX001: start using the webperf role [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier)
[23:28:45] (03CR) 10Dzahn: "this removes the "perf-roots" admin group from the host specific files and the role has the admin group "perf-team". So they are not the s" [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier)
[23:31:00] why do we have "perf-team" and "perf-roots" admin groups if both of them have the exact same privileges (root) and the members also overlap, heh
[23:31:36] !log reedy@tin Synchronized php-1.32.0-wmf.1/extensions/PdfHandler/: (no justification provided) (duration: 01m 00s)
[23:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:00] ah, because perf-roots are applied on way more things.. right
[23:32:38] (03CR) 10Dzahn: "nevermind, i see both have the same privileges but are used in a different context. lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier)
[23:32:42] (03CR) 10Dzahn: [C: 032] webperfX001: start using the webperf role [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier)
[23:32:54] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 330 MB (3% inode=75%)
[23:35:13] (03CR) 10Dzahn: [C: 032] "on webperf1001/2001 all the users have been created, packages have been installed.. there is just an error that it fails to start the stat" [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier)
[23:36:23] (03PS3) 10Dzahn: coal: remove files that aren't needed any longer [puppet] - 10https://gerrit.wikimedia.org/r/428980 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier)
[23:36:24] PROBLEM - Check systemd state on webperf1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:36:44] PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:37:08] Krinkle: ^ roles applied , looks like it just needed 2 puppet runs
[23:37:18] after the first one statsv wasnt running but now it is
[23:37:25] RECOVERY - Check systemd state on webperf1001 is OK: OK - running: The system is fully operational
[23:38:03] (03CR) 10Dzahn: [C: 032] coal: remove files that aren't needed any longer [puppet] - 10https://gerrit.wikimedia.org/r/428980 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier)
[23:38:44] RECOVERY - Check systemd state on webperf2001 is OK: OK - running: The system is fully operational
[23:46:52] (03CR) 10Dbarratt: "> You'd have to SWAT this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: 10Dbarratt)
[23:48:08] (03PS3) 10Dbarratt: Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962)
[23:49:01] mutante: thanks for the merge! Statsv is scap deployed, which is why it's not starting. I'm not at a computer right now (it's about 8pm here), but can take care of that in a bit. Is there anything alerting as a result? If so, can it just be silenced?
[23:52:02] Oh, hey, looks like it addressed itself. Interesting! Not sure how, but all good.
[23:53:07] These things are all atomic, either by design or via Kafka commit/statsd
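As an aside on the store.wikimedia.org HSTS report earlier in the log (the header carries `max-age=31557600` but no `includeSubDomains` or `preload`): whether a Strict-Transport-Security value meets the browser preload requirements can be checked mechanically. A minimal sketch; the helper name is made up, and the one-year `max-age` floor (31536000 seconds) reflects the published preload-list requirements (which additionally require serving the header from the root domain over HTTPS):

```python
def hsts_issues(header_value):
    """Return the preload requirements missing from an HSTS header value."""
    directives = {d.strip().lower() for d in header_value.split(";")}
    missing = []
    # Preload requires max-age of at least one year (31536000 seconds).
    max_age = next((d for d in directives if d.startswith("max-age=")), None)
    if max_age is None or int(max_age.split("=", 1)[1]) < 31536000:
        missing.append("max-age>=31536000")
    if "includesubdomains" not in directives:
        missing.append("includeSubDomains")
    if "preload" not in directives:
        missing.append("preload")
    return missing

# The header reported for the store passes the max-age check
# but is missing the other two directives:
print(hsts_issues("max-age=31557600"))  # → ['includeSubDomains', 'preload']
```

With `includeSubDomains` and `preload` added, the same check returns an empty list, i.e. the header would satisfy the directive requirements.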