[00:00:05] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T0000).
[00:54:19] (CR) Eevans: "> While I agree that no harm would come from increasing the map count" [puppet] - https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: Eevans)
[01:19:28] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:19:29] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:19:38] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:19:38] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:19:38] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:20:09] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:23:39] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[01:23:48] Operations, DBA, Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4159963 (Peachey88)
[01:25:29] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[01:25:38] RECOVERY - Disk space on stat1005 is OK: DISK OK
[01:25:38] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:25:38] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[01:26:09] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[01:26:28] RECOVERY - DPKG on stat1005 is OK: All packages OK
[01:28:39] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:33:06] (CR) MaxSem: [C: +1] Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: Dbarratt)
[01:35:00] " Unable to run wmf-auto-reimage-host: Failed to puppet_generate_certs
[01:35:03] :/
[02:02:09] PROBLEM - HHVM processes on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:02:10] PROBLEM - nutcracker port on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:03:49] PROBLEM - HHVM rendering on mw2163 is CRITICAL: connect to address 10.192.32.51 and port 80: Connection refused
[02:03:50] PROBLEM - nutcracker process on mw2163 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:04:37] ^ scheduling downtime
[02:05:17] !log mw2163 through mw2166: since the wmf-auto-reimage failed after OS but before puppet run due to "Failed to puppet_generate_certs" i manually logged in with install-console and signed puppet certs (T174431)
[02:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:25] T174431: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431
[02:13:09] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524708783 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4778 keys, up 4 minutes 11 seconds - replication_delay is 1524708783
[02:14:09] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[02:14:19] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1524708856 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708436 keys, up 4 minutes 32 seconds - replication_delay is 1524708856
[02:15:40] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1524708936 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4707459 keys, up 4 minutes 46 seconds - replication_delay is 1524708936
[02:16:10] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524708967 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4778 keys, up 7 minutes 15 seconds - replication_delay is 1524708967
[02:16:49] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524708998 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4787 keys, up 4 minutes 45 seconds - replication_delay is 1524708998
[02:17:19] RECOVERY - HHVM processes on mw2163 is OK: PROCS OK: 1 process with command name hhvm
[02:17:20] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1524709036 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4710337 keys, up 4 minutes 31 seconds - replication_delay is 1524709036
[02:19:09] RECOVERY - HHVM rendering on mw2163 is OK: HTTP OK: HTTP/1.1 200 OK - 74979 bytes in 7.812 second response time
[02:19:10] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381
[02:20:10] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1524709206 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4606618 keys, up 4 minutes 20 seconds - replication_delay is 1524709206
[02:21:09] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1524709262 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 7845 keys, up 4 minutes 9 seconds - replication_delay is 1524709262
[02:21:19] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381
[02:24:09] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524709443 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5822 keys, up 2 minutes 4 seconds - replication_delay is 1524709443
[02:24:59] PROBLEM - Check systemd state on rdb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:28:23] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524709687 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5822 keys, up 6 minutes 8 seconds - replication_delay is 1524709687
[02:29:13] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524709750 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4704 keys, up 3 minutes 59 seconds - replication_delay is 1524709750
[02:32:22] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524709939 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 7346 keys, up 4 minutes 7 seconds - replication_delay is 1524709939
[02:32:23] PROBLEM - HHVM rendering on mw2164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:34:03] PROBLEM - Apache HTTP on mw2166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:40:32] ACKNOWLEDGEMENT - Apache HTTP on mw2166 is CRITICAL: connect to address 10.192.32.54 and port 80: Connection refused daniel_zahn reinstall
[02:40:32] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw2166 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.150 second response time daniel_zahn reinstall
[02:42:03] RECOVERY - Apache HTTP on mw2166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 5.683 second response time
[02:44:42] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4707866 keys, up 31 minutes 51 seconds - replication_delay is 12
[02:45:32] RECOVERY - HHVM rendering on mw2164 is OK: HTTP OK: HTTP/1.1 200 OK - 75654 bytes in 5.214 second response time
[02:47:02] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[02:54:42] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 612 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4707866 keys, up 41 minutes 51 seconds - replication_delay is 612
[02:57:33] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 5492 keys, up 48 minutes 37 seconds - replication_delay is 21
[03:00:54] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708123 keys, up 51 minutes 3 seconds - replication_delay is 6
[03:01:04] PROBLEM - Host mw2163 is DOWN: PING CRITICAL - Packet loss = 100%
[03:01:45] RECOVERY - nutcracker port on mw2163 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:01:49] ACKNOWLEDGEMENT - Host mw2163 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reboot
[03:01:54] RECOVERY - Host mw2163 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[03:02:34] RECOVERY - nutcracker process on mw2163 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[03:03:44] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8368 keys, up 46 minutes 44 seconds - replication_delay is 35
[03:05:45] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 5734 keys, up 50 minutes 14 seconds - replication_delay is 23
[03:07:44] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 631 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 5492 keys, up 58 minutes 47 seconds - replication_delay is 631
[03:08:54] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708205 keys, up 56 minutes 4 seconds - replication_delay is 48
[03:10:14] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:10:44] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381
[03:10:54] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 609 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708123 keys, up 1 hours 1 minutes - replication_delay is 609
[03:11:45] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 5633 keys, up 56 minutes 14 seconds - replication_delay is 45
[03:13:44] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 636 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8368 keys, up 56 minutes 45 seconds - replication_delay is 636
[03:13:54] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[03:14:54] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708175 keys, up 1 hours 2 minutes - replication_delay is 34
[03:16:34] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5366 keys, up 1 hours 4 minutes - replication_delay is 26
[03:21:54] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 649 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 5633 keys, up 1 hours 6 minutes - replication_delay is 649
[03:24:05] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708095 keys, up 1 hours 14 minutes - replication_delay is 41
[03:24:55] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 637 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708175 keys, up 1 hours 12 minutes - replication_delay is 637
[03:26:34] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 629 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5366 keys, up 1 hours 14 minutes - replication_delay is 629
[03:27:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 722.16 seconds
[03:30:04] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708079 keys, up 1 hours 17 minutes - replication_delay is 50
[03:34:05] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 639 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708095 keys, up 1 hours 24 minutes - replication_delay is 639
[03:38:05] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708131 keys, up 1 hours 28 minutes - replication_delay is 6
[03:40:04] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708079 keys, up 1 hours 27 minutes - replication_delay is 651
[03:44:05] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[03:44:34] RECOVERY - Check systemd state on rdb1004 is OK: OK - running: The system is fully operational
[03:45:05] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708099 keys, up 1 hours 35 minutes - replication_delay is 50
[03:51:44] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4606513 keys, up 1 hours 35 minutes - replication_delay is 29
[03:52:14] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708049 keys, up 1 hours 39 minutes - replication_delay is 0
[03:54:39] <_joe_> !log stopping redis replication from eqiad to codfw for the jobqueue cluster, we have an issue ongoing with CirrusSearch jobs and replication is broken
[03:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:55:14] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 654 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708099 keys, up 1 hours 45 minutes - replication_delay is 654
[03:59:05] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8021 keys, up 1 hours 42 minutes - replication_delay is 34
[04:01:44] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 631 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4606513 keys, up 1 hours 45 minutes - replication_delay is 631
[04:02:14] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 602 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708049 keys, up 1 hours 49 minutes - replication_delay is 602
[04:02:44] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4606513 keys, up 1 hours 46 minutes - replication_delay is 4
[04:04:05] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381
[04:05:05] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 7976 keys, up 1 hours 48 minutes - replication_delay is 35
[04:05:55] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5366 keys, up 44 seconds
[04:06:04] PROBLEM - confd service on rdb2005 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[04:06:14] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 5492 keys, up 1 minutes 1 seconds
[04:06:14] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 5633 keys, up 1 minutes 1 seconds
[04:06:15] PROBLEM - confd service on rdb2001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[04:06:15] <_joe_> expected ^^ I just stopped confd there
[04:06:33] <_joe_> in order to be able to disable replication
[04:06:44] PROBLEM - confd service on rdb2003 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[04:08:24] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708049 keys, up 1 minutes 6 seconds
[04:08:24] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4708099 keys, up 1 minutes 7 seconds
[04:09:15] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481
[04:09:24] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[04:09:24] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6480
[04:09:34] <_joe_> wat
[04:11:15] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5822 keys, up 6 minutes 6 seconds
[04:12:24] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6480
[04:12:24] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481
[04:13:24] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4704 keys, up 8 minutes 9 seconds
[04:13:24] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 7138 keys, up 8 minutes 9 seconds
[04:13:24] RECOVERY - Check health of redis instance on 6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5162 keys, up 1 hours 59 minutes
[04:13:54] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4707459 keys, up 2 hours 3 minutes
[04:24:30] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4160039 (Smalyshev) Open→Resolved
[04:29:54] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 187.18 seconds
[04:31:08] <_joe_> SMalyshev: around?
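The redis health checks above report `replication_delay` either as a small number of seconds (e.g. 612, compared against the 600 s threshold) or as a raw epoch timestamp (e.g. 1524708783) right after an instance restarts and has not yet synced from its master. A minimal sketch of how such a check could derive the delay from a slave's `INFO replication` fields and apply the threshold; the timestamp-fallback behavior is an assumption inferred from the alert text, not the actual WMF plugin:

```python
import time

CRIT_THRESHOLD = 600  # seconds, the second number printed in the alerts above

def replication_delay(info, now=None):
    """Estimate slave lag from a parsed Redis INFO dict.

    Uses master_last_io_seconds_ago while the replication link is up; if
    the slave has never completed a sync, falls back to the current epoch
    time, which is why freshly restarted instances above alert with a
    delay that looks like a unix timestamp (assumption, for illustration).
    """
    now = time.time() if now is None else now
    if info.get("master_link_status") == "up":
        return info.get("master_last_io_seconds_ago", 0)
    return int(now)  # never synced: delay degenerates to a timestamp

def check(info, now=None):
    delay = replication_delay(info, now)
    state = "CRITICAL" if delay > CRIT_THRESHOLD else "OK"
    return state, delay

# A healthy slave 12 seconds behind its master:
print(check({"master_link_status": "up", "master_last_io_seconds_ago": 12}))
# → ('OK', 12)
```

This explains the pattern in the log: a just-restarted instance alerts with a huge "delay" until its first sync completes, then recovers with a single-digit delay.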
[04:31:33] <_joe_> if so, can you ping ebernhardson please? we have what I think are huge issues with the elasticsearch cluster in codfw
[04:31:43] yes here
[04:32:03] <_joe_> It's very early here and getting on the phone would wake up everyone in the house
[04:32:13] <_joe_> I'm opening a UBN! ticket now
[04:32:14] unfortunately I don't know any way to get to ebernhardson except IRC/email...
[04:32:27] <_joe_> the office wiki contact list I guess
[04:32:29] maybe gehel is online or will be soon?
[04:32:35] ah, I'll try
[04:33:03] <_joe_> heh I guess guillame will be around in a few hours
[04:33:10] <_joe_> it's 6.33 AM here
[04:36:04] _joe_: pinged, he's on his way
[04:36:11] <_joe_> thanks
[04:36:31] Operations, CirrusSearch, Discovery, Discovery-Search, Search-Platform-Programs: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4160062 (Joe)
[04:38:13] ACKNOWLEDGEMENT - confd service on rdb2001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto T193112
[04:38:13] ACKNOWLEDGEMENT - confd service on rdb2003 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto T193112
[04:38:13] ACKNOWLEDGEMENT - confd service on rdb2005 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive Giuseppe Lavagetto T193112
[04:39:49] !log unfreeze writes to elasticsearch codfw cluster
[04:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:41] Operations, CirrusSearch, Discovery, Discovery-Search, Search-Platform-Programs: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4160076 (EBernhardson) It looks like writes were frozen to the codfw clust...
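The "unfreeze writes" log entry above is the resolution of the incident: while writes to the codfw Elasticsearch cluster were frozen, write jobs could not complete and the job queue backed up. A toy model of that gating behavior (all names hypothetical; this is an illustration of the freeze/requeue interaction, not CirrusSearch's actual code):

```python
# Toy sketch: write jobs consult a shared "frozen" flag and requeue
# themselves instead of writing, so the backlog grows until the thaw.

class Cluster:
    def __init__(self):
        self.frozen = False
        self.docs = []

def run_write_job(cluster, doc, retry_queue):
    if cluster.frozen:
        retry_queue.append(doc)  # back off: queue grows while frozen
        return "requeued"
    cluster.docs.append(doc)
    return "written"

codfw = Cluster()
codfw.frozen = True              # writes frozen, jobs pile up
backlog = []
run_write_job(codfw, "page-1", backlog)

codfw.frozen = False             # "!log unfreeze writes ..."
for doc in list(backlog):        # drain the backlog after the thaw
    backlog.remove(doc)
    run_write_job(codfw, doc, backlog)
# codfw.docs is now ["page-1"] and backlog is empty
```

The sketch shows why the queue "gets back to normal sizes" only after the unfreeze: nothing in the retry loop can make progress while the flag is set.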
[04:49:27] Operations, CirrusSearch, Discovery, Discovery-Search, Search-Platform-Programs: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4160077 (EBernhardson) I suppose we should lower the drop timeout, in `$wg...
[05:03:14] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[05:03:44] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[05:18:03] !log Deploy schema change on dbstore1002:s2 - T191519 T188299 T190148
[05:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:11] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:18:11] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:18:11] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:19:24] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[05:19:55] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[05:24:00] zeljkof: You've posted on sev MMV patches, that jenkins failure comes from (now) merged task. Jenkins still fails, f.e. on https://gerrit.wikimedia.org/r/364175
[05:32:06] Warning: Alert for device cr1-ulsfo.wikimedia.org - Inbound interface errors
[05:32:58] Operations, CirrusSearch, Discovery, Discovery-Search, Search-Platform-Programs: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4160124 (Joe) The queue is getting back to normal sizes, and the job produ...
[05:33:05] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[05:33:34] oh wow that yellow is bright
[05:33:35] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[05:39:39] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4121650 (Legoktm) >>! In T191921#4122327, @Joe wrote: > What are the blockers for the use of PHP7? > > All I see on the ticket mentioned is the memc...
[05:42:06] Warning cleared: Device cr1-ulsfo.wikimedia.org recovered from Inbound interface errors
[06:02:15] (PS1) Marostegui: s1.hosts: Add db1116:3311 [software] - https://gerrit.wikimedia.org/r/429135 (https://phabricator.wikimedia.org/T190704)
[06:04:08] Volker_E: I think that problem is resolved
[06:04:44] Well, is there a problem, or are my comments confusing?
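The `Router interfaces` checks above condense per-interface SNMP state into `up/down/dormant/excluded/unused` counts and go CRITICAL as soon as any interface is down. A small sketch of that tallying logic (the state names and the "any down is CRITICAL" rule are read off the alert text; the real Icinga plugin may differ):

```python
from collections import Counter

def summarize(interfaces):
    """Tally per-interface operational states into the counts seen above."""
    counts = Counter({"up": 0, "down": 0, "dormant": 0, "excluded": 0, "unused": 0})
    counts.update(interfaces.values())
    return counts

def check(host, interfaces):
    c = summarize(interfaces)
    state = "CRITICAL" if c["down"] > 0 else "OK"
    detail = ", ".join(f"{k}: {v}" for k, v in c.items())
    return f"{state}: host {host}, interfaces {detail}"

# Hypothetical interface names; 63 up, 1 down, matching the cr1-ulsfo alert:
ifaces = {f"xe-0/0/{i}": "up" for i in range(63)}
ifaces["xe-0/0/63"] = "down"
print(check("198.35.26.192", ifaces))
```

A single flapping port is enough to toggle the whole check between CRITICAL and OK, which is exactly the pattern cr1-ulsfo and cr1-eqord show through the morning.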
[06:07:51] nice, we have librenms reporting here now
[06:09:00] zeljkof: I hoped for your help hunting down, what's currently stopping jenkins…
[06:11:54] Volker_E: on the phone now, will check in an hour or so
[06:12:26] zeljkof: great, not pressing
[06:19:17] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=elasticsearch
[06:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:44] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:51:19] Operations, Discovery, Discovery-Analysis, Product-Analytics, Discovery-Search (Current work): Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967#4160213 (Gehel) a: Gehel→None
[07:02:17] (PS1) Muehlenhoff: Remove Madhu from Icinga config [puppet] - https://gerrit.wikimedia.org/r/429142
[07:04:05] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[07:04:14] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:04:51] (PS1) Jcrespo: mariadb: Depool db1090, repool db1122 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/429143 (https://phabricator.wikimedia.org/T192979)
[07:05:42] (CR) Muehlenhoff: [C: +2] Remove Madhu from Icinga config [puppet] - https://gerrit.wikimedia.org/r/429142 (owner: Muehlenhoff)
[07:07:45] (PS1) Muehlenhoff: Remove access credentials for Madhu [puppet] - https://gerrit.wikimedia.org/r/429144
[07:10:58] (CR) Muehlenhoff: [C: +2] Remove access credentials for Madhu [puppet] - https://gerrit.wikimedia.org/r/429144 (owner: Muehlenhoff)
[07:15:20] (CR) Jcrespo: [C: +2] mariadb: Depool db1090, repool db1122 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/429143 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[07:16:43] (Merged) jenkins-bot: mariadb: Depool db1090, repool db1122 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/429143 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[07:16:52] !log re-enabling puppet on rdb2* - T193112
[07:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:58] T193112: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112
[07:19:42] (CR) jenkins-bot: mariadb: Depool db1090, repool db1122 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/429143 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[07:20:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1090, pool db1122 with full weight (duration: 01m 23s)
[07:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:38] !log restarting redis masters in codfw - T193112
[07:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:50] RECOVERY - confd service on rdb2001 is OK: OK - confd is active
[07:23:19] RECOVERY - confd service on rdb2003 is OK: OK - confd is active
[07:24:39] RECOVERY - confd service on rdb2005 is OK: OK - confd is active
[07:25:10] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[07:25:35] ^ that's my restart, should be back up already
[07:26:10] RECOVERY - Check health of redis instance on 6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3868 keys, up 1 minutes 37 seconds - replication_delay is 0
[07:27:29] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1524727642 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4606514 keys, up 33 seconds - replication_delay is 1524727642
[07:27:59] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524727674 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 5633 keys, up 24 seconds - replication_delay is 1524727674
[07:28:09] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5822 keys, up 3 hours 22 minutes
[07:28:09] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4704 keys, up 3 hours 22 minutes
[07:28:10] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 6556 keys, up 3 hours 22 minutes
[07:28:19] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1524727687 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4708049 keys, up 1 minutes 21 seconds - replication_delay is 1524727687
[07:28:19] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[07:28:39] PROBLEM - Check health of redis instance on 6478 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6478 has 1 databases (db0) with 4 keys, up 3 hours 23 minutes
[07:28:39] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524727715 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5366 keys, up 31 seconds - replication_delay is 1524727715
[07:28:59] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1524727734 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 5492 keys, up 53 seconds - replication_delay is 1524727734
[07:29:09] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4706278 keys, up 2 minutes 19 seconds - replication_delay is 0
[07:29:29] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4706329 keys, up 2 minutes 38 seconds - replication_delay is 0
[07:29:29] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4604947 keys, up 2 minutes 35 seconds - replication_delay is 0
[07:29:39] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3450 keys, up 1 minutes 31 seconds - replication_delay is 0
[07:29:59] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3791 keys, up 1 minutes 47 seconds - replication_delay is 0
[07:30:00] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3582 keys, up 1 minutes 53 seconds - replication_delay is 0
[07:36:19] RECOVERY - Check health of redis instance on 6478 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6478 has 1 databases (db0) with 4 keys, up 7 seconds - replication_delay is 6
[07:37:40] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3726 keys, up 1 minutes 22 seconds - replication_delay is 0
[07:38:40] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3371 keys, up 2 minutes 19 seconds - replication_delay is 0
[07:38:40] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 5133 keys, up 2 minutes 13 seconds - replication_delay is 0
[07:41:58] (PS1) Jcrespo: mariadb: Convert db1090 into a core multiinstance host [puppet] - https://gerrit.wikimedia.org/r/429148 (https://phabricator.wikimedia.org/T192979)
[07:43:18] (PS1) Jcrespo: dbhosts: Convert db1090 into a core multiinstance host [software] - https://gerrit.wikimedia.org/r/429149
[07:45:02] !log stopping db1090 mariadb instance to move its path, port and socket
[07:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:32] (PS2) Jcrespo: mariadb: Convert db1090 into a core multiinstance host [puppet] - https://gerrit.wikimedia.org/r/429148 (https://phabricator.wikimedia.org/T192979)
[07:47:35] (CR) Jcrespo: [C: +2] mariadb: Convert db1090 into a core multiinstance host [puppet] - https://gerrit.wikimedia.org/r/429148 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[07:49:24] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) (owner: MaxSem)
[07:50:34] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) (owner: MaxSem)
[07:51:32] (CR) Marostegui: [C: +2] s1.hosts: Add db1116:3311 [software] - https://gerrit.wikimedia.org/r/429135 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:52:22] (Merged) jenkins-bot: s1.hosts: Add db1116:3311 [software] - https://gerrit.wikimedia.org/r/429135 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[07:53:27] (CR) Jcrespo: [C: +2] dbhosts: Convert db1090 into a core multiinstance host [software] - https://gerrit.wikimedia.org/r/429149 (owner: Jcrespo)
[07:58:00] !log Deploy schema change on db1090 - T191519 T188299 T190148
[07:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:07] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[07:58:07] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[07:58:07] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[07:58:34] Operations, Wikispeech, Wikispeech-WMSE: TTS server deployment strategy - https://phabricator.wikimedia.org/T193072#4160271 (akosiaris) > So, Operations can you tell @Lokal_Profil whether docker-compose is a valid deployment strategy? Or if they need to do so something else... A valid deployment str...
[08:04:31] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[08:05:02] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[08:05:39] (PS1) Marostegui: db-eqiad,db-codfw.php: Include db1116:3311 in config [mediawiki-config] - https://gerrit.wikimedia.org/r/429150 (https://phabricator.wikimedia.org/T190704)
[08:07:12] (PS1) Muehlenhoff: Stop using mw-no-tmp.cfg partman recipe and remove it [puppet] - https://gerrit.wikimedia.org/r/429151 (https://phabricator.wikimedia.org/T156955)
[08:08:05] (CR) Jcrespo: [C: -1] "not a mediawiki host, shouldn't be on config" [mediawiki-config] - https://gerrit.wikimedia.org/r/429150 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[08:11:08] (Abandoned) Marostegui: db-eqiad,db-codfw.php: Include db1116:3311 in config [mediawiki-config] - https://gerrit.wikimedia.org/r/429150 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[08:19:03] (CR) Filippo Giunchedi: [C: +1] Stop using mw-no-tmp.cfg partman recipe and remove it [puppet] - https://gerrit.wikimedia.org/r/429151 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[08:19:19] (PS1)
10Muehlenhoff: Remove graphite-dmcache.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/429152 (https://phabricator.wikimedia.org/T156955) [08:24:33] !log stop and upgrade db1109 [08:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:34] 10Operations, 10OTRS: Upgrade OTRS to 5.0.27 - https://phabricator.wikimedia.org/T193118#4160296 (10akosiaris) [08:25:56] 10Operations, 10OTRS: Upgrade OTRS to 5.0.27 - https://phabricator.wikimedia.org/T193118#4160310 (10akosiaris) p:05Triage>03Normal [08:26:34] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#4160313 (10akosiaris) [08:26:47] (03CR) 10Filippo Giunchedi: [C: 031] Remove graphite-dmcache.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/429152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:26:53] 10Operations, 10OTRS: Upgrade OTRS to 5.0.27 - https://phabricator.wikimedia.org/T193118#4160296 (10akosiaris) [08:29:51] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [08:30:22] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:30:42] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#4160318 (10akosiaris) p:05Triage>03Low Since I 'll probably be the one to work on this, and I need to estimate the amount of work (both operationally as well as agent testing)... 
[08:32:23] !log re-attempt reimage of mw1246 (failed yesterday with an error on the puppetmaster, testing whether this can be reproduced) [08:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:39] (03PS1) 10Jcrespo: mariadb: Depool db1069 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429153 (https://phabricator.wikimedia.org/T192979) [08:44:21] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:44:42] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:45:22] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:46:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:47:06] 08Warning Alert for device cr1-ulsfo.wikimedia.org - Inbound interface errors [08:47:46] (03PS1) 10Jcrespo: mariadb: Depool db1086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429154 (https://phabricator.wikimedia.org/T192979) [08:47:48] 10Operations, 10OTRS: Upgrade OTRS to 5.0.27 - https://phabricator.wikimedia.org/T193118#4160326 (10akosiaris) I 've checked the OTRS templates we maintain (https://gerrit.wikimedia.org/g/operations/software/otrs/+/refs/heads/master) and there is no 
change in them for 5.0.27. So we stay with version `1.0.11` o... [08:49:42] RECOVERY - HTTP availability for Varnish on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:50:11] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [08:50:21] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:50:50] (03PS1) 10Jcrespo: install_server: Reimage db1109 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/429155 [08:51:13] !log reimaging mw1320, mw1321, mw1322 (app servers) to stretch [08:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:45] (03CR) 10Jcrespo: [C: 032] install_server: Reimage db1109 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/429155 (owner: 10Jcrespo) [08:52:41] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [08:53:42] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:54:21] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [08:55:31] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:56:06] 10Operations, 10OTRS: Upgrade OTRS to 5.0.27 - https://phabricator.wikimedia.org/T193118#4160328 (10akosiaris) 05Open>03Resolved The upgrade was easy enough so I 
just completed it with minimal downtime (a few secs). [08:56:36] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429154 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [08:57:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:57:50] (03Merged) 10jenkins-bot: mariadb: Depool db1086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429154 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [08:59:27] (03CR) 10jenkins-bot: mariadb: Depool db1086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429154 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [09:01:17] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1086 (duration: 01m 16s) [09:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:04] (03PS2) 10Muehlenhoff: Stop including mediawiki::packages::multimedia for contint [puppet] - 10https://gerrit.wikimedia.org/r/428314 [09:02:24] !log Drop prefswitch_survey on s5 and s6 - T173439 [09:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:07] 08̶W̶a̶r̶n̶i̶n̶g Device cr1-ulsfo.wikimedia.org recovered from Inbound interface errors [09:09:05] (03PS2) 10Muehlenhoff: Remove graphite-dmcache.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/429152 (https://phabricator.wikimedia.org/T156955) [09:11:03] (03CR) 10Muehlenhoff: [C: 032] Remove graphite-dmcache.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/429152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:13:48] !log Drop prefswitch_survey on s4 - T173439 [09:13:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:31] (03CR) 10ArielGlenn: "This looks fine as far as it goes, but what I had in mind was that these files for each batch be kept around for failed shards and re-used" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [09:15:32] !log Temp disabling cr1-ulsfo:xe-1/2/0 (Chicago transport) due to stability issues [09:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:05] !log Drop prefswitch_survey on s2 - T173439 [09:16:07] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [09:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:07] PROBLEM - Apache HTTP on mw1320 is CRITICAL: connect to address 10.64.32.41 and port 80: Connection refused [09:27:07] PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: connect to address 10.64.32.43 and port 443: Connection refused [09:27:07] PROBLEM - Check size of conntrack table on mw1321 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:27:08] PROBLEM - MD RAID on mw1320 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:27:08] PROBLEM - Check systemd state on mw1322 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[09:27:26] ^silencing [09:30:14] !log Drop prefswitch_survey on s7 - T173439 [09:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:07] (03PS1) 10Jcrespo: mariadb: Repool db1109 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429156 [09:35:08] 10Operations: Upgrade ganeti hosts to stretch - https://phabricator.wikimedia.org/T193121#4160393 (10akosiaris) [09:35:17] 10Operations: Upgrade ganeti hosts to stretch - https://phabricator.wikimedia.org/T193121#4160406 (10akosiaris) p:05Triage>03Normal [09:38:03] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1109 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429156 (owner: 10Jcrespo) [09:39:22] (03Merged) 10jenkins-bot: mariadb: Repool db1109 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429156 (owner: 10Jcrespo) [09:39:49] (03CR) 10jenkins-bot: mariadb: Repool db1109 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429156 (owner: 10Jcrespo) [09:40:22] (03CR) 10Gehel: [C: 031] "LGTM" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [09:41:43] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1109 with low load (duration: 01m 16s) [09:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:08] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [09:43:10] (03CR) 10Hoo man: "> This looks fine as far as it goes, but what I had in mind was that these files for each batch be kept around for failed shards and re-us" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [09:43:50] (03CR) 10Hoo man: Wikidata JSON dump: Only dump batches of ~400,000 pages at once (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [09:45:36] !log Drop prefswitch_survey on s3 - T173439 [09:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:30] !log eqiad-prod: more weight to ms-be104[0-3] for container/account - T190081 [09:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:38] T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081 [09:51:01] 10Operations, 10User-fgiunchedi: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4160427 (10fgiunchedi) [09:53:27] (03PS1) 10Muehlenhoff: Use mw-raid1.cfg partman recipe for mw1221-mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/429158 (https://phabricator.wikimedia.org/T106381) [09:57:49] !log Drop prefswitch_survey on s1 - T173439 [09:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:18] RECOVERY - Check size of conntrack table on mw1321 is OK: OK: nf_conntrack is 0 % full [09:59:27] RECOVERY - MD RAID on mw1320 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [10:00:23] (03PS2) 10Muehlenhoff: Stop using mw-no-tmp.cfg partman recipe and remove it [puppet] - 10https://gerrit.wikimedia.org/r/429151 (https://phabricator.wikimedia.org/T156955) [10:00:45] (03CR) 10Filippo Giunchedi: [C: 031] Use mw-raid1.cfg partman recipe for mw1221-mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/429158 (https://phabricator.wikimedia.org/T106381) (owner: 10Muehlenhoff) [10:02:27] RECOVERY - Nginx local proxy to apache on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 5.895 second response time [10:04:17] (03CR) 10Hashar: [C: 031] Stop including mediawiki::packages::multimedia for contint [puppet] - 10https://gerrit.wikimedia.org/r/428314 (owner: 10Muehlenhoff) [10:07:29] RECOVERY - Check systemd state on mw1322 is OK: OK - running: The system 
is fully operational [10:09:50] (03CR) 10Muehlenhoff: [C: 032] Stop using mw-no-tmp.cfg partman recipe and remove it [puppet] - 10https://gerrit.wikimedia.org/r/429151 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:29:18] !log reimaging mw1269, mw1323, mw1324 (app servers) to stretch [10:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:32] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4160536 (10MarcoAurelio) [10:35:39] !log reimaging mw1312 mw1317, mw1339 (API servers) to stretch [10:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:11] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4160592 (10MarcoAurelio) p:05Triage>03High [10:53:34] (03CR) 10Mobrovac: [C: 031] Enable eventbus Kafka producer snappy compression [puppet] - 10https://gerrit.wikimedia.org/r/429007 (https://phabricator.wikimedia.org/T193080) (owner: 10Ottomata) [10:59:02] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4160845 (10MarcoAurelio) ``` maurelio@deployment-cpjobqueue:/$ sudo du -sh / du: cannot access ‘/proc/32319/task/32319/fd/3’: No such file or directory du: cannot access ‘/proc/32319/task... 
[11:04:30] (03CR) 10Mobrovac: "IMHO it would be simpler to set the default in cassandra::sysctl to what we deem to be a good value for cassandra clusters in general and " [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [11:05:06] (03PS3) 10Muehlenhoff: Stop including mediawiki::packages::multimedia for contint [puppet] - 10https://gerrit.wikimedia.org/r/428314 [11:05:44] (03CR) 10Muehlenhoff: [C: 032] Stop including mediawiki::packages::multimedia for contint [puppet] - 10https://gerrit.wikimedia.org/r/428314 (owner: 10Muehlenhoff) [11:10:55] RECOVERY - Kafka Broker Replica Max Lag on kafka1001 is OK: (C)5e+05 ge (W)1e+05 ge 8.509e+04 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [11:14:45] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4160890 (10MarcoAurelio) @Joe Any idea how to fix this? Delete `/var/vda3`? Maybe some work is also needed on `tmpfs`? [11:16:36] RECOVERY - Kafka Broker Replica Max Lag on kafka1003 is OK: (C)5e+05 ge (W)1e+05 ge 9.862e+04 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [11:19:01] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4160899 (10fdans) OK, so to determine the periodicity of the cron job, I ran a city query over ~17,000 IP addresses with: - The most current GeoIP d... 
[11:22:05] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4160932 (10faidon) As far as periodicity goes, note that MaxMind [[ https://support.maxmind.com/geoip-faq/geoip2-and-geoip-legacy-databases/how-often-a... [11:22:12] 10Operations: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155#4160919 (10Volans) [11:22:21] 10Operations: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155#4160936 (10Volans) p:05Triage>03Normal [11:27:56] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4160939 (10MarcoAurelio) File's plently of: ``` {"name":"cpjobqueue","hostname":"deployment-cpjobqueue","pid":136,"level":50,"err":{"message":"KafkaConsumer is not connected","name":"cpj... [11:27:57] (03PS3) 10Arturo Borrero Gonzalez: labs_bootstrapvz: address labtest issues [puppet] - 10https://gerrit.wikimedia.org/r/428694 (https://phabricator.wikimedia.org/T181523) [11:31:03] jouncebot, next [11:31:08] In 1 hour(s) and 28 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1300) [11:45:11] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:45:20] (03CR) 10Muehlenhoff: [C: 031] "Looks good, just couple of small comments." 
(036 comments) [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 (https://phabricator.wikimedia.org/T192763) (owner: 10Filippo Giunchedi) [11:45:41] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:45:51] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:46:01] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:46:02] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:46:23] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [11:47:09] ^ looking [11:49:03] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [11:49:13] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [11:49:32] RECOVERY - DPKG on stat1005 is OK: All packages OK [11:49:52] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [11:51:14] some number crunching job on stat1005 consumed all memory and nrpe failed to fork/service failed, puppet run restarted it correctly [11:51:22] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:03:01] !log mobrovac@tin Started deploy [cpjobqueue/deploy@7fbb152]: Support the exclude_topics config stanza [12:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:13] !log mobrovac@tin Finished deploy [cpjobqueue/deploy@7fbb152]: Support the exclude_topics config stanza (duration: 01m 12s) [12:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:04] (03PS4) 10Arturo Borrero Gonzalez: labs_bootstrapvz: address labtest issues [puppet] - 10https://gerrit.wikimedia.org/r/428694 (https://phabricator.wikimedia.org/T181523) [12:14:21] (03CR) 10Arturo Borrero Gonzalez: [C: 032] labs_bootstrapvz: address labtest issues 
[puppet] - 10https://gerrit.wikimedia.org/r/428694 (https://phabricator.wikimedia.org/T181523) (owner: 10Arturo Borrero Gonzalez) [12:18:48] (03PS4) 10Muehlenhoff: Don't include mediawiki::multimedia on labweb* [puppet] - 10https://gerrit.wikimedia.org/r/428298 [12:25:08] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/11041/" [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [12:31:56] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.133 second response time [12:38:28] 10Operations, 10monitoring: Monitor the BIOS boot order and parameters - https://phabricator.wikimedia.org/T193160#4161046 (10Volans) p:05Triage>03Normal [12:39:56] (03PS2) 10Muehlenhoff: Use mw-raid1.cfg partman recipe for mw1221-mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/429158 (https://phabricator.wikimedia.org/T106381) [12:42:45] (03CR) 10Muehlenhoff: [C: 032] Use mw-raid1.cfg partman recipe for mw1221-mw1258 [puppet] - 10https://gerrit.wikimedia.org/r/429158 (https://phabricator.wikimedia.org/T106381) (owner: 10Muehlenhoff) [12:42:48] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4161078 (10Milimetric) thanks @faidon, we were just seeing if maybe the accuracy of the old databases is really high, we can schedule the jobs less oft... 
[12:48:24] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429174 (https://phabricator.wikimedia.org/T190148) [12:49:32] jouncebot: next [12:49:32] In 0 hour(s) and 10 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1300) [12:50:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429174 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [12:51:09] !log reindexing lost updates on elasticsearch - T193112 [12:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:15] T193112: Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112 [12:51:23] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429174 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [12:51:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429174 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [12:52:35] no more jerkins [12:53:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1113 for alter table (duration: 01m 33s) [12:53:20] !log Deploy schema change on db1113:3312 - T191519 T188299 T190148 [12:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:30] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [12:53:30] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [12:53:30] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 
[12:53:56] (03PS1) 10Muehlenhoff: Remove partman fallback for mediawiki hosts to single disk partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/429175 (https://phabricator.wikimedia.org/T106381) [12:59:17] I'm mostly afk for 2-3 hours, need to get the kids from day care today/look after them throughout the afternoon, I'll be around again later [12:59:36] wrong channel :-) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] here [13:00:24] I can SWAT today [13:00:44] Urbanecm: I'll ping you when a patch is ready at mwdebug, anything special today? :) [13:01:07] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#2345955 (10JAllemandou) +1 for weekly on wednesday. Thanks @fdans and @faidon :) [13:01:49] zeljkof, can you please push both patches directly to production? First one is just replacement of static resources, the second one is testable only to those who have "import right" [13:01:56] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1972 bytes in 0.133 second response time [13:02:09] (BTW I'll have a third patch, if possible...) [13:02:44] Urbanecm: sure, the first patch needs cache purges, right?
[13:03:00] Urbanecm: third patch should not be a problem [13:04:18] zeljkof, yes [13:04:21] (03PS1) 10Urbanecm: Change chapcomwiki's logo, add HD logo for chapcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429178 (https://phabricator.wikimedia.org/T193024) [13:04:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] "Any special reason why CirrusSearch uses the full /bigdata/namespace/… URL while WikibaseQualityConstraints uses the /sparql shortcut? :)" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [13:05:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428953 (https://phabricator.wikimedia.org/T193028) (owner: 10Urbanecm) [13:05:49] zeljkof, added to the calendar. [13:06:04] Urbanecm: ok, I see it [13:06:20] ok [13:06:53] (03Merged) 10jenkins-bot: Fix pixelization of new wiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428953 (https://phabricator.wikimedia.org/T193028) (owner: 10Urbanecm) [13:09:07] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:428953|Fix pixelization of new wiki logos (T193028)]] (duration: 01m 17s) [13:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:14] T193028: Logos for new wikis are pixelized - https://phabricator.wikimedia.org/T193028 [13:10:43] Urbanecm: 428953 is deployed, you should see the updates at mwdebug, purging cache [13:11:03] (03CR) 10jenkins-bot: Fix pixelization of new wiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428953 (https://phabricator.wikimedia.org/T193028) (owner: 10Urbanecm) [13:11:10] ok [13:13:03] Urbanecm: cache purged, please check [13:13:21] Working, thanks [13:14:04] (03PS1) 10Jcrespo: mariadb: Switch db1086 row format to statement [puppet] - 10https://gerrit.wikimedia.org/r/429182 (https://phabricator.wikimedia.org/T192979) [13:14:10] (03PS3) 10Zfilipin: Add all 
Hindi projects plus meta as import sources for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428952 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [13:14:59] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428952 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [13:16:27] (03Merged) 10jenkins-bot: Add all Hindi projects plus meta as import sources for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428952 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [13:17:14] (03CR) 10jenkins-bot: Add all Hindi projects plus meta as import sources for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428952 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [13:17:21] (03CR) 10Jcrespo: [C: 032] mariadb: Switch db1086 row format to statement [puppet] - 10https://gerrit.wikimedia.org/r/429182 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [13:18:02] (03PS2) 10Zfilipin: Change chapcomwiki's logo, add HD logo for chapcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429178 (https://phabricator.wikimedia.org/T193024) (owner: 10Urbanecm) [13:19:09] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:428952|Add all Hindi projects plus meta as import sources for hiwikimedia (T188366)]] (duration: 01m 17s) [13:19:12] Urbanecm: 428952 is deployed, but nothing you can do, right? [13:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:16] T188366: Create Hindi Wikimedian User Group Site - https://phabricator.wikimedia.org/T188366 [13:19:25] Yes, I'm not an sysop/steward. [13:19:47] Urbanecm: 429178 is testable at mwdebug, or deploying directly? 
[13:20:16] Please deploy directly as well [13:20:29] Urbanecm: ok, will ping you when deployed and cache purged [13:20:33] ack [13:22:40] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429178 (https://phabricator.wikimedia.org/T193024) (owner: 10Urbanecm) [13:22:54] Urbanecm: argh, [13:22:57] forgot [13:23:00] what happened? [13:23:04] argh, new keyboard :D [13:23:16] space and enter very near :D [13:23:29] anyway, I forgot about the new rule for deployments, and merged 429178 [13:23:44] it should not be deployed by new rules, it would have to be split into two commits [13:23:45] I'm not aware about a new rule [13:23:48] let me find the diff [13:24:16] Urbanecm: https://wikitech.wikimedia.org/w/index.php?title=SWAT_deploys&type=revision&diff=1789212&oldid=1777024 [13:24:18] (03PS1) 10Jcrespo: mariadb: Switch db1086 row format to statement, this time for real [puppet] - 10https://gerrit.wikimedia.org/r/429184 [13:24:38] no problem for this patch, and it's my mistake for merging it, but for future reference [13:24:41] (03Merged) 10jenkins-bot: Change chapcomwiki's logo, add HD logo for chapcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429178 (https://phabricator.wikimedia.org/T193024) (owner: 10Urbanecm) [13:24:42] Sure [13:24:54] (03PS2) 10Jcrespo: mariadb: Switch db1086 row format to statement, this time for real [puppet] - 10https://gerrit.wikimedia.org/r/429184 [13:24:58] (03CR) 10jenkins-bot: Change chapcomwiki's logo, add HD logo for chapcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429178 (https://phabricator.wikimedia.org/T193024) (owner: 10Urbanecm) [13:25:01] Urbanecm: see T187761 for info [13:25:01] T187761: Proposal: Effective immediately, disallow multi-sync patch deployment - https://phabricator.wikimedia.org/T187761 [13:25:13] Well...it can be single-sync [13:25:15] scap sync :D [13:25:27] Will have a look [13:25:32] well, for this kind of patch, it's an overkill [13:25:44] It 
is, but it will give you just one sync :) [13:25:47] and I am aware of dependencies, so no problem [13:26:12] What should be the patch-size in the future? Upload static files in one patch and change IS.php in another patch (depending on the first one)? [13:26:22] Urbanecm: yes [13:26:51] Well...at least now I really cannot see the rationale, but I'll read the task and ask later if I'll have questions [13:27:01] the point is one sync per patch, since that is how our CI tests changes, we had some problems with patches that require multiple syncs [13:27:33] (03CR) 10Jcrespo: [C: 032] mariadb: Switch db1086 row format to statement, this time for real [puppet] - 10https://gerrit.wikimedia.org/r/429184 (owner: 10Jcrespo) [13:27:38] comment on the task if you have any questions, I think one of your patches (similar to this one) is an example there [13:29:04] saw it :) [13:30:10] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:429178|Change chapcomwikis logo, add HD logo for chapcomwiki (T193024)]] (duration: 01m 16s) [13:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:16] T193024: Change AffCom Wiki logo - https://phabricator.wikimedia.org/T193024 [13:31:37] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:429178|Change chapcomwikis logo, add HD logo for chapcomwiki (T193024)]] (duration: 01m 16s) [13:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:06] Urbanecm: 429178 deployed, cache purged, please check and thanks for deploying with #releng again ;) [13:34:20] One of weeks when I'm using every EU SWAT :D [13:34:22] will do [13:34:44] that is worth a t-shirt :D [13:35:05] I did not break wikipedia, but I have tried [13:35:07] ;) [13:35:16] !log EU SWAT finished [13:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:32] Working, thanks for your deployment [13:35:57] In fact, the SWAT process 
didn't do what is required from it [13:36:18] "SWAT (...) is responsible for breaking the site on a regular basis" [13:36:22] :D [13:36:34] we do our best, but the tooling these days... [13:36:42] does not let you shoot yourself in the foot [13:37:44] Well...delete /srv/mediawiki-staging/wmf-config/InitialiseSettings.php and I'm sure something will be broken :D [13:38:24] (of course with a sync :)) [13:41:56] can I work with mediawiki config, right? [13:42:01] *I can [13:46:34] (03PS1) 10Jcrespo: mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) [13:47:43] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [13:48:07] (03PS2) 10Jcrespo: mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) [13:51:31] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [13:51:33] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/429175 (https://phabricator.wikimedia.org/T106381) (owner: 10Muehlenhoff) [13:53:45] (03Merged) 10jenkins-bot: mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [13:55:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1069, repool db1086 (duration: 01m 16s) [13:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:17] !log Compress enwiki on db1116:3311 - T190704 [13:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:23] T190704: Convert all sanitarium hosts to
multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [13:58:26] (03PS4) 10Fdans: Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - 10https://gerrit.wikimedia.org/r/428390 [13:59:48] (03CR) 10jenkins-bot: mariadb: Depool db1069, repool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429192 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [14:05:13] (03PS1) 10ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) [14:05:53] (03CR) 10jerkins-bot: [V: 04-1] pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: 10ArielGlenn) [14:06:37] (03CR) 10Ottomata: "I also notice that throughout the script and puppet class, you refer to file paths as "DIR" "LOCATION" and "ROUTE". Let's be consistent! 
" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428390 (owner: 10Fdans) [14:20:00] fdans: for fun you can check out this old test someone made of the maxmind database and its accuracy: https://meta.wikimedia.org/wiki/MaxMindCityTesting [14:20:08] if you want to *really* have fun, you can update the results [14:22:21] milimetric: wow that really does sound like fun [14:23:40] :) [14:25:02] !log stop db1069 for cloning it away [14:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:30] milimetric: with our world wide coverage, maybe one day we will end up building our own geoip database :] [14:26:07] hashar: while that'd be cool, I'm not sure we have quite enough coverage :) [14:26:07] !log Running populateRevisionLength.php on group 1 for T192189 [14:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:14] T192189: RevisionArchiveRecord incorrectly changes null ar_len to 0 - https://phabricator.wikimedia.org/T192189 [14:27:12] anomie: how long was ar_len set incorrectly for? And is this script updating it for all history when you're done? [14:30:38] milimetric: rev_len was set incorrectly for undeletions that happened between 1.31.0-wmf.23 and 1.31.0-wmf.30 (fixed in 1.32.0-wmf.1). Subsequent deletions may have copied the error to ar_len, and moves or other things that copy the existing revision row might have copied the incorrect value. This is updating the whole history, yes. [14:31:13] great, thanks anomie [14:31:59] milimetric: ... Clarification: that's only undeletions of old revisions where ar_len was NULL, not all undeletions. [14:32:35] And this run will also be populating all those old revisions where rev_len or ar_len is null. 
[14:33:43] that's ok, I was checking to see how it would affect stats, and it's a relatively short period of time, and relatively small set of articles impacted, so while numbers will change I don't think it'll shift the overall metrics too much [14:37:51] (03CR) 10Eevans: "> IMHO it would be simpler to set the default in cassandra::sysctl to" [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [14:51:47] !log ppchelko@tin Started deploy [changeprop/deploy@f2f7a84]: Commit offsets for non matched messages from time to time. [14:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:14] !log ppchelko@tin Finished deploy [changeprop/deploy@f2f7a84]: Commit offsets for non matched messages from time to time. (duration: 01m 26s) [14:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:38] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:02:48] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:02:59] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:03:09] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:03:18] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:03:28] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:03:49] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:09:24] (03PS1) 10Arturo Borrero Gonzalez: labs_bootstrapvz: firstboot.sh: bring back some resolv.conf magic [puppet] - 10https://gerrit.wikimedia.org/r/429211 (https://phabricator.wikimedia.org/T181523) [15:09:33] (03PS1) 10Jcrespo: mariadb: Add db1090:s7 to configuration [puppet] - 10https://gerrit.wikimedia.org/r/429212 (https://phabricator.wikimedia.org/T192979) 
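(The recurring stat1005 alerts in this log — "Return code of 255 is out of bounds" — reflect the Nagios/NRPE plugin convention: only exit codes 0–3 are valid service states, and 255 usually means the check could not run at all, e.g. a failed remote execution, rather than a real disk/RAID problem. That is why every check on the host flips CRITICAL and later recovers together. A sketch of the mapping, with the out-of-bounds message mirroring the log text:)

```python
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def interpret(exit_code):
    # Plugin exit codes 0-3 map to service states; anything else is not a
    # real check result and is reported as out of bounds (treated as
    # CRITICAL), as seen for stat1005 above.
    if exit_code in NAGIOS_STATES:
        return NAGIOS_STATES[exit_code]
    return "CRITICAL: Return code of %d is out of bounds" % exit_code

print(interpret(255))  # CRITICAL: Return code of 255 is out of bounds
```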
[15:10:49] (03PS1) 10Jcrespo: dbhosts: Add db1090:s7 to configuration [software] - 10https://gerrit.wikimedia.org/r/429213 [15:13:10] (03CR) 10Marostegui: [C: 031] mariadb: Add db1090:s7 to configuration [puppet] - 10https://gerrit.wikimedia.org/r/429212 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:13:22] (03CR) 10Marostegui: [C: 031] dbhosts: Add db1090:s7 to configuration [software] - 10https://gerrit.wikimedia.org/r/429213 (owner: 10Jcrespo) [15:14:05] (03PS1) 10Jcrespo: mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) [15:16:21] (03CR) 10Marostegui: mariadb: Change db1090 to be a multiinstance host for s2 and s7 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:17:36] (03CR) 10Jcrespo: "I had no plans to do the repooling yet, but I agree better to add it now (even depooled) to avoid accidents." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:18:37] !log added LDAP user tschumann to "nda" group (T192549) [15:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:44] T192549: LDAP access for group 'nda' for Tobias Schumann (WMDE) - https://phabricator.wikimedia.org/T192549 [15:21:45] (03PS2) 10Jcrespo: mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) [15:23:06] (03CR) 10Jcrespo: "^ping" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:23:19] (03CR) 10Marostegui: mariadb: Change db1090 to be a multiinstance host for s2 and s7 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:25:28] RECOVERY - Disk space on stat1005 is OK: DISK OK [15:25:28] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [15:25:30] (03PS3) 10Jcrespo: mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) [15:25:39] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [15:25:48] (03CR) 10Marostegui: [C: 031] mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:25:49] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [15:25:55] (03CR) 10Jcrespo: "Not a pain, I think it is very useful. I may have asked you something similar in the past." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:26:09] RECOVERY - DPKG on stat1005 is OK: All packages OK [15:26:18] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [15:27:33] (03CR) 10Jcrespo: [C: 032] mariadb: Add db1090:s7 to configuration [puppet] - 10https://gerrit.wikimedia.org/r/429212 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:28:07] (03CR) 10Jcrespo: [C: 032] dbhosts: Add db1090:s7 to configuration [software] - 10https://gerrit.wikimedia.org/r/429213 (owner: 10Jcrespo) [15:28:28] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:29:07] (03PS1) 10Chad: group2 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429216 [15:29:28] (03CR) 10Chad: [C: 04-2] "for later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429216 (owner: 10Chad) [15:30:02] (03CR) 10Jcrespo: [C: 032] mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:30:11] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4161601 (10Lea_WMDE) [15:31:37] (03Merged) 10jenkins-bot: mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:31:53] (03CR) 10jenkins-bot: mariadb: Change db1090 to be a multiinstance host for s2 and s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429214 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:33:24] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF 
production cluster - https://phabricator.wikimedia.org/T190717#4161611 (10Lea_WMDE) @MoritzMuehlenhoff as discussed I'm checking in at the end of April :) Is there any news about the wikidiff2 update sche... [15:33:33] jynus: marostegui: can i steal tin for a mwconfig deploy from you for 10-15 mins? [15:33:49] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Thu 2018-04-26 15:33:47 UTC. [15:33:50] can you wait 3 minutes? [15:33:55] I just merged a change [15:34:31] sure sure [15:34:39] just ping me when you are done [15:35:19] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:19] !log jynus@tin Synchronized wmf-config/db-codfw.php: Add db1090 as multiinstance (duration: 01m 17s) [15:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:50] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Add db1090 as multiinstance (duration: 01m 16s) [15:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:06] (03PS7) 10Mobrovac: Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:40:06] there could be issues? [15:40:23] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:41:28] but it is not the deploy [15:41:43] there are issues with refreshcount deadlocks [15:41:55] (03CR) 10Mobrovac: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:42:31] mobrovac: you can continue [15:42:46] kk thnx jynus [15:46:31] (03CR) 10Mobrovac: [C: 032] Disable Redis queue for most of jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:48:19] (03PS1) 10Ottomata: Temporarly remove partman recipe for kafka main hosts [puppet] - 10https://gerrit.wikimedia.org/r/429218 (https://phabricator.wikimedia.org/T192832) [15:49:19] (03CR) 10Ottomata: [C: 032] Temporarly remove partman recipe for kafka main hosts [puppet] - 10https://gerrit.wikimedia.org/r/429218 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [15:49:50] 10Operations, 10fundraising-tech-ops, 10netops: NAT for new fundraising bastion - https://phabricator.wikimedia.org/T193177#4161644 (10cwdent) [15:51:17] 10Operations, 10fundraising-tech-ops, 10netops: NAT for new fundraising bastion - https://phabricator.wikimedia.org/T193177#4161671 (10cwdent) [15:51:30] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:54:22] (03CR) 10Mobrovac: [C: 032] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:55:39] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:55:59] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [15:56:40] (03PS3) 10Filippo Giunchedi: profile: install SMART checks after 'raid' fact is available. 
[puppet] - 10https://gerrit.wikimedia.org/r/428947 (https://phabricator.wikimedia.org/T132324) [15:56:42] (03PS1) 10Filippo Giunchedi: memcached: deprecate Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) [15:56:47] (03PS1) 10Filippo Giunchedi: elasticsearch: deprecate Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/429222 (https://phabricator.wikimedia.org/T183454) [15:56:49] (03PS1) 10Filippo Giunchedi: ores: deprecate Diamond redis collector [puppet] - 10https://gerrit.wikimedia.org/r/429223 (https://phabricator.wikimedia.org/T183454) [15:56:54] (03PS1) 10Filippo Giunchedi: Deprecate Diamond pdns collectors [puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) [15:56:56] let's see how many -1s [15:56:58] (03PS1) 10Filippo Giunchedi: Deprecate Diamond tcpconnstate and nfconntrackcount [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) [15:57:24] (03CR) 10jerkins-bot: [V: 04-1] memcached: deprecate Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:57:36] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: deprecate Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/429222 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:58:06] (03CR) 10jerkins-bot: [V: 04-1] Deprecate Diamond pdns collectors [puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:58:11] (03CR) 10Awight: [C: 031] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/429223 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:58:40] hashar: what's wrong with jenkins? 
- https://integration.wikimedia.org/ci/job/operations-mw-config-typos/19085/console [15:58:53] i keep getting that for different tests [15:59:26] (03CR) 10Mobrovac: [V: 032 C: 032] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [16:00:04] godog, moritzm, and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1600). [16:00:04] thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:51] (03CR) 10jenkins-bot: Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [16:01:42] * thcipriani waves [16:01:48] I am around [16:02:04] is it as trivial as it looks? [16:02:10] !log ppchelko@tin Started deploy [cpjobqueue/deploy@bf34e00]: Enable all jobs for test, test2, testwikidata and mediawiki. T190327 [16:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:16] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [16:02:23] jynus: please wait a sec, still syncing [16:02:33] 30 seconds, to be precise [16:02:50] jynus: indeed it should be :) it's already on beta installs a program and modifies a config value that isn't used in prod yet. [16:03:02] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@bf34e00]: Enable all jobs for test, test2, testwikidata and mediawiki. 
T190327 (duration: 00m 51s) [16:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:10] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: JobQueue: Use EventBus for most jobs for test wikis - T190327 (duration: 01m 15s) [16:03:12] we will wait for mobrovac to finish using scap, ok? [16:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:27] sounds good to me :) [16:03:31] (03PS2) 10Filippo Giunchedi: memcached: deprecate Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) [16:03:33] (03PS2) 10Filippo Giunchedi: elasticsearch: deprecate Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/429222 (https://phabricator.wikimedia.org/T183454) [16:03:35] (03PS2) 10Filippo Giunchedi: ores: deprecate Diamond redis collector [puppet] - 10https://gerrit.wikimedia.org/r/429223 (https://phabricator.wikimedia.org/T183454) [16:03:37] (03PS2) 10Filippo Giunchedi: Deprecate Diamond pdns collectors [puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) [16:03:39] (03PS2) 10Filippo Giunchedi: Deprecate Diamond tcpconnstate and nfconntrackcount [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) [16:03:40] thcipriani: I guess it will touch mainly terbium and the other passive deployment hosts, right? [16:03:47] ok i'm done, jynus you are good to go [16:03:56] (03PS2) 10ArielGlenn: pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) [16:04:40] (03PS2) 10Jcrespo: Scap: MediaWiki Canary: setup swagger checks [puppet] - 10https://gerrit.wikimedia.org/r/428721 (https://phabricator.wikimedia.org/T136839) (owner: 10Thcipriani) [16:05:37] jynus: it will update the scap.cfg on many hosts, but that should mostly be a no-op. 
The only place it installs new software should be on tin, naos, and the new deployment machine, deployment1001 [16:05:44] s/mostly// [16:05:45] (03CR) 10jerkins-bot: [V: 04-1] pull phabricator dumps from phab server to dumps web server [puppet] - 10https://gerrit.wikimedia.org/r/429197 (https://phabricator.wikimedia.org/T188726) (owner: 10ArielGlenn) [16:06:03] sorry, I actually meant tin [16:06:14] yes [16:06:20] (03CR) 10Mobrovac: "Ok, there are two things here, so I'll try to address them separately." [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [16:06:28] my mind is distracted [16:06:57] (03CR) 10Jcrespo: [C: 032] Scap: MediaWiki Canary: setup swagger checks [puppet] - 10https://gerrit.wikimedia.org/r/428721 (https://phabricator.wikimedia.org/T136839) (owner: 10Thcipriani) [16:08:05] running puppet on tin [16:08:09] 10Operations, 10fundraising-tech-ops, 10netops: NAT for new fundraising bastion - https://phabricator.wikimedia.org/T193177#4161775 (10ayounsi) a:03ayounsi ```lang=diff [edit security nat static rule-set static-nat] + rule frbast1001 { + match { + destination-address 208.80.155.8... [16:08:20] 10Operations, 10fundraising-tech-ops, 10netops: NAT for new fundraising bastion - https://phabricator.wikimedia.org/T193177#4161777 (10ayounsi) 05Open>03Resolved [16:08:51] thcipriani: for testing, do we deploy something? [16:10:48] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4161795 (10BBlack) Note this will involve a planned ulsfo site outage, with its traffic falling back to codfw. If things go well the outage should be brief, the 5h estimate above is... 
[16:10:51] (03PS1) 10Volans: wmf-auto-reimage: verify BIOS boot parameters [puppet] - 10https://gerrit.wikimedia.org/r/429229 [16:10:53] (03PS1) 10Volans: wmf-auto-reimage: allow to mask systemd services [puppet] - 10https://gerrit.wikimedia.org/r/429230 [16:11:07] (03PS1) 10Jcrespo: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 [16:11:40] (03CR) 10Gehel: [C: 031] "All good, we're not using those metrics directly anymore." [puppet] - 10https://gerrit.wikimedia.org/r/429222 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [16:12:34] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 (owner: 10Jcrespo) [16:12:38] PROBLEM - mediawiki-installation DSH group on mw1229 is CRITICAL: Host mw1229 is not in mediawiki-installation dsh group [16:12:45] jynus: currently scap doesn't use this, so just making sure that service-checker-swagger was installed is mostly the test :) [16:12:51] ah, ok [16:12:59] (03CR) 10Chad: Add gerrit.wmfusercontent.org DNS entry (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [16:13:19] let me do the dummy deploy anyway to test there was no regression :-) [16:13:43] sure thing, never a bad plan :) [16:14:19] (03PS3) 10Eevans: cassandra: increase `vm.max_map_count` to 1048575 [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) [16:15:00] has anyone ever seen the CI error "stderr: error: unable to write file wmf-config/wikitech.php" [16:18:25] (03PS2) 10Jcrespo: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 [16:19:38] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 (owner: 10Jcrespo) [16:21:40] (03PS3) 10Jcrespo: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 
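(The "MediaWiki Canary: setup swagger checks" patch being merged and smoke-tested above wires endpoint checks into the canary step. In outline, such a checker walks the paths declared in a service's swagger spec and fails the deploy if any endpoint misbehaves. The sketch below illustrates that idea with a stubbed fetcher; it is not service-checker-swagger's actual code, and the spec/paths are invented.)

```python
def check_endpoints(spec, fetch):
    # Walk every path declared in the swagger spec, hit it via `fetch`
    # (a callable path -> HTTP status), and collect non-2xx results.
    # A canary deploy would abort if this list is non-empty.
    failures = []
    for path in spec.get("paths", {}):
        status = fetch(path)
        if not 200 <= status < 300:
            failures.append((path, status))
    return failures

spec = {"paths": {"/healthz": {}, "/spec": {}}}   # invented minimal spec
responses = {"/healthz": 200, "/spec": 503}       # stubbed canary answers
bad = check_endpoints(spec, responses.get)        # [("/spec", 503)]
```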
[16:22:56] (03CR) 10Dzahn: "hmm.. i tend to say let's use misc varnish because phab.wmfusercontent.org does as well" [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [16:23:05] (03PS4) 10Jcrespo: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 [16:23:16] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4161855 (10Krinkle) The currently known run-time issues with MediaWiki on PHP7 and/or HHVM have been fixed (mainly T184854)... [16:23:43] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4161858 (10Krinkle) I've put a straw-man up at T176370#4161855. [16:24:19] showing https://integration.wikimedia.org/ci/job/operations-mw-config-typos/19088/console for "line is too long" [16:24:25] is a bit misleading [16:25:05] (03CR) 10Jcrespo: [C: 032] mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 (owner: 10Jcrespo) [16:26:27] (03Merged) 10jenkins-bot: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 (owner: 10Jcrespo) [16:26:53] (03PS5) 10Andrew Bogott: Don't include mediawiki::multimedia on labweb* [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [16:28:52] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Fix comment, test scap (duration: 01m 12s) [16:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:55] (03CR) 10jenkins-bot: mariadb: Fix comment on db1090:s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429231 (owner: 10Jcrespo) [16:31:26] (03PS1) 10EBernhardson: Lower CirrusSearch delayed job drop to 2 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429233 [16:33:12] 10Operations, 10netops:
ulsfo<->eqord BGP down - https://phabricator.wikimedia.org/T192114#4161917 (10ayounsi) 05Open>03Resolved TTL fixed. Sessions up. [16:35:18] (03CR) 10Andrew Bogott: [C: 032] Don't include mediawiki::multimedia on labweb* [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [16:36:09] (03CR) 10Andrew Bogott: [C: 032] "Thanks for your patience with this special case" [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [16:38:47] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [16:39:22] (03CR) 10Chad: "My only concern is that we'd also be exposing Gerrit itself, not just the non-proxied webroot stuff." [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [16:40:57] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:41:08] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:42:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:42:57] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [16:43:57] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:45:08] RECOVERY - HTTP availability for Varnish on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [16:46:35] (03PS2) 10Muehlenhoff: Remove obsolete mediawiki::packages::fonts from mediawiki::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/428886 [16:50:33] (03PS2) 10Herron: scap::target: List allowed service commands, instead of wildcard [puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [16:51:07] (03CR) 10jerkins-bot: [V: 04-1] scap::target: List allowed service commands, instead of wildcard [puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [16:51:10] (03PS3) 10Herron: scap::target: List allowed service commands, instead of wildcard [puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [16:51:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:51:57] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [16:57:34] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete mediawiki::packages::fonts from mediawiki::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/428886 (owner: 10Muehlenhoff) [16:57:40] (03CR) 10Herron: [C: 031] "Added a few more commands. Seems ok to me but would like feedback from RelEng before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1700). [17:00:14] no parsoid deploy today [17:00:21] ORES has a slightly exciting bunch of work to deploy. [17:00:49] !log installing systemd SUA update for stretch [17:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:20] (03CR) 10DCausse: [C: 031] Lower CirrusSearch delayed job drop to 2 hours [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429233 (owner: 10EBernhardson) [17:04:12] (03PS1) 10Andrew Bogott: bootstrap_vz: re-order the ldap phases [puppet] - 10https://gerrit.wikimedia.org/r/429239 [17:04:49] (03PS2) 10Ottomata: Enable eventbus Kafka producer snappy compression [puppet] - 10https://gerrit.wikimedia.org/r/429007 (https://phabricator.wikimedia.org/T193080) [17:05:54] (03CR) 10Muehlenhoff: "When it's all standardised to a common recipe, we can simply apply this to mw[12]*" [puppet] - 10https://gerrit.wikimedia.org/r/429175 (https://phabricator.wikimedia.org/T106381) (owner: 10Muehlenhoff) [17:05:57] !log awight@tin Started deploy [ores/deploy@5b27205]: ORES: update to revscoring 2.2.2, T192917 [17:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:03] T192917: Rebuild all models for revscoring 2.2.2 - https://phabricator.wikimedia.org/T192917 [17:06:12] (03PS1) 10Herron: rsyslog: send auth,authpriv.* to central log hosts [puppet] - 10https://gerrit.wikimedia.org/r/429240 [17:06:18] (03CR) 10Ottomata: [C: 032] Enable eventbus Kafka producer snappy compression [puppet] - 10https://gerrit.wikimedia.org/r/429007 (https://phabricator.wikimedia.org/T193080) (owner: 10Ottomata) [17:06:22] 10Operations, 10CirrusSearch, 10Discovery, 10Search-Platform-Programs, 
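The patch merged above ("Enable eventbus Kafka producer snappy compression", gerrit 429007) and the later `!log` entry amount to a single client-side Kafka producer setting. A rough illustration of what such a producer config looks like — the option names follow common Kafka producer config conventions, and the broker hostname is a placeholder, not the actual eventbus configuration:

```python
# Illustrative only: the real change lives in puppet (gerrit change 429007).
# Option names mirror standard Kafka producer configs; the broker host below
# is a placeholder.
def eventbus_producer_config(compress=True):
    config = {
        "bootstrap.servers": "kafka1001.eqiad.wmnet:9092",  # placeholder broker
        "acks": "all",  # assumed durability setting
    }
    if compress:
        # The setting enabled by the patch above: message batches are
        # snappy-compressed client-side before being sent to the brokers.
        config["compression.type"] = "snappy"
    return config

print(eventbus_producer_config()["compression.type"])  # → snappy
```

Because compression is negotiated per producer, enabling it needs no broker-side change — only the service restart noted in the log.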
10Discovery-Search (Current work): Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4162020 (10EBjune) [17:06:35] (03PS2) 10Andrew Bogott: bootstrap_vz: re-order the ldap phases [puppet] - 10https://gerrit.wikimedia.org/r/429239 [17:07:10] (03CR) 10Andrew Bogott: [C: 032] bootstrap_vz: re-order the ldap phases [puppet] - 10https://gerrit.wikimedia.org/r/429239 (owner: 10Andrew Bogott) [17:07:44] 10Operations, 10CirrusSearch, 10Discovery, 10Search-Platform-Programs, 10Discovery-Search (Current work): Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4162026 (10debt) p:05Unbreak!>03High [17:07:48] (03PS3) 10Muehlenhoff: Remove obsolete mediawiki multimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/428634 [17:08:52] (03CR) 10Herron: [C: 04-2] "Needs discussion before merging" [puppet] - 10https://gerrit.wikimedia.org/r/429240 (owner: 10Herron) [17:09:09] !log applying compression_type=snappy to eventbus service kafka producer [17:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:09] PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [17:10:32] mutante: Heads-up that the ORES venv patch is currently landing on production. When the deployment is finished, it would be great if you could rm -rf the old directory, if you have the time. [17:11:40] awight: is it time-sensitive? i am happy to do that but i have to go afk for like.. maybe 45min [17:13:45] mutante: No, there’s no rush. I wanted you to be aware of the change in general, but the cleanup can happen any time, it’s just for sanity and not anything functional. [17:14:04] Thanks for the help on beta! 
[17:15:01] (03CR) 10Mobrovac: [C: 031] cassandra: increase `vm.max_map_count` to 1048575 [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [17:15:27] awight: will do! thanks [17:15:29] bbl [17:18:49] 10Operations, 10fundraising-tech-ops, 10netops: New PFW policy - https://phabricator.wikimedia.org/T193189#4162058 (10cwdent) [17:24:01] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967#4162092 (10Gehel) Removing discovery / search from this ticket, since it is really not related to search. [17:27:17] !log awight@tin Finished deploy [ores/deploy@5b27205]: ORES: update to revscoring 2.2.2, T192917 (duration: 21m 20s) [17:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:25] T192917: Rebuild all models for revscoring 2.2.2 - https://phabricator.wikimedia.org/T192917 [17:31:46] Finished deploying ORES and it looks healthy :-) [17:31:46] (03CR) 10Gehel: "minor comment inline" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429233 (owner: 10EBernhardson) [17:32:15] 10Operations, 10Puppet, 10Analytics, 10Cassandra, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10Eevans) The restbase cluster has been upgraded package-wise, but a rolling restart still needs to be scheduled. [17:32:32] (03CR) 10Muehlenhoff: [C: 031] "Looks good, thanks. I'll smoketest that with a job runner reimage next week." [puppet] - 10https://gerrit.wikimedia.org/r/429230 (owner: 10Volans) [17:34:22] 10Operations, 10Puppet, 10Analytics, 10Cassandra, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10MoritzMuehlenhoff) >>! In T192948#4162125, @Eevans wrote: > The restbase cluster has been upgraded package-wise, but a rolling rest... 
[17:35:09] RECOVERY - puppet last run on mw2286 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:36:29] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4162142 (10MarcoAurelio) Logs are also quite heavy: ``` maurelio@deployment-cpjobqueue:/srv/log/cpjobqueue$ sudo ls -lash * 11G -rw-r--r-- 1 cpjobqueue cpjobqueue 11G Apr 25 00:57 main... [17:37:53] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4162148 (10awight) [17:40:16] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [17:40:40] <_joe_> uhm [17:41:06] checking [17:41:09] what's up? [17:41:23] I have the same question [17:41:29] just got the page, the hdfs namenode went down apparently [17:41:36] it should be failed over to an1002 in theory [17:41:40] going to check and report back [17:45:59] we are discussing in #analytics what to do, but basically it seems that there was a problem with the journal nodes and the hdfs namenode on an1001 decided to shutdown [17:47:51] (03Abandoned) 10Chad: Move wiktionary and foundationwiki docroots to standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/402090 (https://phabricator.wikimedia.org/T126306) (owner: 10Chad) [17:48:26] (03PS3) 10Chad: Swap mediawiki.org to use standard docroot naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/421949 [17:50:13] (03PS4) 10Chad: Gerrit: Move all logging to /var/log [puppet] - 10https://gerrit.wikimedia.org/r/423794 [17:50:23] (03CR) 10Paladox: "I wonder would this still work it being behind varnish?" 
[dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [17:51:36] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4162202 (10MarcoAurelio) 05Open>03Resolved a:03mobrovac Fixed by @mobrovac. Thanks. [17:51:45] Ack thanks for the update elukey [17:53:52] (03CR) 10Smalyshev: "> Patch Set 2: Code-Review+1" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [17:54:19] (03CR) 10Paladox: "> My only concern is that we'd also be exposing Gerrit itself, not" [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [17:54:38] (03CR) 10Muehlenhoff: [C: 031] "Looks good, we can give it a smoke test with mw1221 next week." [puppet] - 10https://gerrit.wikimedia.org/r/429229 (owner: 10Volans) [17:56:41] (03PS1) 10Herron: install_server: reinstall mx1001 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/429241 (https://phabricator.wikimedia.org/T175361) [17:57:44] (03PS1) 10Imarlier: webperfX001: start using the webperf role [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) [17:58:47] (03CR) 10Imarlier: "Exactly the same change as 392030, with the exception of adding to the dsh target list for the webperf group. Now safe due to no longer d" [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [17:59:15] (03PS3) 10Chad: Gerrit: Run directly from deployment location [puppet] - 10https://gerrit.wikimedia.org/r/423801 [17:59:33] 10Puppet, 10Analytics, 10Beta-Cluster-Infrastructure, 10ChangeProp, and 3 others: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4162216 (10mobrovac) [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. 
Time for a Morning SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:05:47] (03PS1) 10Muehlenhoff: debdeploy-deploy: Sort modified packages [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/429244 [18:07:48] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [18:09:11] ^ argon is fine, systemd update logged above [18:09:26] RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [18:10:31] 10Operations, 10fundraising-tech-ops, 10netops: New PFW policy - https://phabricator.wikimedia.org/T193189#4162232 (10ayounsi) 05Open>03Resolved a:03ayounsi Pushed. ``` $ nc -zv 208.80.155.8 22 Connection to 208.80.155.8 22 port [tcp/ssh] succeeded! ``` [18:12:02] !log reimaging (some?) 
kafka200* codfw main kafka nodes to stretch T192832 [18:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:09] T192832: Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832 [18:17:43] (03PS2) 10Muehlenhoff: Sort results of debdeploy-deploy, debdeploy-restarts and debdeploy-pkgversion [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/429244 [18:17:48] (03PS1) 10ArielGlenn: generate checksums on a per job basis, updating the hash as needed [dumps] - 10https://gerrit.wikimedia.org/r/429245 [18:18:29] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete mediawiki multimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/428634 (owner: 10Muehlenhoff) [18:23:47] so analytics1001 is back on track [18:23:59] we are not super sure of what happened to the journal nodes [18:24:04] but we are going to investigate it [18:24:14] also the page-all for those hosts is probably not great [18:24:20] so we'll remove the critical [18:27:56] (03PS1) 10Elukey: profile::hadoop::master: avoid paging all for process down [puppet] - 10https://gerrit.wikimedia.org/r/429251 [18:28:22] ottomata: --^ [18:29:14] (03CR) 10Elukey: [C: 032] profile::hadoop::master: avoid paging all for process down [puppet] - 10https://gerrit.wikimedia.org/r/429251 (owner: 10Elukey) [18:29:49] (03PS1) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 [18:30:32] (03CR) 10jerkins-bot: [V: 04-1] Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (owner: 10Imarlier) [18:36:18] (03PS2) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 [18:36:38] (03PS1) 10Cmjohnson: adding dhcpd and netboot.cfg for lvs1016 [puppet] - 10https://gerrit.wikimedia.org/r/429254 (https://phabricator.wikimedia.org/T184293) [18:37:04] (03CR) 10jerkins-bot: [V: 04-1] Make webperf role install coal things [puppet] - 
10https://gerrit.wikimedia.org/r/429252 (owner: 10Imarlier) [18:37:12] (03PS3) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 [18:37:47] (03CR) 10jerkins-bot: [V: 04-1] Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (owner: 10Imarlier) [18:37:52] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:38:39] (03PS4) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 [18:39:13] (03CR) 10jerkins-bot: [V: 04-1] Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (owner: 10Imarlier) [18:39:52] (03PS5) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [18:41:59] (03PS2) 10Cmjohnson: adding dhcpd and netboot.cfg for lvs1016 [puppet] - 10https://gerrit.wikimedia.org/r/429254 (https://phabricator.wikimedia.org/T184293) [18:42:40] (03CR) 10Cmjohnson: [C: 032] adding dhcpd and netboot.cfg for lvs1016 [puppet] - 10https://gerrit.wikimedia.org/r/429254 (https://phabricator.wikimedia.org/T184293) (owner: 10Cmjohnson) [18:44:56] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4162375 (10Cmjohnson) @ayounsi Can you create a subnet for LVS for row D please. 
[18:45:28] (03PS6) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [18:51:24] (03PS7) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [18:56:41] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 extra NIC connections - https://phabricator.wikimedia.org/T193196#4162403 (10chasemp) [18:56:56] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 extra NIC connections - https://phabricator.wikimedia.org/T193196#4162415 (10chasemp) p:05Triage>03Normal [18:58:59] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 extra NIC connections - https://phabricator.wikimedia.org/T193196#4162417 (10chasemp) [18:59:04] (03PS8) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [19:00:04] no_justification: Time to snap out of that daydream and deploy MediaWiki train. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T1900). [19:03:37] (03PS9) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [19:05:27] ^ misread that as cool things [19:06:54] (03CR) 10Dzahn: "if the concern is exposing /r/, wouldn't that be the same whether we serve it directly or cache it?" 
[dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [19:07:56] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4162432 (10chasemp) [19:08:02] (03CR) 10Chad: [C: 032] group2 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429216 (owner: 10Chad) [19:08:09] (03PS1) 10Muehlenhoff: Switch scap proxy in C6 to mw1320 [puppet] - 10https://gerrit.wikimedia.org/r/429260 [19:08:25] no_justification i think for that ^^ we could resolve the security concern by not using an alias, instead defining a new virtual host for gerrit.wmfusercontent.org and getting it to look into the avatars folder we want it to. [19:09:20] (03Merged) 10jenkins-bot: group2 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429216 (owner: 10Chad) [19:09:27] I wasn't going to use an alias, but yes, you're right. [19:09:42] (03CR) 10jenkins-bot: group2 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429216 (owner: 10Chad) [19:09:46] I wasn't thinking right [19:10:04] let's lookup the ticket that set it up for phab.wmfusercontent [19:10:09] no_justification: uh. If you're gonna do .1... [19:10:11] maybe there are comments from traffic [19:10:15] https://gerrit.wikimedia.org/r/#/c/429250/ https://phabricator.wikimedia.org/T193191 [19:10:50] Which nobody tagged as blocker :( [19:11:06] Bad James_F :P [19:11:19] (he was multi tasking) [19:11:45] paladox: RT !:p [19:11:52] heh [19:11:57] paladox: on https://phabricator.wikimedia.org/rOPUP351f9c354beca351bde5436abb67e880b696e2f3 [19:12:03] the new certificate requested in RT: 8212 [19:12:14] lol [19:12:25] We could use letsencrypt for this? [19:12:44] probably, yea [19:13:01] well, it depends [19:13:06] if we serve it directly, yes [19:13:19] if we want to do it like phab, no [19:13:26] oh [19:13:27] but we already have star.wmfusercontent.org no matter what [19:13:32] so .. 
we just use that [19:13:38] *. [19:15:46] in 2014 "and i would request *.wmfusercontent.org right away, i" heh [19:15:50] it paid off i guess [19:18:45] paladox: so RT 8212 isn't in Phab because it's a procurement ticket. but that was the one to buy this cert [19:18:53] oh [19:19:35] but let's try these: RT 7483, RT 8345 [19:19:38] mutante: Can we differentiate by vhost or does it have to be by port? [19:19:47] (03PS1) 10Ottomata: Use /etc/prometheus as config_dir for kafka broker jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/429262 (https://phabricator.wikimedia.org/T192832) [19:19:59] I s'pose varnish sets the Host header. [19:21:02] (03CR) 10Ottomata: [C: 032] Use /etc/prometheus as config_dir for kafka broker jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/429262 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [19:21:09] Sorry, yeah, should have tagged it. [19:21:24] cmjohnson1: merging your lvs1016 change [19:21:33] (03PS1) 10Rush: openstack: fixup keystone logrotate [puppet] - 10https://gerrit.wikimedia.org/r/429264 (https://phabricator.wikimedia.org/T193048) [19:21:39] ottomata: thx...sorry got sidetracked [19:22:46] no_justification: by host, but there is varnish VCL code for that [19:22:56] no_justification: we should just try to copy the phab.wmfusercontent setup [19:23:00] but there is this: [19:23:07] // Block WP Zero users from accessing Phabricator uploads to prevent abuse [19:23:11] if (req.http.Host == "phab.wmfusercontent.org") { [19:23:12] .. etc [19:23:18] We won't need that on gerrit [19:23:22] Nobody can upload that here [19:23:23] (03PS2) 10Rush: openstack: fixup keystone logrotate [puppet] - 10https://gerrit.wikimedia.org/r/429264 (https://phabricator.wikimedia.org/T193048) [19:23:24] WP zero is going away this year [19:24:54] and also per no_justification [19:25:04] no_justification: if we put it behind varnish we can just use the *.wmfusercontent.org cert and done. 
if we serve it directly then that would require copying the star unified cert to the gerrit machine which we probably dont want to do [19:25:42] we could use Letsencrypt and serve it directly with some other name.. but the first option is what we got the wildcard cert for [19:26:55] we could also copy other parts, like $altdom = hiera('phabricator_altdomain', 'phab.wmfusercontent.org'), and related [19:27:30] (03PS1) 10Muehlenhoff: Stop installing oggvideotools [puppet] - 10https://gerrit.wikimedia.org/r/429265 [19:29:55] (03PS3) 10Rush: openstack: fixup keystone logrotate [puppet] - 10https://gerrit.wikimedia.org/r/429264 (https://phabricator.wikimedia.org/T193048) [19:29:56] the varnish setup in hieradata/role/common/cache/misc.yaml has a single director called 'phabricator' and 2 host names are pointing to the same directory. the backend Apache then has virtual hosts [19:30:07] s/directory/director [19:31:05] https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/cache/misc.yaml#L101 [19:37:23] mutante: Yeah so the dns bit will be fine [19:38:01] (03CR) 10Paladox: [C: 031] Add gerrit.wmfusercontent.org DNS entry [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [19:38:17] (03PS2) 10Dzahn: Add gerrit.wmfusercontent.org DNS entry [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [19:38:22] 10Operations, 10WMDE-QWERTY-Team, 10wikidiff2, 10Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4162491 (10MoritzMuehlenhoff) @Lea_WMDE : We're making good progress with the stretch migration, we should be good to start the wikidiff roll... [19:38:41] (03CR) 10Dzahn: [C: 032] Add gerrit.wmfusercontent.org DNS entry [dns] - 10https://gerrit.wikimedia.org/r/428869 (owner: 10Chad) [19:40:19] we should also have a redirect like http://phab.wmfusercontent.org/ [19:41:47] but in Apache while Phab is doing it in PHP(?)
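The scheme discussed above — one varnish director for both hostnames, the backend differentiating by Host header, plus a redirect for anything outside the exposed webroot on the usercontent domain — can be sketched as a toy routing function. The hostnames come from the discussion; the `/avatars/` path and the redirect-everything-else behavior are assumptions for illustration, not the actual varnish VCL or Apache config:

```python
# Toy illustration of Host-header routing as discussed above; not actual
# varnish VCL or Apache vhost config. The /avatars/ path is an assumption.
GERRIT_CANONICAL = "https://gerrit.wikimedia.org/"

def route(host, path):
    """Return ("serve", path) or ("redirect", url) for an incoming request."""
    if host == "gerrit.wmfusercontent.org":
        # Only the non-proxied user-content webroot is meant to be exposed on
        # this domain; everything else (e.g. Gerrit itself under /r/) bounces
        # to the canonical host, addressing the exposure concern raised above.
        if path.startswith("/avatars/"):
            return ("serve", path)
        return ("redirect", GERRIT_CANONICAL)
    return ("serve", path)

print(route("gerrit.wmfusercontent.org", "/r/changes/"))
# → ('redirect', 'https://gerrit.wikimedia.org/')
```

This mirrors why the alias approach was rejected: routing must be decided per-path on the backend, not just per-hostname at the edge.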
[19:43:03] yep [19:45:55] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#4162505 (10chasemp) 05Open>03Resolved These are now debian jessie shoutout to @robh for helping me work through some install issues :) [19:46:10] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): tools-k8s-master-01 has two floating IPs - https://phabricator.wikimedia.org/T164123#4162508 (10chasemp) a:05chasemp>03None [19:46:12] could be: [19:46:17] RewriteCond %{HTTP_HOST} gerrit.wmfusercontent.org$ [19:46:18] RewriteRule (.*) https://gerrit.wikimedia.org/ [P] [19:46:22] no_justification mutante ^^ [19:47:54] No.... [19:48:02] I'll do it later [19:48:33] ok [19:56:30] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4162517 (10bd808) a:05madhuvishy>03None [20:09:28] !log demon@tin rebuilt and synchronized wikiversions files: group2 to wmf.1 [20:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:57] (03CR) 10Brion VIBBER: [C: 031] "No longer in use since thumbnailing moved to thumbor (and falls back to ffmpeg anyway)" [puppet] - 10https://gerrit.wikimedia.org/r/429265 (owner: 10Muehlenhoff) [20:29:31] !log contint1001: cleaned up old Docker images produced by docker-pkg [20:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:40] (03PS1) 10Andrew Bogott: bootstrap-vz: rearrange nscd/nslcd refreshes [puppet] - 10https://gerrit.wikimedia.org/r/429338 [20:34:29] (03CR) 10Andrew Bogott: [C: 032] bootstrap-vz: rearrange nscd/nslcd refreshes [puppet] - 10https://gerrit.wikimedia.org/r/429338 (owner: 10Andrew Bogott) [20:38:44] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: 1.662e+05 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:38:54] PROBLEM - 
kubelet operational latencies on kubernetes1001 is CRITICAL: 7.425e+04 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:39:04] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: 6.633e+04 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:39:44] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: (C)1.5e+04 ge (W)1e+04 ge 4821 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:39:54] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: (C)1.5e+04 ge (W)1e+04 ge 4872 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:40:05] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: (C)1.5e+04 ge (W)1e+04 ge 4205 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:45:32] (03Draft1) 10MarcoAurelio: idwikimedia: register on DNS [dns] - 10https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) [20:45:35] (03PS2) 10MarcoAurelio: idwikimedia: register on DNS [dns] - 10https://gerrit.wikimedia.org/r/429339 (https://phabricator.wikimedia.org/T192726) [20:45:44] (03Draft1) 10MarcoAurelio: idwikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/429342 (https://phabricator.wikimedia.org/T192726) [20:45:49] (03PS2) 10MarcoAurelio: idwikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/429342 (https://phabricator.wikimedia.org/T192726) [20:46:43] (03CR) 10Rush: [C: 032] openstack: fixup keystone logrotate [puppet] - 10https://gerrit.wikimedia.org/r/429264 (https://phabricator.wikimedia.org/T193048) (owner: 10Rush) [20:46:48] (03PS4) 10Rush: openstack: fixup keystone logrotate [puppet] - 10https://gerrit.wikimedia.org/r/429264 (https://phabricator.wikimedia.org/T193048) [20:51:13] (03PS1) 10Herron: WIP: icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 
(https://phabricator.wikimedia.org/T82937) [20:51:39] (03CR) 10jerkins-bot: [V: 04-1] WIP: icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 (https://phabricator.wikimedia.org/T82937) (owner: 10Herron) [20:52:56] (03PS2) 10Herron: WIP: icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 (https://phabricator.wikimedia.org/T82937) [20:54:48] (03PS2) 10Imarlier: coal: remove files that aren't needed any longer [puppet] - 10https://gerrit.wikimedia.org/r/428980 (https://phabricator.wikimedia.org/T191994) [20:56:11] (03CR) 10Imarlier: "Tagging a bunch of people who have merge access to puppet -- sorry about the review-spam." [puppet] - 10https://gerrit.wikimedia.org/r/428980 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [21:00:04] MaxSem and kaldari: How many deployers does it take to do Redeploy ArticleCreationWorkflow deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T2100). [21:00:06] (03PS3) 10MarcoAurelio: idwikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/429342 (https://phabricator.wikimedia.org/T192726) [21:00:24] no_justification: are we clear to proceed? [21:00:36] Proceed....? [21:00:49] just 1 I imagine [21:01:01] in other words, is your deployment done, no_justification? [21:01:07] Been done [21:01:12] wee [21:01:54] MaxSem: I have a 1:1 meeting with Danny right now. Do you need me for testing or do you have it covered? 
[21:02:14] I guess I can handle it kaldari
[21:05:59] !log maxsem@tin Synchronized php-1.32.0-wmf.1/extensions/ArticleCreationWorkflow/: https://gerrit.wikimedia.org/r/#/c/429111/ (duration: 01m 00s)
[21:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:33] (03PS3) 10MaxSem: Redeploy ArticleCreationWorkflow, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455)
[21:07:38] (03CR) 10MaxSem: [C: 032] Redeploy ArticleCreationWorkflow, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[21:09:01] (03Merged) 10jenkins-bot: Redeploy ArticleCreationWorkflow, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[21:09:54] (03CR) 10jenkins-bot: Redeploy ArticleCreationWorkflow, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[21:13:18] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/429017 (duration: 00m 59s)
[21:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:03] !log maxsem@tin Started scap: Deploy ACW to test wikis, https://gerrit.wikimedia.org/r/429017 / T192455
[21:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:09] T192455: Permanently implement autoconfirmed-account-requirement for new article creation on en.wiki - https://phabricator.wikimedia.org/T192455
[21:20:57] (03PS1) 10Bstorm: wiki replicas: add GRANT statement to $wiki_p database creation [puppet] - 10https://gerrit.wikimedia.org/r/429349 (https://phabricator.wikimedia.org/T188490)
[21:21:36] (03PS1) 10Andrew Bogott: bootstrapvz: one more attempt to properly order nscd and nslcd restarts [puppet] - 10https://gerrit.wikimedia.org/r/429350
[21:22:19] (03CR) 10Andrew Bogott: [C: 032] bootstrapvz: one more attempt to properly order nscd and nslcd restarts [puppet] - 10https://gerrit.wikimedia.org/r/429350 (owner: 10Andrew Bogott)
[21:29:45] (03PS2) 10MaxSem: Redeploy ArticleCreationWorkflow, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455)
[21:38:34] 10Operations, 10Traffic, 10Wikimedia-Shop, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#2078914 (10Pcoombe) The store HSTS header now has `max-age=31557600`, but still no `includeSubDomains` or `preload`.
[21:43:35] 10Puppet, 10Beta-Cluster-Infrastructure, 10MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), 10Patch-For-Review: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4162818 (10MarcoAurelio) @EddieGP deployment-cpjobqueue puppet was broken due to disk full; this was f...
[21:44:28] hmm, "Updating LocalisationCache for 1.32.0-wmf.1 using 10 thread(s)"'s been running for 30 minutes now :O
[21:47:19] is that still running slow due to hhvm?
[21:48:06] yup
[21:48:26] booo
[21:48:43] didn't realise that the 40 minute long run reported is for a single step
[21:51:42] 10Puppet, 10Beta-Cluster-Infrastructure, 10MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), 10Patch-For-Review: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4162850 (10MarcoAurelio) Also, there are stalled jobs in the `job` table: ``` wikiadmin@deployment-db...
[22:11:10] !log maxsem@tin Finished scap: Deploy ACW to test wikis, https://gerrit.wikimedia.org/r/429017 / T192455 (duration: 57m 06s)
[22:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:11:17] T192455: Permanently implement autoconfirmed-account-requirement for new article creation on en.wiki - https://phabricator.wikimedia.org/T192455
[22:12:11] (03CR) 10Niharika29: [C: 031] "You'd have to SWAT this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: 10Dbarratt)
[22:13:50] (03CR) 10MaxSem: [C: 032] Redeploy ArticleCreationWorkflow, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[22:13:52] (03CR) 10Niharika29: "This will go out in SWAT?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[22:15:06] (03Merged) 10jenkins-bot: Redeploy ArticleCreationWorkflow, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[22:19:42] (03CR) 10jenkins-bot: Redeploy ArticleCreationWorkflow, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem)
[22:21:43] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/429100/ (duration: 01m 00s)
[22:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:27] kaldari: we're live ^
[22:22:37] cool
[22:23:37] (03Abandoned) 10Jcrespo: Add ferm service for mariadb_dbproxy [puppet] - 10https://gerrit.wikimedia.org/r/316341 (owner: 10Muehlenhoff)
[22:24:34] MaxSem: I confirmed that it's working
[22:30:28] (03PS1) 10Jgreen: A/PTR for frbast1001.wikimedia.org and service cnames for frack bastions [dns] - 10https://gerrit.wikimedia.org/r/429354 (https://phabricator.wikimedia.org/T193178)
[22:34:11] (03CR) 10Jgreen: [C: 032] A/PTR for frbast1001.wikimedia.org and service cnames for frack bastions [dns] - 10https://gerrit.wikimedia.org/r/429354 (https://phabricator.wikimedia.org/T193178) (owner: 10Jgreen)
[22:38:35] !log deployed DNS update for frbast1001.wikimedia.org
[22:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:05] hey, anybody know if we changed poppler-utils package recently?
[22:40:33] getting reports of new PDFs failing to render as '0x0' which symptom matches up with having a newer pdfinfo command
[22:40:48] i have a fix in the works for PdfHandler to call pdfinfo correctly for both old and new versions
[22:41:10] but it'd be nice to confirm if the package changed recently
[22:44:40] !log start test measuring elasticsearch master mutation latency in codfw
[22:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:20] Jesus
[22:45:25] That page is well over a meg
[22:48:10] you get a meg
[22:48:12] and YOU get a meg
[22:57:00] (03PS1) 10Subramanya Sastry: Enable RemexHtml on wikis with <100 issues in high-priority linter cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429357 (https://phabricator.wikimedia.org/T192299)
[23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180426T2300).
[23:00:04] Niharika and brion: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:18] \o/
[23:01:19] oh can i add one more real quick?
[23:01:59] or should i wait to test it more :D
[23:02:23] testing? do it in production >.> *runs away*
[23:03:25] agh, i can't log in to wikitech on this laptop. yay keys
[23:03:40] anyway, https://gerrit.wikimedia.org/r/#/c/429356/ is a hotfix for PdfHandler
[23:03:45] but it can wait if necessary
[23:04:47] * Reedy looks at the swat queue
[23:05:16] (03PS3) 10Reedy: Enable CodeMirror on RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) (owner: 10Niharika29)
[23:05:20] o/
[23:05:25] I'm here.
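The '0x0' PdfHandler symptom above comes down to parsing pdfinfo output whose format shifted between poppler-utils versions. A minimal sketch of version-tolerant parsing, assuming older pdfinfo prints a line like `Page size: 612 x 792 pts` while newer builds can print per-page lines like `Page    1 size: 612 x 792 pts`; the function name is made up and the real fix is the PdfHandler change linked above:

```python
import re

def pdf_dimensions(pdfinfo_output):
    """Extract (width, height) in points from pdfinfo output.

    Matches both the old 'Page size:' label and the newer
    per-page 'Page N size:' label, so a label change doesn't
    silently produce the 0-by-0 fallback.
    """
    m = re.search(r"Page(?:\s+\d+)? size:\s*([\d.]+) x ([\d.]+)", pdfinfo_output)
    if not m:
        return (0, 0)  # the '0x0' symptom: no recognizable size line
    return (float(m.group(1)), float(m.group(2)))

# Both label variants parse to the same dimensions:
print(pdf_dimensions("Page size: 612 x 792 pts"))       # → (612.0, 792.0)
print(pdf_dimensions("Page    1 size: 612 x 792 pts"))  # → (612.0, 792.0)
```

A parser that only matched the old label would hit the `(0, 0)` branch on newer output, which is exactly the rendering failure being reported.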
[23:05:41] (03CR) 10Reedy: [C: 032] Enable CodeMirror on RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) (owner: 10Niharika29)
[23:07:05] (03Merged) 10jenkins-bot: Enable CodeMirror on RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) (owner: 10Niharika29)
[23:07:46] Niharika: Do you care about it going on mwdebug?
[23:08:47] Reedy: No preference.
[23:09:02] You'll bear the blame if things break. :P
[23:09:41] (03CR) 10jenkins-bot: Enable CodeMirror on RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) (owner: 10Niharika29)
[23:10:04] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 01m 00s)
[23:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:15] stupid paste fail
[23:10:40] codemirror is on arwiki now
[23:11:57] Reedy: All RTL wikis, right?
[23:12:47] yup
[23:13:38] Thanks!
[23:14:23] brion: Just want the UW patch pushing everywhere too?
[23:14:50] Reedy: yep, it's a fix for a previous patch so should go out all-wheres
[23:15:04] Sorry, I mean, do you want it mwdebug first?
[23:15:11] ah :D
[23:15:19] nah just put it out
[23:16:57] !log reedy@tin Synchronized php-1.32.0-wmf.1/extensions/UploadWizard/: (no justification provided) (duration: 01m 00s)
[23:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:35] heh
[23:22:18] Booo
[23:22:32] your pdfhandler patch won't trivially cherry pick to 1.30 or earlier
[23:22:51] yeah it changed from a large string to small strings i think
[23:22:56] will have to manually backport
[23:23:17] i don't think we need that for prod though, so i'll do tomorrow
[23:23:38] parameter splitting into arrays and stuff
[23:23:57] yep
[23:24:43] I'm surprised though
[23:24:48] cherry pick on cli
[23:24:52] it finds the right file etc
[23:24:55] (cause it's renamed)
[23:26:04] rename detection ftw
[23:27:09] (03PS2) 10Dzahn: webperfX001: start using the webperf role [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier)
[23:28:45] (03CR) 10Dzahn: "this removes the "perf-roots" admin group from the host specific files and the role has the admin group "perf-team". So they are not the s" [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier)
[23:31:00] why do we have "perf-team" and "perf-roots" admin groups if both of them have the exact same privileges (root) and the members also overlap, heh
[23:31:36] !log reedy@tin Synchronized php-1.32.0-wmf.1/extensions/PdfHandler/: (no justification provided) (duration: 01m 00s)
[23:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:00] ah, because perf-roots are applied on way more things.. right
[23:32:38] (03CR) 10Dzahn: "nevermind, i see both have the same privileges but are used in a different context. lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier)
[23:32:42] (03CR) 10Dzahn: [C: 032] webperfX001: start using the webperf role [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier)
[23:32:54] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 330 MB (3% inode=75%)
[23:35:13] (03CR) 10Dzahn: [C: 032] "on webperf1001/2001 all the users have been created, packages have been installed.. there is just an error that it fails to start the stat" [puppet] - 10https://gerrit.wikimedia.org/r/429242 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier)
[23:36:23] (03PS3) 10Dzahn: coal: remove files that aren't needed any longer [puppet] - 10https://gerrit.wikimedia.org/r/428980 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier)
[23:36:24] PROBLEM - Check systemd state on webperf1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:36:44] PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:37:08] Krinkle: ^ roles applied , looks like it just needed 2 puppet runs
[23:37:18] after the first one statsv wasnt running but now it is
[23:37:25] RECOVERY - Check systemd state on webperf1001 is OK: OK - running: The system is fully operational
[23:38:03] (03CR) 10Dzahn: [C: 032] coal: remove files that aren't needed any longer [puppet] - 10https://gerrit.wikimedia.org/r/428980 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier)
[23:38:44] RECOVERY - Check systemd state on webperf2001 is OK: OK - running: The system is fully operational
[23:46:52] (03CR) 10Dbarratt: "> You'd have to SWAT this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962) (owner: 10Dbarratt)
[23:48:08] (03PS3) 10Dbarratt: Disable Datetime Selector on Special:Block on all wikis except Meta, MediaWiki, and German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428854 (https://phabricator.wikimedia.org/T192962)
[23:49:01] mutante: thanks for the merge! Statsv is scap deployed, which is why it's not starting. I'm not at a computer right now (it's about 8pm here), but can take care of that in a bit. Is there anything alerting as a result? If so, can it just be silenced?
[23:52:02] Oh, hey, looks like it addressed itself. Interesting! Not sure how, but all good.
[23:53:07] These things are all atomic, either by design or via Kafka commit/statsd
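As an aside on the store.wikimedia.org HSTS report earlier in the log (the header carries `max-age=31557600` but no `includeSubDomains` or `preload`): whether a Strict-Transport-Security value meets the browser preload requirements can be checked mechanically. A minimal sketch; the helper name is made up, and the one-year `max-age` floor (31536000 seconds) reflects the published preload-list requirements (which additionally require serving the header from the root domain over HTTPS):

```python
def hsts_issues(header_value):
    """Return the preload requirements missing from an HSTS header value."""
    directives = {d.strip().lower() for d in header_value.split(";")}
    missing = []
    # Preload requires max-age of at least one year (31536000 seconds).
    max_age = next((d for d in directives if d.startswith("max-age=")), None)
    if max_age is None or int(max_age.split("=", 1)[1]) < 31536000:
        missing.append("max-age>=31536000")
    if "includesubdomains" not in directives:
        missing.append("includeSubDomains")
    if "preload" not in directives:
        missing.append("preload")
    return missing

# The header reported for the store passes the max-age check
# but is missing the other two directives:
print(hsts_issues("max-age=31557600"))  # → ['includeSubDomains', 'preload']
```

With `includeSubDomains` and `preload` added, the same check returns an empty list, i.e. the header would satisfy the directive requirements.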