[00:02:32] PROBLEM - Nginx local proxy to apache on mw2256 is CRITICAL: connect to address 10.192.16.55 and port 443: Connection refused
[00:02:32] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2242 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:02:32] PROBLEM - Check systemd state on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:02:32] PROBLEM - Check size of conntrack table on mw2255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:04:12] PROBLEM - Check whether ferm is active by checking the default input chain on mw2242 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:04:12] PROBLEM - configured eth on mw2242 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:04:12] PROBLEM - Nginx local proxy to apache on mw2255 is CRITICAL: connect to address 10.192.16.54 and port 443: Connection refused
[00:04:12] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:04:12] PROBLEM - Check systemd state on mw2255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:05:53] PROBLEM - DPKG on mw2242 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:05:53] PROBLEM - dhclient process on mw2242 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:05:53] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:05:53] PROBLEM - Check whether ferm is active by checking the default input chain on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:05:53] PROBLEM - configured eth on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:07:42] PROBLEM - mediawiki-installation DSH group on mw2242 is CRITICAL: Host mw2242 is not in mediawiki-installation dsh group
[00:07:42] PROBLEM - Check whether ferm is active by checking the default input chain on mw2255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:07:42] PROBLEM - configured eth on mw2255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:07:42] PROBLEM - dhclient process on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:07:42] PROBLEM - DPKG on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:09:22] PROBLEM - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group
[00:09:22] PROBLEM - DPKG on mw2255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:09:22] PROBLEM - dhclient process on mw2255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:09:22] PROBLEM - Disk space on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:11:02] PROBLEM - mediawiki-installation DSH group on mw2255 is CRITICAL: Host mw2255 is not in mediawiki-installation dsh group
[00:11:02] PROBLEM - Disk space on mw2255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:11:03] PROBLEM - HHVM processes on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:11:03] PROBLEM - nutcracker port on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:11:12] PROBLEM - HHVM rendering on mw2242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:12:43] PROBLEM - nutcracker process on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:14:42] PROBLEM - HHVM rendering on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:14:52] PROBLEM - HHVM rendering on mw2256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:16:22] PROBLEM - Apache HTTP on mw2242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:16:22] PROBLEM - MD RAID on mw2242 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:16:23] PROBLEM - Check whether ferm is active by checking the default input chain on mw2242 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:16:23] PROBLEM - configured eth on mw2242 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:16:32] PROBLEM - nutcracker port on mw2242 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[00:17:02] RECOVERY - dhclient process on mw2242 is OK: PROCS OK: 0 processes with command name dhclient
[00:17:02] RECOVERY - DPKG on mw2242 is OK: All packages OK
[00:17:12] PROBLEM - nutcracker process on mw2242 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker
[00:17:13] RECOVERY - MD RAID on mw2242 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[00:17:22] RECOVERY - Check whether ferm is active by checking the default input chain on mw2242 is OK: OK ferm input default policy is set
[00:17:22] RECOVERY - configured eth on mw2242 is OK: OK - interfaces up
[00:18:03] PROBLEM - Apache HTTP on mw2256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:19:42] PROBLEM - Nginx local proxy to apache on mw2242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.191 second response time
[00:19:42] PROBLEM - Check systemd state on mw2242 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:19:52] PROBLEM - Apache HTTP on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:20:12] PROBLEM - Disk space on mw2255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:20:22] PROBLEM - Check systemd state on mw2255 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:20:32] RECOVERY - dhclient process on mw2255 is OK: PROCS OK: 0 processes with command name dhclient
[00:20:32] RECOVERY - DPKG on mw2255 is OK: All packages OK
[00:20:42] PROBLEM - nutcracker process on mw2255 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker
[00:20:43] RECOVERY - Check size of conntrack table on mw2255 is OK: OK: nf_conntrack is 0 % full
[00:21:10] RECOVERY - Disk space on mw2255 is OK: DISK OK
[00:22:39] PROBLEM - Disk space on mw2256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:22:49] PROBLEM - nutcracker port on mw2255 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[00:22:50] PROBLEM - Check systemd state on mw2256 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:22:59] RECOVERY - Check whether ferm is active by checking the default input chain on mw2256 is OK: OK ferm input default policy is set
[00:23:00] RECOVERY - configured eth on mw2256 is OK: OK - interfaces up
[00:23:11] PROBLEM - nutcracker port on mw2256 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[00:23:11] RECOVERY - HHVM processes on mw2256 is OK: PROCS OK: 6 processes with command name hhvm
[00:23:19] RECOVERY - nutcracker process on mw2242 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[00:23:30] RECOVERY - Disk space on mw2256 is OK: DISK OK
[00:23:39] RECOVERY - nutcracker port on mw2242 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[00:23:40] RECOVERY - Check systemd state on mw2242 is OK: OK - running: The system is fully operational
[00:23:49] RECOVERY - Nginx local proxy to apache on mw2242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.987 second response time
[00:24:10] RECOVERY - HHVM rendering on mw2242 is OK: HTTP OK: HTTP/1.1 200 OK - 75522 bytes in 0.402 second response time
[00:24:46] (CR) Chad: [C: 2] Drop MEDIAWIKI_DBLIST_DIR, no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/428844 (owner: Chad)
[00:24:49] PROBLEM - Nginx local proxy to apache on mw2257 is CRITICAL: connect to address 10.192.16.56 and port 443: Connection refused
[00:24:49] PROBLEM - Check systemd state on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:25:13] What's up with those codfw apaches?
[00:25:19] RECOVERY - Apache HTTP on mw2242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.148 second response time
[00:25:41] (CR) Chad: [C: 2] Drop MEDIAWIKI_DIRECTORY_REGEX & MEDIAWIKI_VERSION_REGEX unused [mediawiki-config] - https://gerrit.wikimedia.org/r/428845 (owner: Chad)
[00:26:16] (Merged) jenkins-bot: Drop MEDIAWIKI_DBLIST_DIR, no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/428844 (owner: Chad)
[00:26:30] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:26:57] (Merged) jenkins-bot: Drop MEDIAWIKI_DIRECTORY_REGEX & MEDIAWIKI_VERSION_REGEX unused [mediawiki-config] - https://gerrit.wikimedia.org/r/428845 (owner: Chad)
[00:27:00] (PS4) Chad: Apache: Move all private wikis to a single vhost block [puppet] - https://gerrit.wikimedia.org/r/422571
[00:27:39] RECOVERY - Check systemd state on mw2255 is OK: OK - running: The system is fully operational
[00:27:50] RECOVERY - nutcracker process on mw2255 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[00:27:59] RECOVERY - nutcracker port on mw2255 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[00:27:59] RECOVERY - Apache HTTP on mw2255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 7.810 second response time
[00:28:19] PROBLEM - Check whether ferm is active by checking the default input chain on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:28:19] PROBLEM - configured eth on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:28:39] RECOVERY - Nginx local proxy to apache on mw2255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.202 second response time
[00:28:50] RECOVERY - HHVM rendering on mw2255 is OK: HTTP OK: HTTP/1.1 200 OK - 75522 bytes in 0.400 second response time
[00:29:10] (CR) jenkins-bot: Drop MEDIAWIKI_DBLIST_DIR, no longer used [mediawiki-config] - https://gerrit.wikimedia.org/r/428844 (owner: Chad)
[00:30:00] RECOVERY - Check systemd state on mw2256 is OK: OK - running: The system is fully operational
[00:30:00] PROBLEM - dhclient process on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:30:00] PROBLEM - DPKG on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:30:00] RECOVERY - DPKG on mw2256 is OK: All packages OK
[00:30:10] (Abandoned) Chad: WIP: Add git::config{} for calling `git config` on repositories. [puppet] - https://gerrit.wikimedia.org/r/416200 (owner: Chad)
[00:30:11] RECOVERY - nutcracker port on mw2256 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[00:30:49] RECOVERY - HHVM rendering on mw2256 is OK: HTTP OK: HTTP/1.1 200 OK - 75524 bytes in 5.871 second response time
[00:31:00] RECOVERY - Nginx local proxy to apache on mw2256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.202 second response time
[00:31:00] !log demon@tin Synchronized multiversion/defines.php: rm unused defines (duration: 01m 16s)
[00:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:40] PROBLEM - mediawiki-installation DSH group on mw2257 is CRITICAL: Host mw2257 is not in mediawiki-installation dsh group
[00:31:40] PROBLEM - Disk space on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:32:10] RECOVERY - Apache HTTP on mw2256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 3.397 second response time
[00:32:29] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2242 is OK: OK: synced at Wed 2018-04-25 00:32:26 UTC.
[00:33:20] PROBLEM - nutcracker port on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:33:20] PROBLEM - HHVM processes on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:34:10] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2256 is OK: OK: synced at Wed 2018-04-25 00:34:06 UTC.
[00:34:36] (PS1) Chad: Add gerrit.wmfusercontent.org DNS entry [dns] - https://gerrit.wikimedia.org/r/428869
[00:34:49] RECOVERY - dhclient process on mw2256 is OK: PROCS OK: 0 processes with command name dhclient
[00:35:09] PROBLEM - HHVM rendering on mw2257 is CRITICAL: connect to address 10.192.16.56 and port 80: Connection refused
[00:35:09] PROBLEM - nutcracker process on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:35:59] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2255 is OK: OK: synced at Wed 2018-04-25 00:35:52 UTC.
[00:36:29] RECOVERY - Check whether ferm is active by checking the default input chain on mw2255 is OK: OK ferm input default policy is set
[00:36:29] RECOVERY - nutcracker process on mw2256 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[00:36:50] PROBLEM - puppet last run on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:40:00] RECOVERY - configured eth on mw2255 is OK: OK - interfaces up
[00:40:20] PROBLEM - MD RAID on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:41:59] PROBLEM - Check size of conntrack table on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:44:20] PROBLEM - HHVM rendering on mw2257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:45:29] PROBLEM - Apache HTTP on mw2257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:51:20] PROBLEM - nutcracker process on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:51:20] PROBLEM - dhclient process on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:51:20] PROBLEM - DPKG on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:51:29] RECOVERY - Apache HTTP on mw2257 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time
[00:51:29] PROBLEM - MD RAID on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:51:30] PROBLEM - configured eth on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:51:30] PROBLEM - Check whether ferm is active by checking the default input chain on mw2257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[00:51:39] PROBLEM - nutcracker port on mw2257 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[00:51:39] RECOVERY - HHVM processes on mw2257 is OK: PROCS OK: 6 processes with command name hhvm
[00:51:49] RECOVERY - Disk space on mw2257 is OK: DISK OK
[00:52:09] RECOVERY - Check size of conntrack table on mw2257 is OK: OK: nf_conntrack is 0 % full
[00:52:09] PROBLEM - Check systemd state on mw2257 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:52:19] RECOVERY - dhclient process on mw2257 is OK: PROCS OK: 0 processes with command name dhclient
[00:52:19] RECOVERY - DPKG on mw2257 is OK: All packages OK
[00:52:19] PROBLEM - HHVM rendering on mw2257 is CRITICAL: connect to address 10.192.16.56 and port 80: Connection refused
[00:52:29] RECOVERY - MD RAID on mw2257 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[00:52:30] RECOVERY - Check whether ferm is active by checking the default input chain on mw2257 is OK: OK ferm input default policy is set
[00:52:30] RECOVERY - configured eth on mw2257 is OK: OK - interfaces up
[00:55:10] RECOVERY - Nginx local proxy to apache on mw2257 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.926 second response time
[00:55:29] RECOVERY - HHVM rendering on mw2257 is OK: HTTP OK: HTTP/1.1 200 OK - 75535 bytes in 7.664 second response time
[00:56:41] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2257 is OK: OK: synced at Wed 2018-04-25 00:56:35 UTC.
[00:59:31] RECOVERY - Long running screen/tmux on restbase1010 is OK: OK: No SCREEN or tmux processes detected.
[01:00:22] RECOVERY - nutcracker process on mw2257 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[01:00:51] RECOVERY - nutcracker port on mw2257 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[01:01:12] RECOVERY - Check systemd state on mw2257 is OK: OK - running: The system is fully operational
[01:07:01] RECOVERY - puppet last run on mw2257 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:14:57] Operations, Performance-Team, Patch-For-Review: Move coal from graphite#001 nodes to webperf#001 - https://phabricator.wikimedia.org/T159354#4156092 (Krinkle) @Imarlier I landed it as-is. Nevermind about using the `/etc/wikimedia-cluster` file ([puppet](https://github.com/wikimedia/puppet/blob/0b915...
[01:34:25] (PS2) Bstorm: wiki replicas: index script should be able to operate on one DB [puppet] - https://gerrit.wikimedia.org/r/428550
[01:34:50] (CR) Bstorm: wiki replicas: index script should be able to operate on one DB (1 comment) [puppet] - https://gerrit.wikimedia.org/r/428550 (owner: Bstorm)
[01:34:59] (PS3) Bstorm: wiki replicas: index script should be able to operate on one DB [puppet] - https://gerrit.wikimedia.org/r/428550
[01:36:31] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=75%)
[02:01:11] RECOVERY - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.116 port 9042
[02:31:41] RECOVERY - mediawiki-installation DSH group on mw2257 is OK: OK
[02:36:02] PROBLEM - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.114 and port 9042: Connection refused
[02:36:31] PROBLEM - cassandra-a SSL 10.64.0.114:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[02:46:32] RECOVERY - cassandra-a SSL 10.64.0.114:7001 on restbase1010 is OK: SSL OK - Certificate restbase1010-a valid until 2018-08-17 16:11:05 +0000 (expires in 114 days)
[02:47:02] RECOVERY - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.114 port 9042
[02:48:03] Operations, ops-eqiad, Cassandra, hardware-requests, and 3 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4156115 (Eevans) All 3 instances of 1010 have been bootstrapped.
[02:55:24] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.30) (duration: 07m 23s)
[02:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:07:41] RECOVERY - mediawiki-installation DSH group on mw2242 is OK: OK
[03:09:21] RECOVERY - mediawiki-installation DSH group on mw2256 is OK: OK
[03:11:01] RECOVERY - mediawiki-installation DSH group on mw2255 is OK: OK
[03:26:11] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 708.46 seconds
[03:37:13] https://phabricator.wikimedia.org/T192866
[03:37:59] could someone look into this please? It is important, and it is preventing work
[03:38:37] I am available if testing is needed
[03:52:34] yannf: the files are not identical
[03:53:06] ori, which files?
[03:53:41] Nouveau_Larousse_illustré,_1898,_IV.djvu is about 25k larger than Nouveau_Larousse_illustré,_1898,_IV_test.djvu
[03:57:25] (CR) BryanDavis: [C: 1] wiki replicas: index script should be able to operate on one DB [puppet] - https://gerrit.wikimedia.org/r/428550 (owner: Bstorm)
[03:59:51] (CR) Bstorm: [C: 2] wiki replicas: index script should be able to operate on one DB [puppet] - https://gerrit.wikimedia.org/r/428550 (owner: Bstorm)
[04:04:12] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 208.30 seconds
[04:08:49] ori, yes, some metadata was changed, otherwise the uploader couldn't upload it
[04:09:37] I reuploaded the old version here: https://commons.wikimedia.org/wiki/File:Nouveau_Larousse_illustr%C3%A9,_1898,_IV_test.djvu
[04:10:27] other files from the same series are still not OK: https://commons.wikimedia.org/wiki/File:Nouveau_Larousse_illustr%C3%A9,_1898,_V.djvu
[04:11:33] and even when the file looks OK on Commons, it doesn't work on WS: https://fr.wikisource.org/wiki/Livre:Nouveau_Larousse_illustr%C3%A9,_1898,_IV.djvu
[05:26:15] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1116 -
db1123 - https://phabricator.wikimedia.org/T191792#4156182 (Marostegui)
[05:27:45] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4116638 (Marostegui)
[05:35:15] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4156186 (Marostegui)
[05:36:03] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4116638 (Marostegui) @Cmjohnson I have confirmed that all the hosts with the exception of db1120 as you mentioned, are up and ready - let's keep this opened till it is fixed. T...
[05:48:07] (PS1) Marostegui: mariadb: Convert db1116 as sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/428872 (https://phabricator.wikimedia.org/T192979)
[05:49:31] (PS2) Marostegui: mariadb: Convert db1116 as sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/428872 (https://phabricator.wikimedia.org/T192979)
[05:50:25] (PS3) Marostegui: mariadb: Convert db1116 as sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/428872 (https://phabricator.wikimedia.org/T192979)
[06:13:26] (CR) Marostegui: [C: 2] mariadb: Convert db1116 as sanitarium multi-instance [puppet] - https://gerrit.wikimedia.org/r/428872 (https://phabricator.wikimedia.org/T192979) (owner: Marostegui)
[06:14:39] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/428873
[06:14:43] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/428873
[06:18:32] (PS1) Marostegui: db1116.yaml: Give it the correct shards [puppet] - https://gerrit.wikimedia.org/r/428874
[06:19:20] (CR) Marostegui: [C: 2] db1116.yaml: Give it the correct shards [puppet] -
https://gerrit.wikimedia.org/r/428874 (owner: Marostegui)
[06:27:50] Operations, Prometheus-metrics-monitoring, User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#4156221 (jcrespo) Note I was not asking it, the main improvement of 0.10.0 is multisource support, which we are moving away from. We can wait for buster.
[06:29:56] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/nf_conntrack.conf]
[06:30:26] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[06:40:05] (PS1) Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/428876 (https://phabricator.wikimedia.org/T190704)
[06:41:38] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/428876 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[06:42:57] (Merged) jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/428876 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[06:44:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1106 (duration: 01m 21s)
[06:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:06] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:53:03] !log reimaging mw1314, mw1315, mw1316 (API servers) to stretch
[06:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:56] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:00:26] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run
4 minutes ago with 0 failures
[07:03:47] PROBLEM - Check whether ferm is active by checking the default input chain on ping1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[07:03:56] PROBLEM - Check systemd state on ping1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:05:31] !log starting a very slow rolling reboot of all VMs on codfw ganeti cluster, row_C nodegroup, excluding poolcounter1001 and puppetdb1001. T150532
[07:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:38] T150532: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532
[07:05:54] elukey: bohrium is on row_A so ^ this won't affect you for the next few hours
[07:11:14] ack!
[07:11:22] (PS3) Muehlenhoff: Remove obsolete fontconfig/imagemagick code from mediawiki::multimedia [puppet] - https://gerrit.wikimedia.org/r/428300
[07:13:09] Operations, vm-requests: Site: 4 VM request for pdf-render/proton - https://phabricator.wikimedia.org/T192983#4156279 (akosiaris)
[07:13:42] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4156290 (akosiaris)
[07:13:44] Operations, vm-requests: Site: 4 VM request for pdf-render/proton - https://phabricator.wikimedia.org/T192983#4156289 (akosiaris)
[07:16:40] (CR) Muehlenhoff: [C: 2] Remove obsolete fontconfig/imagemagick code from mediawiki::multimedia [puppet] - https://gerrit.wikimedia.org/r/428300 (owner: Muehlenhoff)
[07:23:00] (PS3) Marostegui: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/428873
[07:24:22] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/428873 (owner: Marostegui)
[07:25:37] (Merged) jenkins-bot:
Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/428873 (owner: Marostegui)
[07:27:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1113:3316 after alter table (duration: 01m 16s)
[07:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:30] (PS1) Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/428877 (https://phabricator.wikimedia.org/T190148)
[07:31:39] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/428877 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[07:32:51] (Merged) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/428877 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[07:34:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1085 for alter table (duration: 01m 16s)
[07:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:08] !log Deploy schema change on db1085 with replication (this will generate lag on labsdb hosts on s6) - T191519 T188299 T190148
[07:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:16] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[07:35:16] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[07:35:16] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[07:36:43] (PS1) Urbanecm: New throttle rule for cswiki Wikipedia event [mediawiki-config] - https://gerrit.wikimedia.org/r/428878 (https://phabricator.wikimedia.org/T192898)
[07:42:59] (PS6) Muehlenhoff: mediawiki::packages::fonts: Consistently use require_package [puppet] - https://gerrit.wikimedia.org/r/420670
[07:44:31]
(CR) Muehlenhoff: [C: 2] mediawiki::packages::fonts: Consistently use require_package [puppet] - https://gerrit.wikimedia.org/r/420670 (owner: Muehlenhoff)
[07:58:27] (PS1) Jcrespo: mariadb-backups: Fix configuration error on eqiad backups [puppet] - https://gerrit.wikimedia.org/r/428879
[07:59:30] Operations: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4156344 (fgiunchedi)
[07:59:38] Operations, Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4156346 (fgiunchedi)
[07:59:38] (CR) Jcrespo: [C: 2] mariadb-backups: Fix configuration error on eqiad backups [puppet] - https://gerrit.wikimedia.org/r/428879 (owner: Jcrespo)
[08:00:29] Operations: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4062129 (fgiunchedi) a:RobH>fgiunchedi
[08:07:57] (PS3) Muehlenhoff: Remove mediawiki::firejail [puppet] - https://gerrit.wikimedia.org/r/428382
[08:12:02] (CR) Muehlenhoff: [C: 2] Remove mediawiki::firejail [puppet] - https://gerrit.wikimedia.org/r/428382 (owner: Muehlenhoff)
[08:13:22] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: Traceback (most recent call last)
[08:13:32] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last)
[08:14:01] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: Traceback (most recent call last)
[08:14:11] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: Traceback (most recent call last)
[08:14:34] (PS1) Elukey: Enable meminfo_numa collector on Druid and Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/428881
[08:14:51] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: Traceback (most recent call last)
[08:17:54] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: Traceback (most recent call last)
[08:18:34] PROBLEM - IPv6 ping to eqiad on
ripe-atlas-eqiad IPv6 is CRITICAL: Traceback (most recent call last)
[08:19:01] (PS1) Jcrespo: mariadb: Depool db1090 to upgrade it and clone it to db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/428882
[08:19:53] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 0 probes of 322 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[08:19:53] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 8 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:19:53] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:19:53] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 0 probes of 317 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map
[08:23:16] !log reimaging mw1247, mw1248, mw1249 (app servers) to stretch
[08:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:50] !log eqiad-prod: add ms-be104[0-3] with minimal weight - T190081
[08:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:56] T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081
[08:25:46] (CR) Filippo Giunchedi: [C: 1] Enable meminfo_numa collector on Druid and Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/428881 (owner: Elukey)
[08:26:06] (CR) Elukey: [C: 2] Enable meminfo_numa collector on Druid and Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/428881 (owner: Elukey)
[08:26:33] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 7 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[08:28:24] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 5 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:28:47]
RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 0 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [08:34:56] (CR) Jcrespo: [C: 2] mariadb: Depool db1090 to upgrade it and clone it to db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/428882 (owner: Jcrespo) [08:34:59] (PS1) Alexandros Kosiaris: Introduce poolcounter1003 [dns] - https://gerrit.wikimedia.org/r/428883 (https://phabricator.wikimedia.org/T187297) [08:35:01] (PS1) Alexandros Kosiaris: Introduce proton{1,2}00{1,2} VMs [dns] - https://gerrit.wikimedia.org/r/428884 (https://phabricator.wikimedia.org/T192983) [08:36:11] (Merged) jenkins-bot: mariadb: Depool db1090 to upgrade it and clone it to db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/428882 (owner: Jcrespo) [08:37:22] (CR) Jonas Kress (WMDE): [C: 1] Set SPARQL services to use internal cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: Smalyshev) [08:38:41] !log Drop user_old and user_temp tables from s3 - T172664 [08:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:35] (PS1) Muehlenhoff: Remove obsolete mediawiki::packages::fonts from mediawiki::multimedia [puppet] - https://gerrit.wikimedia.org/r/428886 [08:51:24] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1090 (duration: 01m 17s) [08:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:48] (CR) Alexandros Kosiaris: [C: 2] Introduce poolcounter1003 [dns] - https://gerrit.wikimedia.org/r/428883 (https://phabricator.wikimedia.org/T187297) (owner: Alexandros Kosiaris) [08:56:04] (CR) Alexandros Kosiaris: [C: 2] Introduce proton{1,2}00{1,2} VMs [dns] - https://gerrit.wikimedia.org/r/428884 (https://phabricator.wikimedia.org/T192983) (owner: Alexandros Kosiaris) [08:59:31] PROBLEM - Apache HTTP on mwdebug1001 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:00:22] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.067 second response time [09:01:41] PROBLEM - Nginx local proxy to apache on mw1247 is CRITICAL: connect to address 10.64.48.82 and port 443: Connection refused [09:01:41] PROBLEM - Check size of conntrack table on mw1247 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:01:41] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1248 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:01:41] PROBLEM - configured eth on mw1248 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:01:41] PROBLEM - mediawiki-installation DSH group on mw1249 is CRITICAL: Host mw1249 is not in mediawiki-installation dsh group [09:01:41] PROBLEM - DPKG on mw1249 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:03:22] PROBLEM - Check systemd state on mw1247 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:03:22] PROBLEM - Check whether ferm is active by checking the default input chain on mw1248 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:03:22] PROBLEM - dhclient process on mw1248 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:03:22] PROBLEM - Disk space on mw1249 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:03:22] PROBLEM - nutcracker port on mw1249 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:04:51] ^reimage, silencing [09:05:11] PROBLEM - mediawiki-installation DSH group on mw1248 is CRITICAL: Host mw1248 is not in mediawiki-installation dsh group [09:05:11] PROBLEM - DPKG on mw1248 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[09:08:52] (PS1) Vgutierrez: Rename lvs[2004-2006] interface dependent hostnames [dns] - https://gerrit.wikimedia.org/r/428888 (https://phabricator.wikimedia.org/T191897) [09:09:45] !log stopping db1090 for maintenance [09:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:42] RECOVERY - Check size of conntrack table on mw1247 is OK: OK: nf_conntrack is 0 % full [09:12:32] RECOVERY - Disk space on mw1249 is OK: DISK OK [09:12:42] RECOVERY - configured eth on mw1248 is OK: OK - interfaces up [09:12:42] RECOVERY - DPKG on mw1249 is OK: All packages OK [09:13:12] RECOVERY - DPKG on mw1248 is OK: All packages OK [09:13:32] RECOVERY - Check whether ferm is active by checking the default input chain on mw1248 is OK: OK ferm input default policy is set [09:13:32] RECOVERY - dhclient process on mw1248 is OK: PROCS OK: 0 processes with command name dhclient [09:14:31] RECOVERY - Check whether ferm is active by checking the default input chain on ping1001 is OK: OK ferm input default policy is set [09:14:51] RECOVERY - Check systemd state on ping1001 is OK: OK - running: The system is fully operational [09:15:22] Operations, OTRS, User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#4156588 (Scoopfinder) [09:16:25] Operations, OTRS, User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#3992110 (Scoopfinder) [09:16:41] Operations, OTRS, User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#3992110 (Scoopfinder) [09:17:41] RECOVERY - Check systemd state on mw1247 is OK: OK - running: The system is fully operational [09:17:51] RECOVERY - Nginx local proxy to apache on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.067 second response time [09:18:21] (CR) Vgutierrez: [C: 2] Reset waitIndex on etcd error 401 [debs/pybal] - 
https://gerrit.wikimedia.org/r/428303 (https://phabricator.wikimedia.org/T169765) (owner: Vgutierrez) [09:18:25] (PS3) Vgutierrez: Reset waitIndex on etcd error 401 [debs/pybal] - https://gerrit.wikimedia.org/r/428303 (https://phabricator.wikimedia.org/T169765) [09:18:41] RECOVERY - nutcracker port on mw1249 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [09:18:51] (PS1) Jcrespo: mariadb: Setup db1122 as an s2 core eqiad database [puppet] - https://gerrit.wikimedia.org/r/428890 (https://phabricator.wikimedia.org/T192979) [09:20:28] (CR) Jcrespo: [C: 2] mariadb: Setup db1122 as an s2 core eqiad database [puppet] - https://gerrit.wikimedia.org/r/428890 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo) [09:25:04] (PS2) Mark Bergsma: Move BGP constants into their own module [debs/pybal] - https://gerrit.wikimedia.org/r/424580 [09:25:07] (PS2) Mark Bergsma: Split off bgp.FSM into its own module [debs/pybal] - https://gerrit.wikimedia.org/r/424581 [09:25:36] bah valentin just ahead of me, i need to rebase again ;-) [09:25:53] /o\ [09:26:16] (PS3) Mark Bergsma: Move BGP constants into their own module [debs/pybal] - https://gerrit.wikimedia.org/r/424580 [09:26:18] (PS3) Mark Bergsma: Split off bgp.FSM into its own module [debs/pybal] - https://gerrit.wikimedia.org/r/424581 [09:31:43] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1248 is OK: OK: synced at Wed 2018-04-25 09:31:35 UTC. 
[09:32:38] (CR) Mark Bergsma: [C: 1] Move BGP constants into their own module [debs/pybal] - https://gerrit.wikimedia.org/r/424580 (owner: Mark Bergsma) [09:32:54] (PS1) Jcrespo: mariadb: Allow only reimage of db1116, db1120 + upgrade of >db1089 [puppet] - https://gerrit.wikimedia.org/r/428891 (https://phabricator.wikimedia.org/T192979) [09:37:22] (CR) Vgutierrez: [C: 1] Split off bgp.FSM into its own module [debs/pybal] - https://gerrit.wikimedia.org/r/424581 (owner: Mark Bergsma) [09:37:50] (CR) Marostegui: mariadb: Allow only reimage of db1116, db1120 + upgrade of >db1089 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/428891 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo) [09:38:28] (CR) Vgutierrez: [C: 1] Move BGP constants into their own module [debs/pybal] - https://gerrit.wikimedia.org/r/424580 (owner: Mark Bergsma) [09:38:46] (CR) Mark Bergsma: [C: 2] Move BGP constants into their own module [debs/pybal] - https://gerrit.wikimedia.org/r/424580 (owner: Mark Bergsma) [09:39:17] (Merged) jenkins-bot: Move BGP constants into their own module [debs/pybal] - https://gerrit.wikimedia.org/r/424580 (owner: Mark Bergsma) [09:41:04] (CR) Jcrespo: "From the comment:" [puppet] - https://gerrit.wikimedia.org/r/428891 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo) [09:42:16] (PS6) Ema: VCL: only parse X-Connection-Properties if available [puppet] - https://gerrit.wikimedia.org/r/428580 [09:42:22] (CR) Mark Bergsma: [C: 2] Split off bgp.FSM into its own module [debs/pybal] - https://gerrit.wikimedia.org/r/424581 (owner: Mark Bergsma) [09:42:54] (Merged) jenkins-bot: Split off bgp.FSM into its own module [debs/pybal] - https://gerrit.wikimedia.org/r/424581 (owner: Mark Bergsma) [09:43:23] (CR) Marostegui: [C: 1] "Thanks, missed that part :)" [puppet] - https://gerrit.wikimedia.org/r/428891 (https://phabricator.wikimedia.org/T192979) 
(owner: Jcrespo) [09:43:43] (CR) Jcrespo: [C: 2] mariadb: Allow only reimage of db1116, db1120 + upgrade of >db1089 [puppet] - https://gerrit.wikimedia.org/r/428891 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo) [09:44:51] (CR) Marostegui: "Will you do the prometheus addition in a different commit? Just asking to make sure it is not forgotten :)" [puppet] - https://gerrit.wikimedia.org/r/428890 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo) [09:45:30] (CR) jenkins-bot: Drop MEDIAWIKI_DIRECTORY_REGEX & MEDIAWIKI_VERSION_REGEX unused [mediawiki-config] - https://gerrit.wikimedia.org/r/428845 (owner: Chad) [09:45:34] (CR) jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/428876 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui) [09:45:38] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1113:3316" [mediawiki-config] - https://gerrit.wikimedia.org/r/428873 (owner: Marostegui) [09:45:43] (CR) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/428877 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [09:45:48] (CR) jenkins-bot: mariadb: Depool db1090 to upgrade it and clone it to db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/428882 (owner: Jcrespo) [09:48:28] (PS1) Jcrespo: mariadb-auto_reimage: Reimage db1090 into stretch [puppet] - https://gerrit.wikimedia.org/r/428892 [09:48:36] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/428893 [09:48:39] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/428893 [09:49:03] (CR) Jcrespo: [C: 2] mariadb-auto_reimage: Reimage db1090 into stretch [puppet] - https://gerrit.wikimedia.org/r/428892 (owner: Jcrespo) [09:49:28] (CR) Elukey: [C: 1] Switch scap proxy in A7 to mw1268 
[puppet] - https://gerrit.wikimedia.org/r/428655 (owner: Muehlenhoff) [09:50:03] (CR) Elukey: [C: 1] Switch scap proxy in B6 to mw1285 [puppet] - https://gerrit.wikimedia.org/r/428683 (owner: Muehlenhoff) [09:50:20] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/428893 (owner: Marostegui) [09:51:06] (PS5) Jcrespo: base: Disable atop daemon everywhere [puppet] - https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) [09:51:10] (CR) Gilles: [C: 1] Remove obsolete mediawiki::packages::fonts from mediawiki::multimedia [puppet] - https://gerrit.wikimedia.org/r/428886 (owner: Muehlenhoff) [09:51:36] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/428893 (owner: Marostegui) [09:51:47] (CR) Jcrespo: [C: 2] base: Disable atop daemon everywhere [puppet] - https://gerrit.wikimedia.org/r/428579 (https://phabricator.wikimedia.org/T192551) (owner: Jcrespo) [09:52:41] (PS1) Alexandros Kosiaris: Depool poolcounter1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/428894 (https://phabricator.wikimedia.org/T150532) [09:53:40] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/428895 [09:53:50] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/428895 [09:54:06] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1085" [mediawiki-config] - https://gerrit.wikimedia.org/r/428893 (owner: Marostegui) [09:54:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1085 after alter table (duration: 01m 30s) [09:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:39] (PS2) Alexandros Kosiaris: Depool poolcounter1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/428894 
(https://phabricator.wikimedia.org/T150532) [09:54:41] (PS1) Alexandros Kosiaris: Revert "Depool poolcounter1001" [mediawiki-config] - https://gerrit.wikimedia.org/r/428896 (https://phabricator.wikimedia.org/T150532) [09:54:43] (PS1) Alexandros Kosiaris: Add poolcounter1003 to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/428897 (https://phabricator.wikimedia.org/T150532) [09:55:11] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/428895 (owner: Marostegui) [09:56:47] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/428895 (owner: Marostegui) [09:58:04] !log reimage analytics106[1,2] to Debian Stretch [09:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 (duration: 01m 16s) [09:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:42] (CR) Jcrespo: [C: 2] "I will, I don't want to add it yet (same with dblists) until the server is up." 
[puppet] - https://gerrit.wikimedia.org/r/428890 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo) [10:00:41] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/428895 (owner: Marostegui) [10:01:43] RECOVERY - mediawiki-installation DSH group on mw1249 is OK: OK [10:02:41] (PS1) Alexandros Kosiaris: Install params for poolcounter1003 [puppet] - https://gerrit.wikimedia.org/r/428898 (https://phabricator.wikimedia.org/T187297) [10:05:13] RECOVERY - mediawiki-installation DSH group on mw1248 is OK: OK [10:13:10] (PS1) Elukey: Add the possibility to configure UDF blacklist in Hive 2 server [puppet/cdh] - https://gerrit.wikimedia.org/r/428899 [10:13:21] (PS2) Muehlenhoff: Don't include mediawiki::multimedia on labweb* [puppet] - https://gerrit.wikimedia.org/r/428298 [10:15:01] (PS2) Alexandros Kosiaris: Install params for poolcounter1003 [puppet] - https://gerrit.wikimedia.org/r/428898 (https://phabricator.wikimedia.org/T187297) [10:15:08] (CR) Alexandros Kosiaris: [V: 2 C: 2] Install params for poolcounter1003 [puppet] - https://gerrit.wikimedia.org/r/428898 (https://phabricator.wikimedia.org/T187297) (owner: Alexandros Kosiaris) [10:16:00] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4156787 (Vgutierrez) @Cmjohnson we will go with stretch and raid1-lvm (modules/install_server/files/autoinstall/netboot.cfg). Could you add the production dns entries for l... [10:19:15] !starting a slow rolling reboot of all VMs on eqiad ganeti cluster, row_A nodegroup, excluding bohrium. 
T150532 [10:19:16] T150532: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532 [10:23:29] (CR) EddieGP: [C: 1] Apache: Move all private wikis to a single vhost block [puppet] - https://gerrit.wikimedia.org/r/422571 (owner: Chad) [10:27:48] I'd appreciate if anyone could do https://gerrit.wikimedia.org/r/#/c/425967/ . Unfortunately I couldn't be here for puppet swat yesterday, and won't be able to tomorrow either. [10:28:30] (PS5) Mark Bergsma: Introduce server.is_pooled and make server.pooled usage more consistent [debs/pybal] - https://gerrit.wikimedia.org/r/421053 [10:28:32] (PS1) Mark Bergsma: Rename server.pooled to .pool to indicate intent [debs/pybal] - https://gerrit.wikimedia.org/r/428900 [10:28:34] (PS1) Mark Bergsma: Remove server.is_pooled as it isn't actually used [debs/pybal] - https://gerrit.wikimedia.org/r/428901 [10:28:47] (PS1) Alexandros Kosiaris: Install params for proton[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/428902 (https://phabricator.wikimedia.org/T192983) [10:29:19] !log stopping replication, running optimize table on dbstore2001:s8 [10:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:26] (CR) jerkins-bot: [V: -1] Install params for proton[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/428902 (https://phabricator.wikimedia.org/T192983) (owner: Alexandros Kosiaris) [10:48:27] Operations, monitoring, Patch-For-Review, Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4156925 (jcrespo) with https://gerrit.wikimedia.org/r/428579 deployed, we could close this as resolved, and reevaluate later if to drop the package entirely or to ree... 
[10:49:45] Operations, monitoring, Patch-For-Review, Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4156928 (Marostegui) Open>Resolved a:jcrespo [10:50:25] Operations, monitoring, Patch-For-Review, Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (Marostegui) For easy access: Bug submitted to Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=896767 Bug submitted to upstream: upstream: https://... [10:55:15] (PS4) EddieGP: Remove wikipedia.org vhost [puppet] - https://gerrit.wikimedia.org/r/398396 [10:57:02] (CR) EddieGP: "Actually I agree with what Krinkle wrote. Let's make wikipedia.org > www.wikipedia.org a plain redirect." [puppet] - https://gerrit.wikimedia.org/r/398396 (owner: EddieGP) [11:01:27] eddiegp: I am checking https://gerrit.wikimedia.org/r/#/c/425967/, but the commit description puzzles me - isn't this code only for terbium/wasat ? [11:03:52] (PS2) Elukey: mediawiki: Disable updateArticleCount cron [puppet] - https://gerrit.wikimedia.org/r/425967 (https://phabricator.wikimedia.org/T192139) (owner: EddieGP) [11:03:57] (PS3) Elukey: mediawiki: Disable updateArticleCount cron [puppet] - https://gerrit.wikimedia.org/r/425967 (https://phabricator.wikimedia.org/T192139) (owner: EddieGP) [11:04:07] (CR) Jcrespo: [C: 1] mediawiki: Disable updateArticleCount cron [puppet] - https://gerrit.wikimedia.org/r/425967 (https://phabricator.wikimedia.org/T192139) (owner: EddieGP) [11:04:09] I update the commit msg [11:04:55] (CR) Elukey: [C: 2] mediawiki: Disable updateArticleCount cron [puppet] - https://gerrit.wikimedia.org/r/425967 (https://phabricator.wikimedia.org/T192139) (owner: EddieGP) [11:06:24] elukey: Yes, you're! Sorry for the confusion. [11:06:37] *you're right even [11:08:01] np! 
Merged and ran on terbium/wasat [11:09:32] eddiegp: another side note - as it was discovered by other opsens adding inline comments to httpd's config might lead to unexpected results (for example ServerAlias doesn't stop when it sees a "#") [11:09:57] I still need to open a bug upstream, but in the meantime let's try to avoid them [11:10:10] (I saw one in a code review for a rewrite rule) [11:11:45] elukey: Good catch, I'll have a look at my open apache changes. Could easily be that I used that somewhere, not sure where though. [11:11:57] And thanks for the merge! [11:12:01] (PS1) Jcrespo: mariadb: Reenable notifications on db1122, add to prometheus [puppet] - https://gerrit.wikimedia.org/r/428903 (https://phabricator.wikimedia.org/T192979) [11:12:06] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1970 bytes in 0.093 second response time [11:12:52] eddiegp: thank you for the cleanup work :) [11:13:04] :) [11:13:25] (CR) Jcrespo: [C: 2] mariadb: Reenable notifications on db1122, add to prometheus [puppet] - https://gerrit.wikimedia.org/r/428903 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo) [11:15:47] (PS1) Jcrespo: mariadb: Add db1122 to s2 host list [software] - https://gerrit.wikimedia.org/r/428904 [11:17:41] (CR) Jcrespo: [C: 2] mariadb: Add db1122 to s2 host list [software] - https://gerrit.wikimedia.org/r/428904 (owner: Jcrespo) [11:19:35] !log reimaging mw1228, mw1229, mw1230 (API servers) to stretch [11:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:21] (PS1) Jcrespo: mariadb: Add, but not pool yet, new server db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/428906 (https://phabricator.wikimedia.org/T192979) [11:22:06] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1974 bytes in 0.093 second response 
time [11:23:44] (CR) Jcrespo: [C: 2] mariadb: Add, but not pool yet, new server db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/428906 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo) [11:24:58] (Merged) jenkins-bot: mariadb: Add, but not pool yet, new server db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/428906 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo) [11:25:27] (CR) jenkins-bot: mariadb: Add, but not pool yet, new server db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/428906 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo) [11:29:16] !log jynus@tin Synchronized wmf-config/db-codfw.php: Add db1122 (duration: 03m 24s) [11:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:35] PROBLEM - HTTP on install1002 is CRITICAL: connect to address 208.80.154.22 and port 80: Connection refused [11:29:45] PROBLEM - TFTP service on install1002 is CRITICAL: Return code of 255 is out of bounds [11:29:45] PROBLEM - Check whether ferm is active by checking the default input chain on install1002 is CRITICAL: Return code of 255 is out of bounds [11:29:51] 1228,29,30 failed as expected [11:29:55] PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
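Editorial aside on elukey's 11:09:32 note about inline comments in httpd config: Apache's documentation states that only lines beginning with "#" are comments and that comments may not share a line with a directive, so a trailing "# ..." ends up as extra arguments to the directive. Below is a minimal sketch of a lint for that pattern. It was not discussed in the channel; the function name `find_inline_comments` and the sample vhost lines are hypothetical, and the whitespace-then-"#" heuristic is an assumption that can produce false positives (for example a "#" inside a quoted argument or a rewrite pattern).

```python
import re


def find_inline_comments(config_text):
    """Return (line_number, line) pairs where a '#' follows a directive.

    Assumption: any '#' preceded by whitespace on a line that is not a
    whole-line comment is treated as a suspected inline comment.
    """
    suspects = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and whole-line comments are fine
        if re.search(r"\s#", stripped):
            suspects.append((number, stripped))
    return suspects


# Hypothetical sample input, not taken from any real vhost file.
SAMPLE = """\
# whole-line comment: safe
ServerName example.org
ServerAlias www.example.org # old alias, remove later
RewriteRule ^/foo$ /bar [R=301,L]
"""

if __name__ == "__main__":
    for number, line in find_inline_comments(SAMPLE):
        print(f"line {number}: {line}")
```

Run against a config file's contents, this only reports candidates for manual review; it does not parse Apache syntax.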
[11:30:35] RECOVERY - HTTP on install1002 is OK: HTTP OK: HTTP/1.1 302 Moved Temporarily - 381 bytes in 0.001 second response time [11:30:45] RECOVERY - TFTP service on install1002 is OK: PROCS OK: 1 process with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* [11:30:45] RECOVERY - Check whether ferm is active by checking the default input chain on install1002 is OK: OK ferm input default policy is set [11:31:55] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational [11:31:55] PROBLEM - MariaDB Slave Lag: s6 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6946.37 seconds [11:32:11] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Add db1122 (duration: 01m 16s) [11:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:49] I guess db1102 s6 is an expired downtime due to a schema change, marostegui? [11:57:15] PROBLEM - Disk space on logstash1007 is CRITICAL: Return code of 255 is out of bounds [11:57:35] PROBLEM - Check size of conntrack table on logstash1007 is CRITICAL: Return code of 255 is out of bounds [11:57:36] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused [11:57:55] PROBLEM - puppet last run on logstash1007 is CRITICAL: CRITICAL: Puppet has 11 failures. Last run 1 minute ago with 11 failures. Failed resources (up to 3 shown): Service[ssh],Service[exim4],Service[prometheus-elasticsearch-exporter],Service[kibana] [11:58:15] (CR) Mark Bergsma: [C: 1] "Ema indicated a preference for renaming .pooled to .pool, and not having .is_pooled at all as it's not actually needed in production code." 
[debs/pybal] - https://gerrit.wikimedia.org/r/421053 (owner: Mark Bergsma) [11:58:33] RECOVERY - Check size of conntrack table on logstash1007 is OK: OK: nf_conntrack is 0 % full [11:58:42] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 [12:00:03] RECOVERY - Disk space on logstash1007 is OK: DISK OK [12:02:53] RECOVERY - puppet last run on logstash1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:06:13] PROBLEM - logstash log4j TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 4560: Connection refused [12:07:13] RECOVERY - logstash log4j TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4560 [12:08:22] !log reimaging mw1251, mw1252, mw1253 (app servers) to stretch [12:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:48] (PS2) Muehlenhoff: Switch scap proxy in A7 to mw1268 [puppet] - https://gerrit.wikimedia.org/r/428655 [12:14:51] (CR) BBlack: [C: 1] VCL: only parse X-Connection-Properties if available [puppet] - https://gerrit.wikimedia.org/r/428580 (owner: Ema) [12:15:43] (CR) Muehlenhoff: [C: 2] Switch scap proxy in A7 to mw1268 [puppet] - https://gerrit.wikimedia.org/r/428655 (owner: Muehlenhoff) [12:15:45] (CR) BBlack: [C: 1] VCL: 400 on empty/unparseable Host header values [puppet] - https://gerrit.wikimedia.org/r/428594 (owner: Ema) [12:21:27] (PS2) EddieGP: mediawiki: Remove updateArticleCount cron [puppet] - https://gerrit.wikimedia.org/r/425968 (https://phabricator.wikimedia.org/T192139) [12:23:57] (CR) Ema: [C: 1] Rename server.pooled to .pool to indicate intent [debs/pybal] - https://gerrit.wikimedia.org/r/428900 (owner: Mark Bergsma) [12:25:31] (CR) Ema: [C: 1] Remove server.is_pooled as it isn't actually used [debs/pybal] - https://gerrit.wikimedia.org/r/428901 (owner: Mark 
Bergsma) [12:27:13] RECOVERY - MariaDB Slave Lag: s6 on db1102 is OK: OK slave_sql_lag Replication lag: 0.14 seconds [12:28:12] (PS2) Alexandros Kosiaris: Install params for proton[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/428902 (https://phabricator.wikimedia.org/T192983) [12:28:48] (CR) jerkins-bot: [V: -1] Install params for proton[12]00[12] [puppet] - https://gerrit.wikimedia.org/r/428902 (https://phabricator.wikimedia.org/T192983) (owner: Alexandros Kosiaris) [12:31:28] (CR) Alexandros Kosiaris: [V: 2 C: 2] "Removing jenkins-bots -1, it's about including standard, which is fine for now, we will undo it anyway soon" [puppet] - https://gerrit.wikimedia.org/r/428902 (https://phabricator.wikimedia.org/T192983) (owner: Alexandros Kosiaris) [12:34:16] PROBLEM - Disk space on mw1253 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:34:16] PROBLEM - nutcracker port on mw1253 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:35:56] PROBLEM - HHVM processes on mw1253 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:35:56] PROBLEM - nutcracker process on mw1253 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:37:45] PROBLEM - HHVM rendering on mw1253 is CRITICAL: connect to address 10.64.48.88 and port 80: Connection refused [12:37:45] PROBLEM - puppet last run on mw1253 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:40:05] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: 1.668e+04 ge 1.5e+04 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:41:05] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: (C)1.5e+04 ge (W)1e+04 ge 4781 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:42:55] PROBLEM - Apache HTTP on mw1253 is CRITICAL: connect to address 10.64.48.88 and port 80: Connection refused [12:43:55] ^ silencing [12:44:35] PROBLEM - Nginx local proxy to apache on mw1253 is CRITICAL: connect to address 10.64.48.88 and port 443: Connection refused [12:44:35] PROBLEM - Check size of conntrack table on mw1253 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:45:51] jynus: yeah, sorry, expired downtime on db1102 [12:46:09] Going to downtime it again [12:46:33] Ah, actually it finished [12:46:55] RECOVERY - Apache HTTP on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [12:47:09] !log reboot puppetdb1001 for T150532 [12:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:16] T150532: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532 [12:50:01] jouncebot, next [12:50:01] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180425T1300) [12:51:06] (CR) Vgutierrez: [C: 1] Remove server.is_pooled as it isn't actually used [debs/pybal] - https://gerrit.wikimedia.org/r/428901 (owner: Mark Bergsma) [12:51:35] (CR) Vgutierrez: [C: 1] Rename server.pooled to .pool to indicate intent [debs/pybal] - https://gerrit.wikimedia.org/r/428900 (owner: Mark Bergsma) [12:53:01] Operations, Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#4157400 (Aklapper) [12:55:05] RECOVERY - HHVM processes on mw1253 is OK: PROCS OK: 1 process 
with command name hhvm [12:55:25] RECOVERY - Disk space on mw1253 is OK: DISK OK [12:55:36] RECOVERY - Check size of conntrack table on mw1253 is OK: OK: nf_conntrack is 0 % full [12:55:45] !log starting elasticsearch codfw rolling restart for plugin update and NUMA config - T191543 / T191236 [12:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:52] T191543: Deploy updated search/extra plugin with Slovak Stemmer - https://phabricator.wikimedia.org/T191543 [12:55:53] T191236: Resolve elasticsearch latency alerts - https://phabricator.wikimedia.org/T191236 [12:57:45] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:58:45] RECOVERY - Nginx local proxy to apache on mw1253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 4.663 second response time [12:58:54] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 73517 bytes in 6.932 second response time [12:59:26] (PS1) Jcrespo: mariadb: Repool with low load db1090, db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/428913 (https://phabricator.wikimedia.org/T192979) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180425T1300). [13:00:04] Urbanecm and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:10] present [13:00:16] o/ [13:00:20] I can SWAT today [13:01:05] Amir1: you can start with your config change while I get ready, then you have backports, I guess we can deploy in parallel [13:01:16] zeljkof: cool [13:01:34] Amir1: let me know when you are done with the config change [13:01:42] sure [13:01:43] (03PS2) 10Ladsgroup: Remove xx-uca-fa for Persian Wikis except Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428626 [13:01:45] (03PS1) 10Filippo Giunchedi: Add puppetization for mcrouter_exporter [puppet] - 10https://gerrit.wikimedia.org/r/428914 (https://phabricator.wikimedia.org/T192763) [13:02:23] (03CR) 10jerkins-bot: [V: 04-1] Add puppetization for mcrouter_exporter [puppet] - 10https://gerrit.wikimedia.org/r/428914 (https://phabricator.wikimedia.org/T192763) (owner: 10Filippo Giunchedi) [13:02:40] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428626 (owner: 10Ladsgroup) [13:03:34] RECOVERY - nutcracker port on mw1253 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:03:56] (03Merged) 10jenkins-bot: Remove xx-uca-fa for Persian Wikis except Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428626 (owner: 10Ladsgroup) [13:04:03] (03PS2) 10Filippo Giunchedi: Add puppetization for mcrouter_exporter [puppet] - 10https://gerrit.wikimedia.org/r/428914 (https://phabricator.wikimedia.org/T192763) [13:04:05] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.123 second response time [13:04:14] RECOVERY - nutcracker process on mw1253 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [13:04:43] (03CR) 10jerkins-bot: [V: 04-1] Add puppetization for mcrouter_exporter [puppet] - 10https://gerrit.wikimedia.org/r/428914 (https://phabricator.wikimedia.org/T192763) (owner: 10Filippo Giunchedi) [13:05:21] (03CR) 10jenkins-bot: Remove 
xx-uca-fa for Persian Wikis except Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428626 (owner: 10Ladsgroup) [13:05:44] zeljkof: the config change is not correct, I need to make a follow up :/ [13:05:46] sorry [13:05:52] Amir1: ok [13:06:04] !log Deploy schema change on s2 codfw master (db2035) - this will generate lag on codfw - T191519 T188299 T190148 [13:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:12] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [13:06:12] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [13:06:12] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [13:06:14] (03PS7) 10Ema: VCL: only parse X-Connection-Properties if available [puppet] - 10https://gerrit.wikimedia.org/r/428580 [13:07:16] (03CR) 10Ema: [C: 032] VCL: only parse X-Connection-Properties if available [puppet] - 10https://gerrit.wikimedia.org/r/428580 (owner: 10Ema) [13:07:30] (03PS1) 10Ladsgroup: Use the right uca for Persian Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428916 [13:07:48] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428916 (owner: 10Ladsgroup) [13:09:12] (03Merged) 10jenkins-bot: Use the right uca for Persian Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428916 (owner: 10Ladsgroup) [13:12:07] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:428626|Remove xx-uca-fa for Persian Wikis except Wikipedia]] (duration: 01m 17s) [13:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:16] (03CR) 10jenkins-bot: Use the right uca for Persian Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428916 (owner: 10Ladsgroup) [13:12:53] zeljkof: done [13:14:00] Amir1: ok, deploying a couple of config changes, I guess you 
can merge your backports, they could take a while [13:14:17] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428878 (https://phabricator.wikimedia.org/T192898) (owner: 10Urbanecm) [13:14:31] Urbanecm: deploying throttle commit first, 428878 [13:14:32] zeljkof: well, they will fail cause phan fails on branches (the same old issue) [13:14:36] (03CR) 10Mark Bergsma: [C: 032] Introduce server.is_pooled and make server.pooled usage more consistent [debs/pybal] - 10https://gerrit.wikimedia.org/r/421053 (owner: 10Mark Bergsma) [13:14:49] Urbanecm: I will ping you when the second commit is at mwdebug [13:14:55] ack [13:15:08] (03Merged) 10jenkins-bot: Introduce server.is_pooled and make server.pooled usage more consistent [debs/pybal] - 10https://gerrit.wikimedia.org/r/421053 (owner: 10Mark Bergsma) [13:15:14] Amir1: uh oh, it's up to you :) [13:15:40] (03Merged) 10jenkins-bot: New throttle rule for cswiki Wikipedia event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428878 (https://phabricator.wikimedia.org/T192898) (owner: 10Urbanecm) [13:16:34] (03CR) 10Alexandros Kosiaris: [C: 032] Depool poolcounter1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428894 (https://phabricator.wikimedia.org/T150532) (owner: 10Alexandros Kosiaris) [13:16:38] (03CR) 10Ottomata: [C: 031] Add the possibility to configure UDF blacklist in Hive 2 server [puppet/cdh] - 10https://gerrit.wikimedia.org/r/428899 (owner: 10Elukey) [13:16:40] (03PS2) 10Zfilipin: Enable Mapframe for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428630 (https://phabricator.wikimedia.org/T192895) (owner: 10Urbanecm) [13:17:03] (03CR) 10Elukey: [C: 032] Add the possibility to configure UDF blacklist in Hive 2 server [puppet/cdh] - 10https://gerrit.wikimedia.org/r/428899 (owner: 10Elukey) [13:17:33] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:428878|New throttle rule for cswiki Wikipedia event (T192898)]] (duration: 01m 
16s) [13:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:40] T192898: Please lift the IP cap on 2018-05-03 - https://phabricator.wikimedia.org/T192898 [13:17:53] Urbanecm: 428878 deployed [13:17:57] ack [13:18:13] herron: a minor scap hiccup during eu swat today [13:18:28] ? [13:18:30] 13:17:29 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/throttle.php', 'mw1268.eqiad.wmnet', 'mw1284.eqiad.wmnet', 'mw1319.eqiad.wmnet', 'mw2290.codfw.wmnet', 'mw2215.codfw.wmnet', 'mw2254.codfw.wmnet', 'mw2187.codfw.wmnet', 'mw1250.eqiad.wmnet', 'mw1313.eqiad.wmnet'] on mw1230.eqiad.wmnet returned [255]: Host key verification failed. [13:18:31] (03CR) 10jenkins-bot: New throttle rule for cswiki Wikipedia event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428878 (https://phabricator.wikimedia.org/T192898) (owner: 10Urbanecm) [13:18:36] (03CR) 10jenkins-bot: Depool poolcounter1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428894 (https://phabricator.wikimedia.org/T150532) (owner: 10Alexandros Kosiaris) [13:18:39] 13:17:29 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/throttle.php', 'mw1268.eqiad.wmnet', 'mw1284.eqiad.wmnet', 'mw1319.eqiad.wmnet', 'mw2290.codfw.wmnet', 'mw2215.codfw.wmnet', 'mw2254.codfw.wmnet', 'mw2187.codfw.wmnet', 'mw1250.eqiad.wmnet', 'mw1313.eqiad.wmnet'] on mw1228.eqiad.wmnet returned [255]: Host key verification failed. [13:19:14] !log akosiaris@tin Synchronized wmf-config/ProductionServices.php: depool poolcounter1001 T150532 (duration: 01m 17s) [13:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:20] T150532: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532 [13:19:21] zeljkof: one of the scap proxies changed earlier, but puppet should have fixed that by now? 
[13:19:31] zeljkof: https://tools.wmflabs.org/sal/log/AWLS9gcSCdtJF08990fa [13:19:39] zeljkof: ah, no. wait [13:19:54] some of these hosts are in the reimaging process from what I gather? [13:19:57] herron, akosiaris, moritzm: should I continue with scap? or wait? [13:20:10] (03PS3) 10Filippo Giunchedi: Add puppetization for mcrouter_exporter [puppet] - 10https://gerrit.wikimedia.org/r/428914 (https://phabricator.wikimedia.org/T192763) [13:20:20] yeah, mw1230 was reimaged earlier, but seems there was a problem with wmf-reimage, I'll set it as deactivated [13:20:21] Amir1: did you get the same error messages while deploying? [13:20:29] zeljkof: give me a minute, then you can proceed [13:20:31] nope, it was fine [13:20:41] moritzm: ok, waiting cc Urbanecm [13:21:07] moritzm, zeljkof, what's happening? Something's broken? [13:21:23] (03CR) 10Jcrespo: [C: 032] mariadb: Repool with low load db1090, db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428913 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [13:21:28] zeljkof: please try again, I've set mw1228-mw1230 as deactivated [13:21:43] Urbanecm: a couple of servers say Host key verification failed. 
[13:21:44] jynus: wait, don't deploy that yet [13:21:58] scap issues [13:22:05] moritzm: ok, deploying [13:22:06] akosiaris: waiting [13:22:06] Ok, let's wait, plenty of time :) [13:22:36] Urbanecm: mw1228-mw1230 were reimaged, but it seems wmf-auto-reimage had a problem with the IPMI command triggering the reboot [13:23:01] but they are no longer considered by scap for now, so should be fine now [13:23:32] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:428878|New throttle rule for cswiki Wikipedia event (T192898)]] (duration: 01m 16s) [13:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:41] T192898: Please lift the IP cap on 2018-05-03 - https://phabricator.wikimedia.org/T192898 [13:23:58] akosiaris, moritzm: no problems, thanks, continuing with swat [13:24:03] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428630 (https://phabricator.wikimedia.org/T192895) (owner: 10Urbanecm) [13:24:10] ok, thanks! [13:24:11] jynus: ^ [13:24:27] zeljkof: ack, sorry for the interruption [13:24:40] Urbanecm: merging 428630, will ping you when it's at mwdebug [13:24:45] ack [13:24:59] moritzm: no problem, thanks for the quick help! 
[13:25:04] (03PS1) 10Elukey: profile::hive::client: blacklist a UDF builtin for CVE-2018-1284 [puppet] - 10https://gerrit.wikimedia.org/r/428919 [13:25:20] (03Merged) 10jenkins-bot: Enable Mapframe for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428630 (https://phabricator.wikimedia.org/T192895) (owner: 10Urbanecm) [13:26:09] Urbanecm: 428630 is at mwdebug [13:26:22] (03CR) 10jenkins-bot: Enable Mapframe for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428630 (https://phabricator.wikimedia.org/T192895) (owner: 10Urbanecm) [13:26:27] zeljkof, going to test [13:27:21] (03CR) 10Mark Bergsma: Rename server.pooled to .pool to indicate intent (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/428900 (owner: 10Mark Bergsma) [13:27:28] (03PS1) 10Filippo Giunchedi: Initial debianization [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 [13:27:33] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11034/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/428919 (owner: 10Elukey) [13:28:37] zeljkof, working, please deploy [13:28:44] Urbanecm: deploying [13:29:43] ack [13:30:01] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:428630|Enable Mapframe for bgwiki (T192895)]] (duration: 01m 15s) [13:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:07] 10Operations, 10monitoring, 10Patch-For-Review, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4157551 (10faidon) 05Resolved>03Open My two cents: - I don't see this hiera knob used anywhere in the tree right now; has anyone expressed interest in using it in i... 
[13:30:07] T192895: Enable Kartographer on the Bulgarian Wikipedia - https://phabricator.wikimedia.org/T192895 [13:30:21] (03PS2) 10Elukey: profile::hive::client: blacklist a UDF builtin for CVE-2018-1284 [puppet] - 10https://gerrit.wikimedia.org/r/428919 [13:30:34] Urbanecm: 428630 is deployed, please check and thanks for deploying with #releng! :) [13:30:38] Amir1: swat is yours [13:30:49] \o/ [13:30:51] Thanks! [13:30:57] Working, thank you for the deploy! [13:31:35] (03PS2) 10Mark Bergsma: Rename server.pooled to .pool to indicate intent [debs/pybal] - 10https://gerrit.wikimedia.org/r/428900 [13:31:37] (03PS2) 10Mark Bergsma: Remove server.is_pooled as it isn't actually used [debs/pybal] - 10https://gerrit.wikimedia.org/r/428901 [13:33:48] (03CR) 10Mark Bergsma: [C: 032] Rename server.pooled to .pool to indicate intent [debs/pybal] - 10https://gerrit.wikimedia.org/r/428900 (owner: 10Mark Bergsma) [13:34:12] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11035/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/428919 (owner: 10Elukey) [13:34:19] (03Merged) 10jenkins-bot: Rename server.pooled to .pool to indicate intent [debs/pybal] - 10https://gerrit.wikimedia.org/r/428900 (owner: 10Mark Bergsma) [13:34:33] (03CR) 10Elukey: [C: 032] profile::hive::client: blacklist a UDF builtin for CVE-2018-1284 [puppet] - 10https://gerrit.wikimedia.org/r/428919 (owner: 10Elukey) [13:38:14] (03CR) 10Gehel: Set SPARQL services to use internal cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [13:38:55] (03PS2) 10Ottomata: Set PXE boot to Debian Stretch for kafka[12]00[123] [puppet] - 10https://gerrit.wikimedia.org/r/428575 (https://phabricator.wikimedia.org/T192832) (owner: 10Elukey) [13:40:22] (03PS7) 10Ema: VCL: 400 on empty/unparseable Host header values [puppet] - 10https://gerrit.wikimedia.org/r/428594 [13:40:46] !log reboot 
poolcounter1001 for T150532 [13:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:54] T150532: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532 [13:40:55] (03CR) 10Ottomata: [C: 032] Set PXE boot to Debian Stretch for kafka[12]00[123] [puppet] - 10https://gerrit.wikimedia.org/r/428575 (https://phabricator.wikimedia.org/T192832) (owner: 10Elukey) [13:42:38] 10Operations, 10monitoring, 10Patch-For-Review, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4157574 (10Marostegui) I would prefer option #2 (remove atop). My reasoning for it is that we now have to remove "-R" from it, but what could happen in the future? Mayb... [13:43:23] !log ladsgroup@tin Synchronized php-1.31.0-wmf.30/extensions/Wikibase/lib/includes/Changes: [[gerrit:428907|Make sure statements in EntityDiffChangedAspects are not passed around as stdClass (T192085)]] (duration: 01m 17s) [13:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:29] T192085: PHP Fatal in AffectedPagesFinder::getChangedAspects - https://phabricator.wikimedia.org/T192085 [13:43:55] 10Operations, 10monitoring, 10Patch-For-Review, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4157581 (10jcrespo) I tried to do the least amount of impact regarding atop, and offer a way to enable it to who could complain about it. If I was the one to decide, I... 
[13:44:21] I need some minutes to make sure this doesn't make the infra fall over [13:44:59] (03PS8) 10Ema: VCL: 400 on empty/unparseable Host header values [puppet] - 10https://gerrit.wikimedia.org/r/428594 [13:45:47] (03CR) 10Ema: [C: 032] VCL: 400 on empty/unparseable Host header values [puppet] - 10https://gerrit.wikimedia.org/r/428594 (owner: 10Ema) [13:45:49] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Depool poolcounter1001" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428896 (https://phabricator.wikimedia.org/T150532) (owner: 10Alexandros Kosiaris) [13:46:41] (03PS2) 10Alexandros Kosiaris: Add poolcounter1003 to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428897 (https://phabricator.wikimedia.org/T187297) [13:47:03] (03Merged) 10jenkins-bot: Revert "Depool poolcounter1001" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428896 (https://phabricator.wikimedia.org/T150532) (owner: 10Alexandros Kosiaris) [13:47:57] (03CR) 10jerkins-bot: [V: 04-1] Add poolcounter1003 to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428897 (https://phabricator.wikimedia.org/T187297) (owner: 10Alexandros Kosiaris) [13:48:06] 10Operations, 10monitoring, 10Patch-For-Review, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (10akosiaris) > If I was the one to decide, I would personally remove it from everywhere, too. FWIW, this has my +1. 
[13:49:48] !log akosiaris@tin Synchronized wmf-config/ProductionServices.php: repool poolcounter1001 T150532 (duration: 01m 16s) [13:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:55] T150532: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532 [13:51:12] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4032605 (10Pchelolo) [13:51:38] (03PS3) 10Alexandros Kosiaris: Add poolcounter1003 to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428897 (https://phabricator.wikimedia.org/T187297) [13:53:20] !log ladsgroup@tin Synchronized php-1.32.0-wmf.1/extensions/Wikibase/lib/includes/Changes: [[gerrit:428908|Make sure statements in EntityDiffChangedAspects are not passed around as stdClass (T192085)]] (duration: 01m 16s) [13:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:26] T192085: PHP Fatal in AffectedPagesFinder::getChangedAspects - https://phabricator.wikimedia.org/T192085 [13:53:48] (03CR) 10Alexandros Kosiaris: [C: 032] Add poolcounter1003 to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428897 (https://phabricator.wikimedia.org/T187297) (owner: 10Alexandros Kosiaris) [13:53:49] !log EU SWAT is done! [13:53:51] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#4157629 (10Jgreen) No display output after the host started pxeboot sequence, turns out it needed "Redirection After Boot" enabled in BIOS. 
[13:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:59] (03Merged) 10jenkins-bot: Add poolcounter1003 to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428897 (https://phabricator.wikimedia.org/T187297) (owner: 10Alexandros Kosiaris) [13:57:18] !log akosiaris@tin Synchronized wmf-config/ProductionServices.php: pool poolcounter1003 T187297 (duration: 01m 16s) [13:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:26] T187297: VM for poolcounter1002 - https://phabricator.wikimedia.org/T187297 [13:57:35] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4157635 (10Marostegui) [13:58:09] (03CR) 10Muehlenhoff: "Looks good, some random comments" (035 comments) [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 (owner: 10Filippo Giunchedi) [14:02:55] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157665 (10dcausse) I don't have strong opinions on which wikis we should migrate next. My sole concern right now is regarding write freezes when we resta... [14:02:59] (03PS1) 10Muehlenhoff: Reimage mwdebug servers with stretch [puppet] - 10https://gerrit.wikimedia.org/r/428923 (https://phabricator.wikimedia.org/T174431) [14:04:03] 10Operations, 10vm-requests, 10Patch-For-Review: VM for poolcounter1002 - https://phabricator.wikimedia.org/T187297#4157669 (10akosiaris) 05Open>03Resolved a:03akosiaris poolcounter1003 is up and running fine and serving connections for the mediawiki fleet. I'll resolve this and create a decom task fo... 
[14:04:18] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4157672 (10jcrespo) [14:05:37] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157685 (10mobrovac) >>! In T189137#4157665, @dcausse wrote: > I don't have strong opinions on which wikis we should migrate next. group1 could be a good... [14:05:41] (03PS1) 10Alexandros Kosiaris: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428924 (https://phabricator.wikimedia.org/T193025) [14:08:56] (03PS1) 10Alexandros Kosiaris: Assign role spare to poolcounter1002 [puppet] - 10https://gerrit.wikimedia.org/r/428925 (https://phabricator.wikimedia.org/T193025) [14:09:28] (03CR) 10Alexandros Kosiaris: [C: 032] Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428924 (https://phabricator.wikimedia.org/T193025) (owner: 10Alexandros Kosiaris) [14:09:32] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157701 (10Pchelolo) The subtasks that were created to fix issues discovered during the first iteration of the switch were resolved, and I don't see any lo... [14:09:36] (03CR) 10Hoo man: Set SPARQL services to use internal cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428722 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [14:10:11] 10Operations, 10Code-Stewardship-Reviews, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4157703 (10faidon) @danstillman this is very useful information (and good news!), thank you for the detailed update! It still seems like the option... 
[14:10:43] (03Merged) 10jenkins-bot: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428924 (https://phabricator.wikimedia.org/T193025) (owner: 10Alexandros Kosiaris) [14:11:31] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157706 (10dcausse) >>! In T189137#4157685, @mobrovac wrote: >>>! In T189137#4157665, @dcausse wrote: >> I don't have strong opinions on which wikis we sho... [14:12:05] (03CR) 10Alexandros Kosiaris: [C: 032] Assign role spare to poolcounter1002 [puppet] - 10https://gerrit.wikimedia.org/r/428925 (https://phabricator.wikimedia.org/T193025) (owner: 10Alexandros Kosiaris) [14:12:16] (03PS1) 10Ottomata: Add IPv6 entries for kafka[12]00[123] [dns] - 10https://gerrit.wikimedia.org/r/428926 (https://phabricator.wikimedia.org/T192832) [14:12:47] !log akosiaris@tin Synchronized wmf-config/ProductionServices.php: depool poolcounter1002 T193025 (duration: 01m 16s) [14:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:53] T193025: Decommision poolcounter1002 - https://phabricator.wikimedia.org/T193025 [14:13:12] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157711 (10mobrovac) Given the numbers above, going with everything but enwiki, wikidata and commons should be a good next round. [14:14:19] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157713 (10Pchelolo) > When we freeze writes we start to push ElasticaWrite jobs that contain the full page doc which can be relatively large. We had to ra... 
[14:16:16] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommision poolcounter1002 - https://phabricator.wikimedia.org/T193025#4157673 (10akosiaris) [14:16:58] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4032605 (10Ottomata) I already feel like 4Mb messages are a lot, and would much prefer not to increase the max message size more. Can these jobs be split up? [14:17:21] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157722 (10dcausse) >>! In T189137#4157713, @Pchelolo wrote: >> When we freeze writes we start to push ElasticaWrite jobs that contain the full page doc wh... [14:25:27] (03CR) 10jenkins-bot: Revert "Depool poolcounter1001" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428896 (https://phabricator.wikimedia.org/T150532) (owner: 10Alexandros Kosiaris) [14:25:33] (03CR) 10jenkins-bot: Add poolcounter1003 to $wmfAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428897 (https://phabricator.wikimedia.org/T187297) (owner: 10Alexandros Kosiaris) [14:25:38] (03CR) 10jenkins-bot: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428924 (https://phabricator.wikimedia.org/T193025) (owner: 10Alexandros Kosiaris) [14:26:23] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157730 (10Gehel) >>! In T189137#4157706, @dcausse wrote: >>>! In T189137#4157685, @mobrovac wrote: >>>>! In T189137#4157665, @dcausse wrote: >>> My sole c... [14:26:27] (03PS1) 10Ottomata: Add add_ip6_mapped to main-codfw hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/428928 (https://phabricator.wikimedia.org/T192832) [14:26:45] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157731 (10Pchelolo) > If there is a way to monitor such errors I guess we can pick-up known large pages and modify them while the write are frozen? There... [14:26:58] (03CR) 10jerkins-bot: [V: 04-1] Add add_ip6_mapped to main-codfw hosts. [puppet] - 10https://gerrit.wikimedia.org/r/428928 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [14:32:59] !log cp3030: upgrade varnish to 5.1.3-1wm7 T192368 [14:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:06] T192368: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368 [14:34:12] !log reboot bohrium T150532 [14:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:19] T150532: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532 [14:34:57] 10Operations, 10Patch-For-Review: Upgrade qemu on ganeti clusters to 2.8 - https://phabricator.wikimedia.org/T150532#4157757 (10akosiaris) 05Open>03Resolved a:03akosiaris And we are at qemu 2.8 and this can finally be closed. [14:35:12] (03PS1) 10Jcrespo: standard_packages: Remove atop for every WMF machine [puppet] - 10https://gerrit.wikimedia.org/r/428930 (https://phabricator.wikimedia.org/T192551) [14:35:24] is mediawiki deploy free again? 
[14:35:33] yes [14:35:41] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 4 VM request for pdf-render/proton - https://phabricator.wikimedia.org/T192983#4157761 (10akosiaris) p:05Triage>03Normal [14:36:57] 10Operations, 10monitoring, 10Patch-For-Review, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4157762 (10jcrespo) a:05jcrespo>03faidon Created T192551, because as I said, the problem was not technical. [14:36:59] !log restart hive-server2 on analytics1003 to pick up settings in https://gerrit.wikimedia.org/r/428919 [14:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:20] 10Operations, 10monitoring, 10Patch-For-Review, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4157767 (10jcrespo) I meant https://gerrit.wikimedia.org/r/428930 [14:39:19] (03CR) 10Faidon Liambotis: [C: 031] standard_packages: Remove atop for every WMF machine [puppet] - 10https://gerrit.wikimedia.org/r/428930 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [14:39:37] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" (035 comments) [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 (owner: 10Filippo Giunchedi) [14:39:53] (03PS2) 10Filippo Giunchedi: Initial debianization [debs/prometheus-mcrouter-exporter] - 10https://gerrit.wikimedia.org/r/428920 (https://phabricator.wikimedia.org/T192763) [14:40:33] (03PS1) 10Elukey: cassandra: add percentile metrics to 2.2's prometheus jmx config [puppet] - 10https://gerrit.wikimedia.org/r/428931 (https://phabricator.wikimedia.org/T193017) [14:41:43] (03CR) 10Imarlier: graphite: allow data requests from performance.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428836 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [14:42:41] (03PS2) 10Elukey: cassandra: add percentile metrics to 2.x's prometheus jmx config [puppet] - 
10https://gerrit.wikimedia.org/r/428931 (https://phabricator.wikimedia.org/T193017) [14:46:07] (03CR) 10Muehlenhoff: standard_packages: Remove atop for every WMF machine (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428930 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [14:48:57] (03PS2) 10Muehlenhoff: Switch scap proxy in B6 to mw1285 [puppet] - 10https://gerrit.wikimedia.org/r/428683 [14:49:37] (03PS1) 10Cmjohnson: Removing db1039 site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/428932 (https://phabricator.wikimedia.org/T184262) [14:50:10] (03CR) 10Cmjohnson: [C: 032] Removing db1039 site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/428932 (https://phabricator.wikimedia.org/T184262) (owner: 10Cmjohnson) [14:51:19] (03CR) 10Muehlenhoff: [C: 032] Switch scap proxy in B6 to mw1285 [puppet] - 10https://gerrit.wikimedia.org/r/428683 (owner: 10Muehlenhoff) [14:51:24] (03PS3) 10Muehlenhoff: Switch scap proxy in B6 to mw1285 [puppet] - 10https://gerrit.wikimedia.org/r/428683 [14:52:23] (03PS2) 10Jcrespo: standard_packages: Remove atop from every WMF machine [puppet] - 10https://gerrit.wikimedia.org/r/428930 (https://phabricator.wikimedia.org/T192551) [14:52:43] (03CR) 10Jcrespo: standard_packages: Remove atop from every WMF machine (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428930 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [14:53:13] (03PS1) 10Ottomata: Blacklist job and change-prop topics from lag check for main -> analytics [puppet] - 10https://gerrit.wikimedia.org/r/428933 [14:53:29] (03PS2) 10Ottomata: Blacklist job and change-prop topics from lag check for main -> analytics [puppet] - 10https://gerrit.wikimedia.org/r/428933 [14:53:37] (03CR) 10Marostegui: [C: 031] standard_packages: Remove atop from every WMF machine [puppet] - 10https://gerrit.wikimedia.org/r/428930 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [14:54:35] (03CR) 10Ottomata: [C: 032] Blacklist job and 
change-prop topics from lag check for main -> analytics [puppet] - 10https://gerrit.wikimedia.org/r/428933 (owner: 10Ottomata) [14:57:34] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommision poolcounter1002 - https://phabricator.wikimedia.org/T193025#4157805 (10akosiaris) [14:58:53] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157809 (10Pchelolo) I've run some analysis on the logs and indeed sometimes the `cirrusSearchElasticWrite` is too large. Here're the sizes in bytes for al... [15:00:01] (03PS1) 10Muehlenhoff: Switch scap proxy for D5 to mw1251 [puppet] - 10https://gerrit.wikimedia.org/r/428934 [15:00:14] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/428931 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [15:02:25] (03PS1) 10Andrew Bogott: nova: Add labvirt1016 to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/428935 [15:02:27] (03PS1) 10Andrew Bogott: nova: repool labvirt1015 [puppet] - 10https://gerrit.wikimedia.org/r/428936 (https://phabricator.wikimedia.org/T192422) [15:03:11] (03PS2) 10Ottomata: Add add_ip6_mapped to main-codfw hosts. [puppet] - 10https://gerrit.wikimedia.org/r/428928 (https://phabricator.wikimedia.org/T192832) [15:03:37] (03CR) 10jerkins-bot: [V: 04-1] Add add_ip6_mapped to main-codfw hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/428928 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [15:04:02] (03PS2) 10Andrew Bogott: nova: Add labvirt1016 to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/428935 [15:04:07] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.079 second response time [15:04:17] !log adding labvirt1016 to the nova-compute scheduling pool [15:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:50] (03CR) 10Andrew Bogott: [C: 032] nova: Add labvirt1016 to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/428935 (owner: 10Andrew Bogott) [15:05:23] !log temp disabling puppet, applying ipv6 mapped on kafka200* [15:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:43] (03CR) 10Ottomata: [V: 032 C: 032] Add add_ip6_mapped to main-codfw hosts. [puppet] - 10https://gerrit.wikimedia.org/r/428928 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [15:06:00] (03PS3) 10Ottomata: Add add_ip6_mapped to main-codfw hosts. [puppet] - 10https://gerrit.wikimedia.org/r/428928 (https://phabricator.wikimedia.org/T192832) [15:06:02] (03CR) 10Ottomata: [V: 032 C: 032] Add add_ip6_mapped to main-codfw hosts. [puppet] - 10https://gerrit.wikimedia.org/r/428928 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [15:12:01] 10Operations, 10monitoring, 10Patch-For-Review, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4157908 (10Dzahn) >>! In T192551#4150850, @Dzahn wrote: > +1 to remove the daemon/cron, keeping the package itself. I only said to keep the package out of a similar mo... 
[15:13:40] (03PS2) 10Andrew Bogott: nova: repool labvirt1015 [puppet] - 10https://gerrit.wikimedia.org/r/428936 (https://phabricator.wikimedia.org/T192422) [15:13:51] (03PS3) 10Elukey: cassandra: add percentile metrics to 2.x's prometheus jmx config [puppet] - 10https://gerrit.wikimedia.org/r/428931 (https://phabricator.wikimedia.org/T193017) [15:14:18] (03CR) 10Andrew Bogott: [C: 032] nova: repool labvirt1015 [puppet] - 10https://gerrit.wikimedia.org/r/428936 (https://phabricator.wikimedia.org/T192422) (owner: 10Andrew Bogott) [15:14:37] (03PS4) 10Elukey: cassandra: add percentile metrics to 2.x's prometheus jmx config [puppet] - 10https://gerrit.wikimedia.org/r/428931 (https://phabricator.wikimedia.org/T193017) [15:14:56] gehel: o/ [15:15:07] elukey: \o [15:15:30] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157931 (10Ottomata) I don't love it! I feel like 4Mb is already huge. Consider troubleshooting some problem with `kafkacat -C | jq .`. Gotta consume a... [15:15:50] gehel: do you have anything against https://gerrit.wikimedia.org/r/428931 ? [15:15:59] not sure if you guys are using the jmx exporter for cassandra [15:16:02] in the maps cluster [15:16:55] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/428931 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [15:17:27] elukey: not sure I parsed that regex correctly... 
[15:17:53] elukey: I don't look at those metrics as much as I should, so feel free to break whatever is on the maps side, and I'll fix it when I need it [15:17:57] s/when/if/ [15:19:56] PROBLEM - Kafka Broker Replica Max Lag on kafka1001 is CRITICAL: 5.179e+05 ge 5e+05 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [15:19:57] PROBLEM - Kafka Broker Replica Max Lag on kafka1003 is CRITICAL: 5.067e+05 ge 5e+05 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [15:20:26] gehel: sorry I missed to add some context :) I am working on finding a way to use the same dashboards for all the cassandra clusters, since 3.x changed metric names (Sigh). The one that I am adding is a copy from the 3.x one, that wasn't "backported" afaict [15:20:56] task is T193017 [15:20:56] T193017: Unify, if possible, AQS and Restbase's cassandra dashboards - https://phabricator.wikimedia.org/T193017 [15:21:11] ^^^ ? looking [15:21:15] 1001? [15:21:22] mabye more elasticawrite...? [15:21:40] ottomata: might be related to the cluster restart in progress... [15:22:21] !log Running populateRevisionLength.php on group 0 for T192189 [15:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:27] T192189: RevisionArchiveRecord incorrectly changes null ar_len to 0 - https://phabricator.wikimedia.org/T192189 [15:22:34] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4158003 (10Pchelolo) > Consider troubleshooting some problem with kafkacat -C | jq . Haha :) > That said, I'm not opposed, as I don't know of any practic... 
[15:22:37] gehel: https://grafana-admin.wikimedia.org/dashboard/db/kafka?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-cluster=eventbus&var-kafka_broker=All&from=now-6h&to=now [15:22:40] i think we have to fix this [15:22:53] when this happens, there is a huge jump in large messages [15:23:35] ottomata: looks like it matches with https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&from=now-6h&to=now&var-cluster=codfw&var-smoothing=1&panelId=64&fullscreen&refresh=1m [15:24:12] so yes, elasticsearch is probably the culprit (cc: ebernhardson dcausse) [15:24:23] (03PS2) 10Jcrespo: mariadb: Repool with low load db1090, db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428913 (https://phabricator.wikimedia.org/T192979) [15:24:33] (03CR) 10Jcrespo: [C: 032] mariadb: Repool with low load db1090, db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428913 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:25:09] ottomata: that's the first time I see this issue during an elasticsearch cluster restart (I might just have missed it all other times) [15:25:57] gehel: not sure if it is always restart, but we've seen really bursty message sizes from elasticwrite over the last few days-weekish [15:26:15] (03Merged) 10jenkins-bot: mariadb: Repool with low load db1090, db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428913 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:26:42] ottomata: we're going to need input from dcausse / ebernhardson on this [15:26:46] aye [15:27:01] i think they'we are already kinda talking about it https://phabricator.wikimedia.org/T189137#4157731 [15:27:11] it isn't super urgent, but will likely keep making alerts flap [15:27:22] (oh you are on that ticket too) [15:27:23] :) [15:27:35] gehel: shall I disable puppet on maps* and then let you run/check in there first? 
[15:27:46] yep, seems at least related [15:28:21] elukey: please do! In a meeting, but I'll check / re-enable soon [15:28:41] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1122, db1090 with low load (duration: 01m 14s) [15:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:10] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4158048 (10Cmjohnson) [15:30:55] (03CR) 10jenkins-bot: mariadb: Repool with low load db1090, db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428913 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [15:31:18] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4116638 (10Cmjohnson) a:05Cmjohnson>03Marostegui @marostegui db1120 is fixed, i had the ethernet cable in the wrong port :-(. Assigning to you and removing ops-eqiad tag [15:33:11] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4158084 (10Marostegui) [15:33:14] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4158081 (10Marostegui) 05Open>03Resolved a:05Marostegui>03Cmjohnson Confirmed db1120 looks good! Thanks @Cmjohnson! [15:33:29] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4158086 (10Marostegui) [15:35:15] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Create a prometheus exporter for mcrouter - https://phabricator.wikimedia.org/T192763#4158093 (10fgiunchedi) I sent some changes upstream that I think would be beneficial, https://github.com/Dev2... 
[15:36:25] (03CR) 10Andrew Bogott: "Reedy suggests that we might still need libvips-tools -- that include should probably be moved elsewhere" [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [15:38:47] 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4158122 (10Cmjohnson) [15:47:35] (03PS1) 10Jcrespo: flaggedreviews-maintenance: Avoid cronspam by sending error output to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/428942 (https://phabricator.wikimedia.org/T192340) [15:47:53] (03PS2) 10Jcrespo: flaggedreviews-maintenance: Avoid cronspam by sending error output to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/428942 (https://phabricator.wikimedia.org/T192340) [15:48:12] ottomata: I still need to continue that cluster restart, and I don't have a quick fix to not send large documents to elasticawrite... how bad is it on the kafka side? [15:48:19] (03CR) 10jerkins-bot: [V: 04-1] flaggedreviews-maintenance: Avoid cronspam by sending error output to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/428942 (https://phabricator.wikimedia.org/T192340) (owner: 10Jcrespo) [15:49:24] if there is a user on wikitech wiki and it says in the logs "has been created automatically" this means they are logging in with a SUL user, right? (they can now on wikitech?) but they don't have an LDAP user. is that right? 
[15:49:44] the user keeps insisting they have an LDAP user (wikitech user) but i can't find them anywhere in LDAP [15:49:49] (03PS3) 10Jcrespo: flaggedreviews-maintenance: Avoid cronspam by sending error output to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/428942 (https://phabricator.wikimedia.org/T192340) [15:51:28] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4158225 (10akosiaris) [15:51:31] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 4 VM request for pdf-render/proton - https://phabricator.wikimedia.org/T192983#4158222 (10akosiaris) 05Open>03Resolved a:03akosiaris VMs are up and running, but without a role yet applied. Resolving this [15:52:29] (03CR) 10Alexandros Kosiaris: [C: 031] flaggedreviews-maintenance: Avoid cronspam by sending error output to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/428942 (https://phabricator.wikimedia.org/T192340) (owner: 10Jcrespo) [15:53:19] 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4158245 (10ayounsi) [15:53:57] RECOVERY - Kafka Broker Replica Max Lag on kafka1001 is OK: (C)5e+05 ge (W)1e+05 ge 9.166e+04 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [15:54:07] RECOVERY - Kafka Broker Replica Max Lag on kafka1003 is OK: (C)5e+05 ge (W)1e+05 ge 9.686e+04 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [15:54:48] (03CR) 10Jcrespo: [C: 032] flaggedreviews-maintenance: Avoid cronspam by sending error output to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/428942 (https://phabricator.wikimedia.org/T192340) (owner: 10Jcrespo) [15:54:54] 
10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4158258 (10Cmjohnson) [15:55:36] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4158271 (10Marostegui) >>! In T187962#4119429, @Marostegui wrote: >>>! In T187962#4119423, @jcrespo wrote: >> I would honestly move x1 replica (or the master d... [15:57:25] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4158278 (10jcrespo) I agree, first one will probably be a direct decommision, but next one could be used for that. [16:00:00] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324#4158304 (10jcrespo) [16:00:49] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/428931 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [16:01:48] 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4158327 (10ayounsi) [16:08:02] (03PS1) 10Filippo Giunchedi: profile: install SMART checks after 'raid' fact is available. [puppet] - 10https://gerrit.wikimedia.org/r/428947 [16:09:14] !log re-imaging mw2258, mw2163, mw2164 ff. [16:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:08] (03PS2) 10Dzahn: admins: add arlolra, cscott to releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427954 (https://phabricator.wikimedia.org/T192684) [16:13:52] (03CR) 10Dzahn: [C: 032] "as requested by subbu" [puppet] - 10https://gerrit.wikimedia.org/r/427954 (https://phabricator.wikimedia.org/T192684) (owner: 10Dzahn) [16:19:38] (03PS2) 10Filippo Giunchedi: profile: install SMART checks after 'raid' fact is available. 
[puppet] - 10https://gerrit.wikimedia.org/r/428947 (https://phabricator.wikimedia.org/T132324) [16:20:00] (03CR) 10BryanDavis: Don't include mediawiki::multimedia on labweb* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [16:20:05] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: add arlo and scott to parsoid releasers admin group - https://phabricator.wikimedia.org/T192684#4158416 (10Dzahn) [releases1001:~] $ id arlolra uid=3381(arlolra) gid=500(wikidev) groups=500(wikidev),802(releasers-parsoid) [releases1001:~] $ id cscott u... [16:20:30] 10Operations, 10Parsoid, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide an archive endpoint for older Parsoid debs (on releases.wikimedia.org or elsewhere) - https://phabricator.wikimedia.org/T150672#4158420 (10Dzahn) [16:20:34] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: add arlo and scott to parsoid releasers admin group - https://phabricator.wikimedia.org/T192684#4158417 (10Dzahn) 05Open>03Resolved [16:20:49] (03PS1) 10Ladsgroup: Change fawiki's uca to the right one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428951 [16:20:59] (03CR) 10jerkins-bot: [V: 04-1] Change fawiki's uca to the right one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428951 (owner: 10Ladsgroup) [16:21:25] 10Operations, 10Parsoid, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide an archive endpoint for older Parsoid debs (on releases.wikimedia.org or elsewhere) - https://phabricator.wikimedia.org/T150672#2792988 (10Dzahn) > 12:41 < subbu> could you add arlo and scott to that grou... 
[16:21:53] (03PS4) 10Dzahn: admins: create shell account for mepps, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/427944 (https://phabricator.wikimedia.org/T192472) [16:24:04] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#3561778 (10fgiunchedi) While investigating cronspam from recent reimages I took a look at mw1247 (for example) and noticed it has two disks but no software... [16:24:32] mutante moritzm ^ FYI [16:27:20] (03PS1) 10Urbanecm: Add all Hindi projects as import sources for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428952 (https://phabricator.wikimedia.org/T188366) [16:29:18] (03CR) 10Filippo Giunchedi: "This isn't yielding the result I want according to PCC, https://puppet-compiler.wmflabs.org/compiler02/11036/" [puppet] - 10https://gerrit.wikimedia.org/r/428947 (https://phabricator.wikimedia.org/T132324) (owner: 10Filippo Giunchedi) [16:30:39] godog: thanks! i'll check the ones i reinstalled. first one i have says "# / was on /dev/md1 during installation [16:32:25] sda (sda1, sda2), sdb (sdb1, sdb2), md0, md1 are all in /proc/partitions [16:32:44] mutante: ack, thanks! yeah only some hosts are affected iirc, there's a list in the task I linked [16:33:17] yep, i saw the list. 
i'll just go through the ones i reinstalled and check [16:34:10] sweet [16:35:14] (03PS2) 10Ladsgroup: Change fawiki's uca to the right one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428951 [16:36:23] (03PS5) 10Elukey: cassandra: add percentile metrics to 2.x's prometheus jmx config [puppet] - 10https://gerrit.wikimedia.org/r/428931 (https://phabricator.wikimedia.org/T193017) [16:36:51] (03CR) 10Elukey: [C: 032] cassandra: add percentile metrics to 2.x's prometheus jmx config [puppet] - 10https://gerrit.wikimedia.org/r/428931 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [16:39:57] (03PS5) 10Dzahn: admins: create shell account for mepps, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/427944 (https://phabricator.wikimedia.org/T192472) [16:40:45] (03CR) 10Dzahn: [C: 032] admins: create shell account for mepps, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/427944 (https://phabricator.wikimedia.org/T192472) (owner: 10Dzahn) [16:41:19] (03PS1) 10Urbanecm: Fix pixelization of new wiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428953 (https://phabricator.wikimedia.org/T193028) [16:44:15] gehel: merge done, aqs looks good from what I can see, I'll let you do maps or do you want me to ? [16:45:42] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4158586 (10Dzahn) Hi @mepps Your user has been created now and you are in the requested group. On one of the bastion hosts: [bast1002:~] $ id mepps uid=16947... 
[16:45:54] (03PS2) 10Urbanecm: Add all Hindi projects plus meta as import sources for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428952 (https://phabricator.wikimedia.org/T188366) [16:46:00] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4158587 (10Dzahn) 05Open>03Resolved [16:46:48] elukey: I'll do map around 8pm CEST if that's early enough for you [16:47:22] elukey: and thanks for the cleanup! [16:47:29] 10Operations, 10Ops-Access-Requests, 10Release-Engineering-Team, 10User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4158589 (10Dzahn) [16:52:18] (03PS2) 10Muehlenhoff: Reimage mwdebug servers with stretch [puppet] - 10https://gerrit.wikimedia.org/r/428923 (https://phabricator.wikimedia.org/T174431) [16:53:48] PROBLEM - Host wdqs1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:03] gehel: good for me, I might not be around but if you ping me on hangouts I'll join [16:56:01] wdqs1004 is actually running [16:56:04] networking went down? [16:56:21] can't SSH into it, trying admin console [16:56:39] the admin console tells me it's running [16:56:52] at login [16:57:17] doesn't mean i can login though [16:57:40] mutante: since you're already there, can you check if you can login and what state the network is in? [16:57:48] gehel: i can't login [16:57:54] (03CR) 10Muehlenhoff: [C: 032] Reimage mwdebug servers with stretch [puppet] - 10https://gerrit.wikimedia.org/r/428923 (https://phabricator.wikimedia.org/T174431) (owner: 10Muehlenhoff) [16:58:08] mutante: ok, not good :/ powercycle? [16:58:16] sure, cycling it [17:00:01] !log powercycling wdqs1004 [17:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for Morning SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180425T1700). [17:00:04] subbu and Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:05] gehel: that was me [17:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:21] i am working in the rack and accidently pulled your network cable [17:00:42] o/ [17:01:06] cmjohnson1: ok, no problem! Good to know it's minor! [17:01:07] o/ [17:02:19] PROBLEM - nutcracker port on mw2258 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:02:19] PROBLEM - HHVM processes on mw2258 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:03:31] ACKNOWLEDGEMENT - HHVM processes on mw2258 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. daniel_zahn reinstall [17:03:31] ACKNOWLEDGEMENT - HHVM rendering on mw2258 is CRITICAL: connect to address 10.192.16.57 and port 80: Connection refused daniel_zahn reinstall [17:03:31] ACKNOWLEDGEMENT - nutcracker port on mw2258 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. daniel_zahn reinstall [17:03:31] ACKNOWLEDGEMENT - nutcracker process on mw2258 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. daniel_zahn reinstall [17:03:58] PROBLEM - Apache HTTP on mw2163 is CRITICAL: connect to address 10.192.32.51 and port 80: Connection refused [17:04:08] PROBLEM - nutcracker process on mw2165 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:04:08] PROBLEM - HHVM processes on mw2165 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[17:04:28] PROBLEM - nutcracker port on mw2165 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [17:04:40] oh come on, yes [17:04:40] I can SWAT [17:04:59] RECOVERY - HHVM processes on mw2165 is OK: PROCS OK: 6 processes with command name hhvm [17:05:08] ACKNOWLEDGEMENT - Apache HTTP on mw2163 is CRITICAL: connect to address 10.192.32.51 and port 80: Connection refused daniel_zahn reinstall [17:05:08] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw2163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.167 second response time daniel_zahn reinstall [17:05:26] gehel: looks like wdqs1004 is dead, could you take a look? [17:05:34] (03PS2) 10Thcipriani: Enable RemexHtml on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427181 (https://phabricator.wikimedia.org/T192301) (owner: 10Subramanya Sastry) [17:05:39] PROBLEM - HHVM rendering on mw2165 is CRITICAL: connect to address 10.192.32.53 and port 80: Connection refused [17:05:41] SMalyshev: yep, cable issue, coming back [17:05:50] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427181 (https://phabricator.wikimedia.org/T192301) (owner: 10Subramanya Sastry) [17:06:04] gehel: cool, thanks! [17:06:31] cmjohnson1: can you ping me when the cable is back in place so I can check things work fine? [17:06:34] Amir1, or anyone else swatting? [17:06:51] thcipriani: is SWATing [17:06:52] gehel it's been back in place..I immediately put it back as soon as I realized my mistake [17:06:56] let me check it [17:07:00] mine is not testable [17:07:06] cmjohnson1: I still can't SSH... 
[17:07:10] we need to run updateCollation though [17:07:13] (03Merged) 10jenkins-bot: Enable RemexHtml on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427181 (https://phabricator.wikimedia.org/T192301) (owner: 10Subramanya Sastry) [17:07:50] (03PS3) 10Muehlenhoff: Don't include mediawiki::multimedia on labweb* [puppet] - 10https://gerrit.wikimedia.org/r/428298 [17:07:57] subbu: RemexHtml on frwikiquote is on mwdebug1002, check please [17:07:59] RECOVERY - Apache HTTP on mw2163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.154 second response time [17:08:12] thanks. testing. [17:08:30] (03CR) 10Krinkle: [C: 031] graphite: allow data requests from performance.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428836 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [17:08:40] thcipriani, looks good. [17:09:14] subbu: okie doke, going live [17:09:25] k [17:09:51] (03PS2) 10Arturo Borrero Gonzalez: labs_bootstrapvz: remove /var/lib/puppet/ssl in firstboot.sh script [puppet] - 10https://gerrit.wikimedia.org/r/428694 (https://phabricator.wikimedia.org/T181523) [17:11:08] RECOVERY - nutcracker process on mw2165 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [17:11:08] RECOVERY - Host wdqs1004 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [17:11:17] cmjohnson1: ok, wdqs1004 is back, I can SSH [17:11:28] RECOVERY - nutcracker port on mw2165 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [17:11:29] sorry about that [17:11:30] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:427181|Enable RemexHtml on frwikiquote]] T192301 (duration: 01m 17s) [17:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:37] T192301: Enable RemexHTML on frwikiquote - https://phabricator.wikimedia.org/T192301 [17:11:42] subbu: ^ live everywhere [17:11:49] RECOVERY - HHVM rendering on mw2165 is OK: HTTP 
OK: HTTP/1.1 200 OK - 73606 bytes in 8.565 second response time [17:11:50] \o/ k [17:11:53] thakns. [17:11:55] *thanks [17:12:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428951 (owner: 10Ladsgroup) [17:12:35] 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4158687 (10Cmjohnson) [17:12:53] Amir1: are you going to run updateCollation after your change is deployed? Or do you need me to? [17:13:10] (03CR) 10jenkins-bot: Enable RemexHtml on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427181 (https://phabricator.wikimedia.org/T192301) (owner: 10Subramanya Sastry) [17:13:17] thcipriani: I will do it, probably tomorrow [17:13:28] PROBLEM - WDQS HTTP on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 80: Connection refused [17:13:28] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: connect to address 10.64.0.17 and port 80: Connection refused [17:13:29] PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused [17:13:29] PROBLEM - Check whether ferm is active by checking the default input chain on wdqs1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:13:29] PROBLEM - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:13:33] Amir1: ok [17:14:09] ^ wdqs1004 is just getting back up, should be good in a minute, but I'm checking [17:14:26] (03CR) 10Thcipriani: Change fawiki's uca to the right one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428951 (owner: 10Ladsgroup) [17:14:28] PROBLEM - puppet last run on wdqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:14:29] (03PS3) 10Thcipriani: Change fawiki's uca to the right one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428951 (owner: 10Ladsgroup) [17:14:37] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428951 (owner: 10Ladsgroup) [17:15:28] RECOVERY - WDQS HTTP on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 16541 bytes in 0.001 second response time [17:15:28] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 16541 bytes in 0.001 second response time [17:15:29] RECOVERY - WDQS HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.023 second response time [17:15:55] (03Merged) 10jenkins-bot: Change fawiki's uca to the right one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428951 (owner: 10Ladsgroup) [17:16:28] RECOVERY - Check whether ferm is active by checking the default input chain on wdqs1004 is OK: OK ferm input default policy is set [17:16:28] RECOVERY - Check systemd state on wdqs1004 is OK: OK - running: The system is fully operational [17:16:53] Amir1: nothing to test, correct? [17:17:03] yeah [17:17:14] also tested it in other Persian Wikis before [17:17:54] ok, going live [17:18:13] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#3561778 (10Dzahn) I checked and all the mw22* are getting RAID due to this: mw22*) echo partman/mw-raid1.cfg ;; \ But mw216* hosts like mw2163, 2... 
[17:18:18] RECOVERY - puppet last run on wdqs1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:20:04] Thanks [17:20:08] (03CR) 10jenkins-bot: Change fawiki's uca to the right one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428951 (owner: 10Ladsgroup) [17:20:38] RECOVERY - HHVM processes on mw2258 is OK: PROCS OK: 6 processes with command name hhvm [17:20:42] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:428951|Change fawiki uca to the right one]] (duration: 01m 17s) [17:20:47] ^ Amir1 live everywhere [17:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:17] (03PS1) 10Dzahn: install_server: let mw21[6-9] have software RAID [puppet] - 10https://gerrit.wikimedia.org/r/428961 (https://phabricator.wikimedia.org/T174431) [17:22:31] (03PS2) 10Dzahn: install_server: let mw21[6-9] have software RAID [puppet] - 10https://gerrit.wikimedia.org/r/428961 (https://phabricator.wikimedia.org/T174431) [17:23:55] (03PS3) 10Dzahn: install_server: let mw21[6-9][0-9] have software RAID [puppet] - 10https://gerrit.wikimedia.org/r/428961 (https://phabricator.wikimedia.org/T174431) [17:30:54] (03CR) 10Imarlier: graphite: allow data requests from performance.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428836 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [17:35:20] !log starting cleanups on row 'a' Cassandra nodes -- T189822 [17:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:26] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [17:37:48] RECOVERY - nutcracker port on mw2258 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [17:40:31] (03Abandoned) 10Muehlenhoff: Inline role::mediawiki::scaler [puppet] - 10https://gerrit.wikimedia.org/r/428295 (owner: 10Muehlenhoff) [17:50:44] thcipriani: Thank you! 
[17:50:52] yw :) [17:52:38] (03PS1) 10Ottomata: Blacklist job|change-prop proper mirror maker instance for main -> analytics [puppet] - 10https://gerrit.wikimedia.org/r/428962 [17:53:25] (03CR) 10Ottomata: [C: 032] Blacklist job|change-prop proper mirror maker instance for main -> analytics [puppet] - 10https://gerrit.wikimedia.org/r/428962 (owner: 10Ottomata) [17:53:57] moritzm: merging your mwdebug stretch change [17:56:23] (03PS1) 10Ottomata: Add add_ip6_mapped to kafka100* [puppet] - 10https://gerrit.wikimedia.org/r/428963 (https://phabricator.wikimedia.org/T192832) [17:56:41] (03PS2) 10Ottomata: Add add_ip6_mapped to kafka100* [puppet] - 10https://gerrit.wikimedia.org/r/428963 (https://phabricator.wikimedia.org/T192832) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180425T1800) [18:06:54] (03PS1) 10Niharika29: Graduate CodeMirror from beta on 2017 Wikitext Editor for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) [18:08:15] (03CR) 10Herron: [C: 032] coal: Point systemd and uwsgi config to scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [18:08:20] (03PS5) 10Herron: coal: Point systemd and uwsgi config to scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/428659 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [18:20:31] PROBLEM - Kafka Broker Replica Max Lag on kafka1003 is CRITICAL: 5.028e+05 ge 5e+05 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1003 [18:20:42] (03PS2) 10Niharika29: Enable CodeMirror on RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) [18:21:21] PROBLEM - Kafka Broker Replica Max Lag on kafka1001 is 
CRITICAL: 5.059e+05 ge 5e+05 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka1001 [18:24:37] (03PS1) 10Ppchelko: Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) [18:25:25] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [18:28:02] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [18:28:12] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [18:28:58] (03PS2) 10Ppchelko: Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) [18:30:11] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [18:30:17] (03PS3) 10Ppchelko: Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) [18:31:28] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [18:32:40] (03PS4) 10Ppchelko: Disable Redis queue for most of jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) [18:33:52] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [18:36:51] (03PS5) 10Ppchelko: Disable Redis queue for most of jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) [18:41:12] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [18:41:22] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [18:44:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [18:44:22] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [18:47:13] (03PS3) 10Ottomata: Add add_ip6_mapped to kafka100* [puppet] - 10https://gerrit.wikimedia.org/r/428963 (https://phabricator.wikimedia.org/T192832) [18:47:17] (03CR) 10Ottomata: [V: 032 C: 032] Add add_ip6_mapped to kafka100* [puppet] - 10https://gerrit.wikimedia.org/r/428963 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [18:49:18] (03PS6) 10Ppchelko: Disable Redis queue for most of jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428972 (https://phabricator.wikimedia.org/T190327) [18:49:22] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [18:49:31] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [18:52:21] 10Operations, 10Wikispeech, 10Wikispeech-WMSE: TTS server deployment strategy - https://phabricator.wikimedia.org/T193072#4159055 (10Reedy) [18:55:45] !log imarlier@tin Started deploy [performance/coal@1e79c79]: deploy fix for coal-web [18:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:51] !log imarlier@tin Finished deploy [performance/coal@1e79c79]: deploy fix for coal-web (duration: 00m 06s) [18:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] no_justification: #bothumor I ♥ Unicode. All rise for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180425T1900). 
[19:01:28] (03PS1) 10Imarlier: coal: remove files that aren't needed any longer [puppet] - 10https://gerrit.wikimedia.org/r/428980 (https://phabricator.wikimedia.org/T191994) [19:05:41] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [19:06:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [19:09:47] !log otto@tin Started deploy [eventlogging/eventbus@f562c1b]: Fix for logging error https://gerrit.wikimedia.org/r/#/c/428982/ [19:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:13] !log otto@tin Started deploy [eventlogging/eventbus@f0783bb]: Fix for logging error https://gerrit.wikimedia.org/r/#/c/428982/ [19:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:23] !log otto@tin Finished deploy [eventlogging/eventbus@f0783bb]: Fix for logging error https://gerrit.wikimedia.org/r/#/c/428982/ (duration: 00m 11s) [19:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:44] !log altering timeline tables for 6 month TTL -- T192689 [19:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:50] T192689: Unchecked storage growth(?) - https://phabricator.wikimedia.org/T192689 [19:15:59] (03CR) 10Jforrester: [C: 031] "Good to go once 1.32.0-wmf.1 is everywhere (so, Thursday evening SWAT)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) (owner: 10Niharika29) [19:16:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [19:16:42] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [19:18:23] Would anyone be able to merge https://gerrit.wikimedia.org/r/#/c/428836/ for me? [19:20:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [19:20:51] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [19:21:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [19:21:51] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [19:28:54] (03CR) 10Ottomata: [C: 032] Add IPv6 entries for kafka[12]00[123] [dns] - 10https://gerrit.wikimedia.org/r/428926 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [19:29:43] looking :) [19:31:10] marlier: did you want to do what timo said and remove http support? [19:32:00] Yeah, I can pull that out. Give me one second. [19:33:09] (03PS2) 10Imarlier: graphite: allow data requests from performance.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/428836 (https://phabricator.wikimedia.org/T191994) [19:33:24] ottomata: Just pushed that up, will take a couple of minutes for the tests to run, I assume. [19:33:50] (03CR) 10Niharika29: "Thanks James. It's scheduled for tomorrow evening." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/428968 (https://phabricator.wikimedia.org/T191923) (owner: 10Niharika29) [19:34:01] (03CR) 10Hashar: "I guess we would need to carry some other packages as well:" [puppet] - 10https://gerrit.wikimedia.org/r/428314 (owner: 10Muehlenhoff) [19:37:17] !log otto@tin Started deploy [eventlogging/eventbus@f0783bb]: Fix for logging error https://gerrit.wikimedia.org/r/#/c/428982/ [19:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:03] !log otto@tin Finished deploy [eventlogging/eventbus@f0783bb]: Fix for logging error https://gerrit.wikimedia.org/r/#/c/428982/ (duration: 01m 45s) [19:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:25] (03CR) 10Ottomata: [C: 032] graphite: allow data requests from performance.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/428836 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [19:39:41] merged marlier [19:39:56] ottomata: sweet, thanks! [19:49:51] 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4159243 (10ayounsi) a:05ayounsi>03None [19:53:40] 10Operations, 10hardware-requests: Eqiad: hardware request for 2 HP D3600 Enclosures - https://phabricator.wikimedia.org/T193079#4159250 (10chasemp) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: How many deployers does it take to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180425T2000). [20:00:37] no parsoid deploy today [20:09:33] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4159296 (10ayounsi) [20:11:16] Nothing for ORES. 
[20:11:36] (03PS1) 10Ottomata: Enable eventbus Kafka producer snappy compression [puppet] - 10https://gerrit.wikimedia.org/r/429007 (https://phabricator.wikimedia.org/T193080) [20:11:57] !log bsitzmann@tin Started deploy [mobileapps/deploy@5a4a282]: Config: Start up to 4 workers in parallel during start-up [20:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:34] (03PS1) 10Ayounsi: Ping offload: remove test VIP [puppet] - 10https://gerrit.wikimedia.org/r/429012 (https://phabricator.wikimedia.org/T190090) [20:18:45] !log bsitzmann@tin Finished deploy [mobileapps/deploy@5a4a282]: Config: Start up to 4 workers in parallel during start-up (duration: 06m 48s) [20:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:33] (03PS1) 10Ayounsi: Ping offload: remove test VIP from DNS [dns] - 10https://gerrit.wikimedia.org/r/429013 (https://phabricator.wikimedia.org/T190090) [20:19:56] (03CR) 10Ayounsi: [C: 032] Ping offload: remove test VIP [puppet] - 10https://gerrit.wikimedia.org/r/429012 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [20:21:30] !log remove test VIP for eqiad ping offload server - T190090 [20:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:36] T190090: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 [20:22:14] (03CR) 10Ayounsi: [C: 032] Ping offload: remove test VIP from DNS [dns] - 10https://gerrit.wikimedia.org/r/429013 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [20:29:40] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4159322 (10ayounsi) [20:34:00] (03PS1) 10MaxSem: Redeploy ArticleCreationWorkflow, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) [20:34:24] (03CR) 10jerkins-bot: [V: 04-1] Redeploy ArticleCreationWorkflow, part 1 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) (owner: 10MaxSem) [20:41:46] (03PS1) 10Chad: group1 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429096 [20:42:47] 10Operations, 10Cloud-VPS, 10hardware-requests: Codfw: (1) hardware access request for labtestnet1001 refresh - https://phabricator.wikimedia.org/T193081#4159335 (10chasemp) p:05Triage>03Normal [20:44:04] 10Operations, 10Cloud-VPS, 10hardware-requests: eqiad: (2) systems for labvirt expansion (labvirt1023 & labvirt1024) - https://phabricator.wikimedia.org/T192119#4159357 (10chasemp) p:05Triage>03Normal a:03RobH [20:53:34] (03PS2) 10MaxSem: Redeploy ArticleCreationWorkflow, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429017 (https://phabricator.wikimedia.org/T192455) [20:53:50] Krinkle: i just saw you edited the commit message on https://gerrit.wikimedia.org/r/#/c/392030/ a couple days ago. does this mean we are ready for that though? 
[20:55:48] (03PS1) 10Ayounsi: Assign IP for ping2001.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/429099 (https://phabricator.wikimedia.org/T190090) [20:57:54] (03CR) 10Chad: [C: 032] group1 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429096 (owner: 10Chad) [20:57:58] (03PS1) 10MaxSem: Redeploy ArticleCreationWorkflow, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429100 (https://phabricator.wikimedia.org/T192455) [20:58:33] !log on tin: rebased php-1.31.0-wmf.30 for https://gerrit.wikimedia.org/r/#/c/429018/ [20:58:35] (03CR) 10Ayounsi: [C: 032] Assign IP for ping2001.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/429099 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [20:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:21] (03Merged) 10jenkins-bot: group1 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429096 (owner: 10Chad) [20:59:45] (03PS1) 10Eevans: cassandra: increase `vm.max_map_count` to 1048575 [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) [21:00:37] (03PS2) 10Eevans: cassandra: increase `vm.max_map_count` to 1048575 [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) [21:00:42] (03CR) 10jenkins-bot: group1 to 1.32.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429096 (owner: 10Chad) [21:01:17] !log demon@tin Synchronized php: symlink bump (duration: 01m 16s) [21:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:55] !log demon@tin rebuilt and synchronized wikiversions files: group1 to wmf.1 [21:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:07] (03CR) 10Eevans: "I'm not certain this is needed on the other clusters, *BUT*, I'm also not certain that would hurt to be applied on the other clusters. 
Be" [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [21:06:55] mutante: No, it is not ready. [21:06:58] still -1 [21:07:34] we'll probably do it a different way, and move navtiming/coal separately, and also remove it from hafnium at the same time. [21:08:11] also, possibly moving to webperf1001 before adding webperf1002 to make sure everything is clean and fine on the new host, separetly from multi-dc concerns. [21:12:32] PROBLEM - mediawiki-installation DSH group on mw2163 is CRITICAL: Host mw2163 is not in mediawiki-installation dsh group [21:12:48] (03CR) 10Eevans: "PC output: http://puppet-compiler.wmflabs.org/11039" [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [21:15:31] PROBLEM - mediawiki-installation DSH group on mw2164 is CRITICAL: Host mw2164 is not in mediawiki-installation dsh group [21:16:59] Krinkle: ok, thanks. it sounds like i should just abandon it then. [21:17:21] mw2163 and mw2164 will be reinstalled a second time in order to get RAID they should have [21:17:21] mutante: That's fine yeah, we might re-open it at some point, or use it as starting point. [21:17:30] Thanks though! 
[21:18:06] (03Abandoned) 10Dzahn: webperf1001/2001 start using webperf role [puppet] - 10https://gerrit.wikimedia.org/r/392030 (https://phabricator.wikimedia.org/T186774) (owner: 10Dzahn) [21:18:26] you're welcome, no worries [21:19:31] PROBLEM - mediawiki-installation DSH group on mw2165 is CRITICAL: Host mw2165 is not in mediawiki-installation dsh group [21:19:47] (03PS1) 10Ayounsi: Ping offload, dhcp, partman and puppet for ping2001 [puppet] - 10https://gerrit.wikimedia.org/r/429106 (https://phabricator.wikimedia.org/T190090) [21:20:57] (03CR) 10Ayounsi: [C: 032] Ping offload, dhcp, partman and puppet for ping2001 [puppet] - 10https://gerrit.wikimedia.org/r/429106 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [21:33:00] (03CR) 10Imarlier: "@Aaron Still worth looking at this?" [debs/dynomite] - 10https://gerrit.wikimedia.org/r/421447 (owner: 10Aaron Schulz) [21:33:37] (03PS4) 10Dzahn: install_server: let mw21[6-9][0-9] have software RAID [puppet] - 10https://gerrit.wikimedia.org/r/428961 (https://phabricator.wikimedia.org/T174431) [21:34:30] (03CR) 10Dzahn: [C: 032] install_server: let mw21[6-9][0-9] have software RAID [puppet] - 10https://gerrit.wikimedia.org/r/428961 (https://phabricator.wikimedia.org/T174431) (owner: 10Dzahn) [21:54:56] (03PS1) 10Dzahn: base: update version of gen_fingerprints script [puppet] - 10https://gerrit.wikimedia.org/r/429114 [22:00:04] samwilson and MaxSem: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for GlobalPreferences test deployment . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180425T2200). 
[22:05:31] (03PS2) 10Samwilson: Deploy GlobalPreferences to test wikis and mw.org (third time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428554 [22:07:13] (03CR) 10Samwilson: [C: 032] Deploy GlobalPreferences to test wikis and mw.org (third time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428554 (owner: 10Samwilson) [22:08:29] (03Merged) 10jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org (third time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428554 (owner: 10Samwilson) [22:09:34] (03CR) 10jenkins-bot: Deploy GlobalPreferences to test wikis and mw.org (third time) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428554 (owner: 10Samwilson) [22:09:42] (03CR) 10Mobrovac: "While I agree that no harm would come from increasing the map count overall, we can also modify the value in RB's role Hiera keeping the o" [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [22:15:22] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4159636 (10ayounsi) [22:19:31] PROBLEM - mediawiki-installation DSH group on mw2258 is CRITICAL: Host mw2258 is not in mediawiki-installation dsh group [22:21:54] !log samwilson@tin Synchronized wmf-config/InitialiseSettings.php: Deploy GlobalPreferences T189806 (duration: 01m 18s) [22:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:01] T189806: Deploy GlobalPrefs on production - https://phabricator.wikimedia.org/T189806 [22:32:48] (03PS1) 10Samwilson: Revert "Deploy GlobalPreferences to test wikis and mw.org (third time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429119 [22:51:47] (03CR) 10MaxSem: [C: 032] Revert "Deploy GlobalPreferences to test wikis and mw.org (third time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429119 (owner: 10Samwilson) [22:53:05] (03Merged) 10jenkins-bot: Revert 
"Deploy GlobalPreferences to test wikis and mw.org (third time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429119 (owner: 10Samwilson) [22:53:20] (03CR) 10jenkins-bot: Revert "Deploy GlobalPreferences to test wikis and mw.org (third time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429119 (owner: 10Samwilson) [22:55:45] !log samwilson@tin Synchronized wmf-config/InitialiseSettings.php: Undeploy GlobalPreferences T184121 (duration: 01m 16s) [22:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:52] T184121: Deploy checklist for GlobalPreferences on production - https://phabricator.wikimedia.org/T184121 [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I ♥ Unicode. All rise for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180425T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:17:47] (03CR) 10Aaron Schulz: "We still need to redo session storage (currently redis) to be more multi-dc robust (specfically with regard to server failures and re-shar" [debs/dynomite] - 10https://gerrit.wikimedia.org/r/421447 (owner: 10Aaron Schulz) [23:19:32] RECOVERY - mediawiki-installation DSH group on mw2258 is OK: OK [23:31:48] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=75%) [23:33:50] jouncebot: now [23:33:50] For the next 0 hour(s) and 26 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180425T2300)