[00:43:35] (PS35) Ladsgroup: ores: Scap3 deployment configurations [puppet] - https://gerrit.wikimedia.org/r/280403 [00:46:37] (PS36) Ladsgroup: ores: Scap3 deployment configurations [puppet] - https://gerrit.wikimedia.org/r/280403 [00:48:04] (PS1) Dereckson: Use extension registration for LabeledSectionTransclusion [mediawiki-config] - https://gerrit.wikimedia.org/r/281237 (https://phabricator.wikimedia.org/T119117) [00:50:11] (PS37) Ladsgroup: ores: Scap3 deployment configurations [puppet] - https://gerrit.wikimedia.org/r/280403 [00:52:40] (PS1) Dereckson: Use extension registration for SpamBlacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/281239 (https://phabricator.wikimedia.org/T119117) [00:57:28] (PS38) Ladsgroup: ores: Scap3 deployment configurations [puppet] - https://gerrit.wikimedia.org/r/280403 [01:02:18] (PS1) Dereckson: Use extension registration for TitleBlacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/281240 (https://phabricator.wikimedia.org/T119117) [01:10:47] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:57] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:11:07] PROBLEM - dhclient process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:07] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:26] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:11:26] PROBLEM - salt-minion processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:11:59] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:12:07] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:12:26] PROBLEM - configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:12:26] PROBLEM - Check size of conntrack table on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:12:38] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:12:56] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:12:56] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:13:52] (PS1) Dereckson: Use extension registration for Quiz [mediawiki-config] - https://gerrit.wikimedia.org/r/281242 (https://phabricator.wikimedia.org/T119117) [01:14:38] (PS1) Dereckson: Use extension registration for FundraisingTranslateWorkflow [mediawiki-config] - https://gerrit.wikimedia.org/r/281243 (https://phabricator.wikimedia.org/T119117) [01:17:17] (PS1) Dereckson: Use extension registration for Gadgets [mediawiki-config] - https://gerrit.wikimedia.org/r/281244 (https://phabricator.wikimedia.org/T119117) [01:26:38] (PS39) Ladsgroup: ores: Scap3 deployment configurations [puppet] - https://gerrit.wikimedia.org/r/280403 [01:35:57] RECOVERY - Disk space on mw1114 is OK: DISK OK [01:36:27] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0] [01:36:38] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [01:36:38] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [01:36:47] RECOVERY - DPKG on mw1114 is OK: All packages OK [01:39:35] (PS40) Ladsgroup: ores: Scap3 deployment configurations [puppet] - https://gerrit.wikimedia.org/r/280403 [01:41:49] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:42:27] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:42:46] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:42:46] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:44:36] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [01:44:37] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [01:44:37] RECOVERY - DPKG on mw1114 is OK: All packages OK [01:44:57] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [01:44:57] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [01:46:51] (PS41) Ladsgroup: ores: Scap3 deployment configurations [puppet] - https://gerrit.wikimedia.org/r/280403 [01:48:58] RECOVERY - salt-minion processes on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:48:58] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [01:49:39] RECOVERY - Disk space on mw1114 is OK: DISK OK [01:50:51] (PS1) Dereckson: Use extension registration for MwEmbedSupport [mediawiki-config] - https://gerrit.wikimedia.org/r/281246 (https://phabricator.wikimedia.org/T119117) [01:53:47] RECOVERY - configured eth on mw1114 is OK: OK - interfaces up [01:53:48] RECOVERY - Check size of conntrack table on mw1114 is OK: OK: nf_conntrack is 0 % full [01:54:54] (PS42) Ladsgroup: ores: Scap3 deployment configurations [puppet] - https://gerrit.wikimedia.org/r/280403 [01:55:29] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [02:21:17] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [02:34:07] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:34:07] PROBLEM - salt-minion processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:34:36] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:47] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:06] PROBLEM - configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:06] PROBLEM - Check size of conntrack table on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:18] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:36] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:37] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:47] PROBLEM - dhclient process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:47] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:19] RECOVERY - Disk space on mw1114 is OK: DISK OK [02:38:57] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [02:39:17] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [02:39:26] RECOVERY - DPKG on mw1114 is OK: All packages OK [02:39:26] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [02:39:26] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [02:39:47] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [02:39:47] RECOVERY - salt-minion processes on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:40:37] RECOVERY - configured eth on mw1114 is OK: OK - interfaces up [02:40:46] RECOVERY - Check size of conntrack table on mw1114 is OK: OK: nf_conntrack is 0 % full [02:44:17] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [02:49:57] PROBLEM - RAID on 
mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:16] PROBLEM - configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:16] PROBLEM - Check size of conntrack table on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:37] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:56] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:56] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:58] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:50:58] PROBLEM - dhclient process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:51:16] PROBLEM - salt-minion processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:51:16] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:38] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:14:35] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.19) (duration: 56m 43s) [03:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:14:46] RECOVERY - Check size of conntrack table on mw1114 is OK: OK: nf_conntrack is 0 % full [03:14:46] RECOVERY - configured eth on mw1114 is OK: OK - interfaces up [03:15:07] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [03:15:26] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [03:15:26] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [03:15:26] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [03:15:27] RECOVERY - DPKG on mw1114 is OK: All packages OK [03:15:37] RECOVERY - salt-minion processes on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:15:37] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:16:06] RECOVERY - Disk space on mw1114 is OK: DISK OK [03:16:26] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [03:22:26] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:22:26] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:22:37] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:22:37] PROBLEM - dhclient process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:22:47] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:27] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:23:38] PROBLEM - configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:23:38] PROBLEM - Check size of conntrack table on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:24:06] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:24:36] PROBLEM - salt-minion processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:24:56] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:46:16] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [03:46:17] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [03:46:38] RECOVERY - configured eth on mw1114 is OK: OK - interfaces up [03:46:46] RECOVERY - Check size of conntrack table on mw1114 is OK: OK: nf_conntrack is 0 % full [03:47:06] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [03:47:08] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [03:47:16] RECOVERY - DPKG on mw1114 is OK: All packages OK [03:47:17] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [03:47:17] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [03:47:26] RECOVERY - salt-minion processes on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:47:26] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [03:47:47] RECOVERY - Disk space on mw1114 is OK: DISK OK [03:48:07] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [03:52:38] PROBLEM - dhclient process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:38] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:52:47] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:52:47] PROBLEM - salt-minion processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:53:16] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:53:28] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:53:46] PROBLEM - Check size of conntrack table on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:53:47] PROBLEM - configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:54:07] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:54:26] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:54:26] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:01:21] (PS43) Ladsgroup: ores: Scap3 deployment configurations [puppet] - https://gerrit.wikimedia.org/r/280403 (https://phabricator.wikimedia.org/T130404) [04:09:26] PROBLEM - RAID on db1052 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [04:18:26] RECOVERY - configured eth on mw1114 is OK: OK - interfaces up [04:18:26] RECOVERY - Check size of conntrack table on mw1114 is OK: OK: nf_conntrack is 0 % full [04:18:46] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [04:18:57] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [04:18:57] RECOVERY - DPKG on mw1114 is OK: All packages OK [04:18:58] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [04:18:58] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [04:19:07] RECOVERY - salt-minion processes on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:19:07] 
RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [04:19:36] RECOVERY - Disk space on mw1114 is OK: DISK OK [04:23:46] PROBLEM - Check size of conntrack table on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:23:46] PROBLEM - configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:24:07] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:24:27] PROBLEM - dhclient process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:24:27] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:24:37] PROBLEM - salt-minion processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:24:37] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:24:57] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:26:08] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:26:08] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:33:26] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [04:38:47] PROBLEM - dhclient process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:45:27] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: puppet fail [04:50:06] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [04:50:07] RECOVERY - Check size of conntrack table on mw1114 is OK: OK: nf_conntrack is 0 % full [04:50:07] RECOVERY - configured eth on mw1114 is OK: OK - interfaces up [04:50:37] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [04:50:48] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [04:50:48] RECOVERY - DPKG on mw1114 is OK: All packages OK [04:50:57] RECOVERY - dhclient process on mw1114 is OK: PROCS OK: 0 processes with command name dhclient [04:50:57] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212 [04:50:57] RECOVERY - salt-minion processes on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:50:57] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [04:51:26] RECOVERY - Disk space on mw1114 is OK: DISK OK [05:13:37] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:29:37] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:47] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old. 
[06:30:27] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:28] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:47] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: puppet fail [06:31:06] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:37] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:37] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:38] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 3 failures [06:34:38] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:46] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:57] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:43:57] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old. 
[06:49:59] (PS1) 01tonythomas: Add Newsletter extension to beta [mediawiki-config] - https://gerrit.wikimedia.org/r/281249 (https://phabricator.wikimedia.org/T127297) [06:55:37] (CR) Addshore: [C: 1] Add Newsletter extension to beta [mediawiki-config] - https://gerrit.wikimedia.org/r/281249 (https://phabricator.wikimedia.org/T127297) (owner: 01tonythomas) [06:55:57] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:37] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:56:58] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:06] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:18] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:57:27] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:46] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:57:47] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:57:57] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:58] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:37] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:33:08] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: puppet fail [08:01:26] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:10:32] 
ACKNOWLEDGEMENT - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist Faidon Liambotis Not set up yet [08:10:58] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.131 second response time [08:11:16] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 64613 bytes in 0.326 second response time [08:31:00] (PS1) Jforrester: Enable VisualEditor on the Project ('Wikipedya') of htwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281263 (https://phabricator.wikimedia.org/T130177) [08:37:57] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [08:52:05] (PS1) Merlijn van Deen: Install python-numpy and python-pandas [puppet] - https://gerrit.wikimedia.org/r/281271 [08:52:29] YuviPanda: ^ [08:53:42] (PS2) Yuvipanda: Install python-numpy and python-pandas [puppet] - https://gerrit.wikimedia.org/r/281271 (owner: Merlijn van Deen) [08:53:49] (CR) Yuvipanda: [C: 2 V: 2] Install python-numpy and python-pandas [puppet] - https://gerrit.wikimedia.org/r/281271 (owner: Merlijn van Deen) [08:53:51] <3 [08:54:21] valhallasw`cloud: :D thanks! going to bed really soon tho [08:54:32] there's probably some opsen here that I can get to revert stuff [08:58:15] * YuviPanda nods [08:58:23] valhallasw`cloud: maybe one day we can get bd808 root [08:58:59] YuviPanda: heh. the last time I asked that was a very loud NO! 
but that was a couple of years ago [09:01:11] :) [09:03:28] puppet is horribly slow [09:03:38] or maybe it's just all of the exec hosts [09:03:42] or nfs, or all of them :{ [09:06:33] Operations, Wikimedia-Language-setup, Wikimedia-Site-Requests, Tracking: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) (tracking) - https://phabricator.wikimedia.org/T10217#2174690 (Aklapper) p:High>Normal The [[ https://www.... [09:14:39] (PS3) Mobrovac: Scap3: chown the target root dir if owned by root [puppet] - https://gerrit.wikimedia.org/r/279415 [09:25:58] YuviPanda: what on earth. Puppet is still running. [09:26:28] (PS4) Mobrovac: Scap3: chown the target root dir if owned by root [puppet] - https://gerrit.wikimedia.org/r/279415 [09:34:39] (CR) Mobrovac: "https://puppet-compiler.wmflabs.org/2284/ is happy" [puppet] - https://gerrit.wikimedia.org/r/279415 (owner: Mobrovac) [10:14:18] (PS2) Jforrester: Enable VisualEditor Beta Feature on Wikisources, Wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/280828 [10:14:20] (PS1) Jforrester: Enable VisualEditor Beta Feature on Labs enwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/281283 [10:15:08] RoanKattouw: ^^ [10:15:29] (CR) Catrope: [C: 2] Enable VisualEditor Beta Feature on Labs enwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/281283 (owner: Jforrester) [10:16:12] (Merged) jenkins-bot: Enable VisualEditor Beta Feature on Labs enwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/281283 (owner: Jforrester) [10:24:17] (PS15) Mobrovac: Kafka config: Add config functions [puppet] - https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) [10:26:28] (PS5) Mobrovac: Scap3: chown the target root dir if owned by root [puppet] - https://gerrit.wikimedia.org/r/279415 [10:27:21] (CR) Mobrovac: "@Ottomata, done in 
PS15" [puppet] - https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: Mobrovac) [10:32:36] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [10:33:06] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [10:33:28] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 25.93% of data above the critical threshold [100000000.0] [10:37:43] RoanKattouw: ^^^^ [10:37:53] Oh, oops, sorry [10:37:54] Will fix [10:39:37] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [10:57:15] Operations, ops-eqiad: db1052 degraded RAID - https://phabricator.wikimedia.org/T131701#2175325 (Volans) [10:58:08] ACKNOWLEDGEMENT - RAID on db1052 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Volans T131701 [10:58:27] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [11:22:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:26:28] (PS1) Ori.livneh: Load the Newsletter extension on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/281288 [11:29:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:31:30] (CR) Ori.livneh: [C: 2] "beta only" [mediawiki-config] - https://gerrit.wikimedia.org/r/281288 (owner: Ori.livneh) [11:31:55] (Merged) jenkins-bot: Load the Newsletter extension on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/281288 (owner: Ori.livneh) [11:33:57] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[11:34:08] !log ori@tin Synchronized wmf-config/CommonSettings-labs.php: I3ffe65b8: Load the Newsletter extension on the beta cluster (1/2) (duration: 00m 34s) [11:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:34:42] !log ori@tin Synchronized wmf-config/InitialiseSettings-labs.php: I3ffe65b8: Load the Newsletter extension on the beta cluster (2/2) (duration: 00m 33s) [11:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:04] Operations, ops-codfw, DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2062646 (Volans) @Papaul @RobH: any news on this? In particular for **db2017** (failed), **db2018** (failed) and **db2023** (predicted failure), that are masters in codfw, it would be better t... [11:35:16] (PS1) Ori.livneh: Also add Newsletter to extension-list-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/281291 [11:35:39] (CR) Ori.livneh: [C: 2] Also add Newsletter to extension-list-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/281291 (owner: Ori.livneh) [11:36:11] (Merged) jenkins-bot: Also add Newsletter to extension-list-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/281291 (owner: Ori.livneh) [11:37:02] !log ori@tin Synchronized wmf-config/extension-list-labs: I0d081186: Also add Newsletter to extension-list-labs (duration: 00m 27s) [11:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:45:07] PROBLEM - cassandra-a service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:45:26] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [11:45:47] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: Connection refused [12:02:57] RECOVERY - Disk space on restbase2004 is OK: DISK OK [12:02:57] RECOVERY - cassandra-a 
service on restbase2004 is OK: OK - cassandra-a is active [12:03:17] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [12:03:37] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.040 second response time on port 9042 [12:10:37] Operations, Project-Admins: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2175693 (Aklapper) Open>Resolved Resolving as per last comment. [12:14:41] Operations, Wikimedia-Language-setup, Wikimedia-Site-Requests, Tracking: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) (tracking) - https://phabricator.wikimedia.org/T10217#2175710 (Kaihsu) It has not suddenly become urgent. For th... [12:29:00] (PS1) Dereckson: Use extension registration for ExtensionDistributor [mediawiki-config] - https://gerrit.wikimedia.org/r/281303 (https://phabricator.wikimedia.org/T119117) [12:35:25] Operations, Wikimedia-Language-setup, Wikimedia-Site-Requests, Tracking: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) (tracking) - https://phabricator.wikimedia.org/T10217#2175793 (Liuxinyu970226) >>! In T10217#2175710, @Kaihsu wr... [12:49:34] Operations, ops-eqiad: db1052 degraded RAID - https://phabricator.wikimedia.org/T131701#2175871 (Cmjohnson) a:Cmjohnson [12:50:09] (PS1) Jforrester: Let Beta Cluster Commons do upload-from-URL from production Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/281307 [12:53:24] (CR) Reedy: "Do we need to set wgCopyUploadProxy?" [mediawiki-config] - https://gerrit.wikimedia.org/r/281307 (owner: Jforrester) [12:58:57] (CR) Jforrester: "I don't know." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/281307 (owner: Jforrester) [13:01:02] (CR) Reedy: [C: 2] "One way to find out" [mediawiki-config] - https://gerrit.wikimedia.org/r/281307 (owner: Jforrester) [13:01:13] | [13:01:13] * Reedy reserves the right to say "I told you so" [13:01:26] (Merged) jenkins-bot: Let Beta Cluster Commons do upload-from-URL from production Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/281307 (owner: Jforrester) [13:01:48] (PS1) Dereckson: Use extension registration for GlobalBlocking [mediawiki-config] - https://gerrit.wikimedia.org/r/281311 (https://phabricator.wikimedia.org/T119117) [13:02:36] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: labs copy upload thing. noooooop for prod (duration: 00m 28s) [13:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:07:23] Reedy: Uh-huh. [13:08:46] Reedy: "Copy uploads are not available from this domain." [13:08:55] Whitelist? [13:09:08] Oh [13:09:12] proxy being stupid? [13:09:21] Probably. [13:11:43] James_F: I think it's explicitly blacklisted [13:11:52] Hello, can someone tell me, which name the coursecoordinator group has? [13:12:00] I need it for a patch [13:14:36] Reedy: But it's specified for testwiki in prod… [13:14:43] Luke081515: ep-coordinator the group [13:14:45] * James_F checks if it works there or is just bitrot. [13:15:01] Luke081515: see https://www.mediawiki.org/wiki/Extension:Education_Program/Preferences#Course_coordinators [13:15:05] thanks [13:16:32] Reedy: It works in prod. https://test.wikipedia.org/wiki/File:VisualEditor-logo.svg just now… [13:16:51] James_F: See if flickr type thing works? [13:16:57] Kk. [13:17:01] Just to see if it's not just internal stuff [13:19:38] Gah. Slow Internet is slow.
[13:20:39] (PS1) Luke081515: Two permission changes at cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281314 (https://phabricator.wikimedia.org/T131684) [13:20:42] Reedy: Flickr via UploadWizard seemed to work fine (it errored about the licence, which suggests it got it). [13:21:57] Mmm [13:22:48] Beta Cluster is magic. [13:25:27] +testwiki' => array( 'upload.wikimedia.org' ), [13:25:57] '+commonswiki' => array( 'upload.wikimedia.org' ) [13:26:38] !log reedy@tin Synchronized wmf-config/InitialiseSettings-labs.php: labs copy upload thing. noooooop for prod (duration: 00m 33s) [13:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:26:58] Oh. Hah. [13:27:01] That might help. [13:27:05] Well, no it won't [13:27:09] That's prod not labs :P [13:27:14] But jenkins might not have done it yet [13:27:31] Yup, that works. [13:27:32] http://commons.wikimedia.beta.wmflabs.org/wiki/File:VisualEditor-logo.svg [13:27:41] Ha. Point. [13:27:44] It still now works. [13:27:45] :-D [13:28:09] haha [13:28:12] sweet [13:29:27] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [13:30:07] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [13:31:07] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [13:40:07] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [13:55:38] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:56:36] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:00:12] Operations, ops-eqiad: db1052 degraded RAID - https://phabricator.wikimedia.org/T131701#2176137 (Volans) @Cmjohnson: given the role of this DB, please sync with me before performing the disk swap. 
Btw, do we have a spare available? [16:53:59] (PS1) Urbanecm: Permission change at test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 [16:59:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0] [17:00:16] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0] [17:13:47] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [17:14:27] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:26:22] 503 Service Temporarily Unavailable [18:26:26] https://www.mediawiki.org/wiki/File:Mediawiki-vagrant-screenshot.png [18:42:37] (PS1) ArielGlenn: dumps: fix up one more directory reference for cron jobs [puppet] - https://gerrit.wikimedia.org/r/281321 [18:44:08] (CR) ArielGlenn: [C: 2] dumps: fix up one more directory reference for cron jobs [puppet] - https://gerrit.wikimedia.org/r/281321 (owner: ArielGlenn) [19:08:59] Danny_B, WFM [19:09:07] PROBLEM - HHVM rendering on mw1146 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.017 second response time [19:09:36] PROBLEM - Apache HTTP on mw1146 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [19:13:01] what's up with mw1146? [19:13:08] Danny_B, okay, I see problems with at least one host [19:14:17] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 66551 bytes in 1.027 second response time [19:14:47] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.068 second response time [19:14:51] well...
either me restarting apache2+hhvm did the trick, or that was a coincidence [19:16:05] Krenair: !log it [19:16:13] yeah I was writing my !log as you sent that [19:16:19] !log mw1146 began to respond with 503s to all requests, tried restarting apache2/hhvm and shortly afterwards it started working again [19:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:32] Ok :) [19:25:34] (PS2) Luke081515: Permission change at test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 (https://phabricator.wikimedia.org/T131037) (owner: Urbanecm) [19:26:31] (CR) Luke081515: [C: 1] Permission change at test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 (https://phabricator.wikimedia.org/T131037) (owner: Urbanecm) [19:57:03] (Abandoned) Tim Landscheidt: Tools: Source python-socketio-client for Trusty from backports [puppet] - https://gerrit.wikimedia.org/r/238662 (https://phabricator.wikimedia.org/T91874) (owner: Tim Landscheidt) [20:30:21] (PS1) ArielGlenn: delay dumps full run start by one more day this month [puppet] - https://gerrit.wikimedia.org/r/281323 [20:31:45] (CR) ArielGlenn: [C: 2] delay dumps full run start by one more day this month [puppet] - https://gerrit.wikimedia.org/r/281323 (owner: ArielGlenn) [20:55:40] (PS3) Dereckson: Permission change at test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 (https://phabricator.wikimedia.org/T131037) (owner: Urbanecm) [20:56:46] (CR) Dereckson: [C: 1] "PS3: adding a reference to the task number to ease tracking" [mediawiki-config] - https://gerrit.wikimedia.org/r/281318 (https://phabricator.wikimedia.org/T131037) (owner: Urbanecm) [21:03:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:17:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold
[250.0] [22:15:27] Hello ops people! I have a request from an enwiki admin to delete https://en.wikipedia.org/wiki/User_talk:Essjay as part of a history merge; can the servers survive that? [22:15:43] it has 5,631 revisions at last count so it shouldn't be *too* strenuous I don't think [22:17:17] YuviPanda, are you the person to ask about this? Or some !keyword to get sysadmin attention? [22:18:37] there were database problems as recently as a few hours ago [22:19:23] it's almost certainly fine, I think you can go ahead. [22:19:30] awesome, thanks :) [22:24:22] deleted, nothing seems to be broken yet... [22:28:02] and undeleting... [22:32:56] it worked and the site is still up! hooray [22:33:00] thanks again :) [22:34:04] !log mwscript deleteEqualMessages.php --wiki eswiki [22:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:37:25] !log mwscript deleteEqualMessages.php --wiki eswiki (T45917) [22:37:26] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [22:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:37:29] !log mwscript deleteEqualMessages.php --wiki eswiki --lang-code ca (T45917) [22:37:30] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [22:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:36] Operations, Continuous-Integration-Config, Patch-For-Review: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#2176878 (Krinkle) I'd like to offer an alternative to adding a Gemfile everywhere. Repositories should not have multiple test entry po... [23:05:22] (PS1) Krinkle: Remove inaccessible symlinks at /w/extensions and /w/skins [mediawiki-config] - https://gerrit.wikimedia.org/r/281379 (https://phabricator.wikimedia.org/T99096)