[00:23:30] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:27:30] PROBLEM - Swift HTTP backend on ms-fe2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:29:19] RECOVERY - Swift HTTP backend on ms-fe2002 is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 0.084 second response time [00:32:39] PROBLEM - cassandra CQL 10.64.32.160:9042 on restbase1004 is CRITICAL: Connection refused [00:45:50] PROBLEM - Check size of conntrack table on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:47:30] PROBLEM - Disk space on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:49:28] RECOVERY - Disk space on ms-be2019 is OK: DISK OK [00:49:30] RECOVERY - Check size of conntrack table on ms-be2019 is OK: OK: nf_conntrack is 0 % full [00:54:58] PROBLEM - DPKG on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:55:19] PROBLEM - Disk space on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:55:28] PROBLEM - configured eth on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:55:29] PROBLEM - Check size of conntrack table on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:58:58] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet has 54 failures [01:04:08] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:05:58] RECOVERY - Swift HTTP backend on ms-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 0.097 second response time [01:06:49] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 73994 MB (3% inode=99%) [01:16:29] PROBLEM - DPKG on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:16:40] PROBLEM - puppet last run on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:16:48] PROBLEM - RAID on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:16:58] PROBLEM - configured eth on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:17:10] PROBLEM - nutcracker process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:18:18] RECOVERY - DPKG on mw1002 is OK: All packages OK [01:18:29] RECOVERY - RAID on mw1002 is OK: OK: no RAID installed [01:18:30] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 10 minutes ago with 0 failures [01:18:40] RECOVERY - configured eth on mw1002 is OK: OK - interfaces up [01:18:58] RECOVERY - nutcracker process on mw1002 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [01:28:20] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:28:52] ACKNOWLEDGEMENT - cassandra CQL 10.64.32.160:9042 on restbase1004 is CRITICAL: Connection refused eevans Decommission complete This node will be down for the remaining of the weekend. - The acknowledgement expires at: 2015-12-21 01:28:02. [01:30:39] ACKNOWLEDGEMENT - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 62089 MB (3% inode=99%): gwicke Keeping an eye on this one. 62G left with a big compaction at 92% will likely make it. 
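The acknowledgement just above is a quick capacity judgement: restbase1008 has about 62 GB free on /srv, the running Cassandra compaction is already 92% done, and the old SSTables are only released once it finishes, so the remaining writes should fit. A back-of-the-envelope sketch of that arithmetic; the 500 GB compaction size is purely illustrative, and treating compaction output as roughly equal to its input is a simplification.

```python
def compaction_headroom_ok(free_gb, compaction_total_gb, percent_done, safety=1.1):
    """Will the rest of an in-progress compaction fit in the remaining free space?"""
    remaining_gb = compaction_total_gb * (1 - percent_done / 100.0)
    return free_gb >= remaining_gb * safety

# Illustrative figures: 62 GB free (from the alert), a hypothetical 500 GB compaction at 92%.
print(compaction_headroom_ok(62, 500, 92))  # True: ~40 GB left to write, comfortably under 62 GB
```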
[01:30:49] PROBLEM - NTP on ms-be2019 is CRITICAL: NTP CRITICAL: No response from NTP server [02:18:59] RECOVERY - Disk space on restbase1008 is OK: DISK OK [02:21:55] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 08m 59s) [02:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:49] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Dec 20 02:28:49 UTC 2015 (duration 6m 54s) [02:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:52:49] PROBLEM - puppet last run on mw2142 is CRITICAL: CRITICAL: Puppet has 1 failures [04:05:59] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100000000.0] [04:18:19] RECOVERY - puppet last run on mw2142 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [04:44:39] (03PS1) 10Gergő Tisza: [DO NOT MERGE] Switch FlaggedRevs to "flagged protection" mode on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260224 (https://phabricator.wikimedia.org/T121995) [05:14:19] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0] [05:28:40] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet has 60 failures [05:33:59] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: puppet fail [05:57:59] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:01:28] RECOVERY - puppet last run on mc2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:05:18] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:29:09] PROBLEM - nutcracker process on mw1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:29:19] PROBLEM - RAID on mw1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:29:28] PROBLEM - dhclient process on mw1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:29:29] PROBLEM - puppet last run on mw1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:29:48] PROBLEM - SSH on mw1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:29:58] PROBLEM - Disk space on mw1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:29:59] PROBLEM - salt-minion processes on mw1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:29] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:50] PROBLEM - DPKG on mw1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:58] PROBLEM - nutcracker port on mw1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:58] PROBLEM - configured eth on mw1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
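The labstore1003 alerts above are phrased as a fraction of recent datapoints above a threshold (critical at 100000000, recovery once fewer than 10% of points sit above 75000000). A minimal sketch of that "X% of data above the threshold" logic; the window, the 10% cut-off and the sample values are illustrative rather than the production check configuration.

```python
def saturation_state(datapoints, crit_threshold=100_000_000, crit_fraction=0.10):
    """Classify a window of samples by how many exceed the critical threshold."""
    values = [v for v in datapoints if v is not None]
    above = sum(1 for v in values if v > crit_threshold)
    fraction = above / len(values) if values else 0.0
    return ("CRITICAL" if fraction >= crit_fraction else "OK", fraction)

# Five illustrative samples, two of them above 100000000: 40% over threshold -> CRITICAL.
print(saturation_state([9.0e7, 1.2e8, 1.1e8, 8.0e7, 9.5e7]))  # ('CRITICAL', 0.4)
```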
[06:31:09] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:39] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:49] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:29] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 4 failures [06:33:38] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:40] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 3 failures [06:34:59] RECOVERY - nutcracker port on mw1001 is OK: TCP OK - 0.000 second response time on port 11212 [06:35:09] RECOVERY - nutcracker process on mw1001 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [06:35:19] RECOVERY - dhclient process on mw1001 is OK: PROCS OK: 0 processes with command name dhclient [06:35:48] RECOVERY - SSH on mw1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [06:36:39] RECOVERY - DPKG on mw1001 is OK: All packages OK [06:36:40] RECOVERY - configured eth on mw1001 is OK: OK - interfaces up [06:37:09] RECOVERY - RAID on mw1001 is OK: OK: no RAID installed [06:37:48] RECOVERY - Disk space on mw1001 is OK: DISK OK [06:37:49] RECOVERY - salt-minion processes on mw1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:54:58] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:19] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:56:58] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:56:59] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:56:59] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:57:00] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:57:09] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:57:59] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:30] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:07:00] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: puppet fail [07:32:19] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [07:51:29] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 1 failures [08:16:58] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:17:19] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: Puppet has 1 failures [08:44:39] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:00:21] !log powercycle ms-be2019, xfs lockup [09:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:03:28] RECOVERY - configured eth on ms-be2019 is OK: OK - interfaces up [09:03:28] RECOVERY - swift-container-replicator on 
ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:03:28] RECOVERY - Disk space on ms-be2019 is OK: DISK OK [09:03:29] RECOVERY - Check size of conntrack table on ms-be2019 is OK: OK: nf_conntrack is 0 % full [09:03:29] RECOVERY - swift-account-server on ms-be2019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:03:29] RECOVERY - swift-account-reaper on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:03:29] RECOVERY - swift-account-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:03:59] RECOVERY - RAID on ms-be2019 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [09:04:09] RECOVERY - swift-object-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [09:04:39] RECOVERY - swift-container-server on ms-be2019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:04:39] RECOVERY - swift-container-updater on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [09:04:39] RECOVERY - swift-account-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:04:39] RECOVERY - very high load average likely xfs on ms-be2019 is OK: OK - load average: 9.48, 4.07, 1.52 [09:04:39] RECOVERY - swift-object-server on ms-be2019 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:04:39] RECOVERY - dhclient process on ms-be2019 is OK: PROCS OK: 0 processes with command name dhclient [09:04:39] RECOVERY - swift-object-updater on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:04:40] RECOVERY - swift-object-auditor on ms-be2019 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [09:05:10] RECOVERY - SSH on ms-be2019 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [09:05:10] RECOVERY - swift-container-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:05:10] RECOVERY - DPKG on ms-be2019 is OK: All packages OK [09:05:10] RECOVERY - salt-minion processes on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:52:13] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1893351 (10Nemo_bis) [10:35:08] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 870 [10:40:08] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1000 [10:45:09] RECOVERY - check_mysql on db1008 is OK: Uptime: 154151 Threads: 95 Questions: 8263215 Slow queries: 1648 Opens: 2061 Flush tables: 2 Open tables: 400 Queries per second avg: 53.604 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [12:46:59] (03PS1) 10Southparkfan: Correctly order MediaWiki servers in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/260232 [12:54:19] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: puppet fail [13:21:40] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 
0 failures [13:28:49] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection timed out [13:29:19] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [13:30:49] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2016-06-30 17:56:02 +0000 (expires in 193 days) [13:43:01] 6operations, 10Traffic, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1893571 (10Aklapper) [13:54:40] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:50:59] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:58] PROBLEM - puppet last run on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:54:39] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 65334 bytes in 0.128 second response time [14:54:39] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [14:58:49] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:39] PROBLEM - nutcracker process on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:59:39] PROBLEM - RAID on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:59:59] PROBLEM - DPKG on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:59:59] PROBLEM - HHVM processes on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:00:00] PROBLEM - salt-minion processes on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:00:09] PROBLEM - Check size of conntrack table on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:00:39] PROBLEM - configured eth on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:00:49] PROBLEM - puppet last run on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:00:58] PROBLEM - Disk space on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:08] PROBLEM - nutcracker port on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:18] PROBLEM - Apache HTTP on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:01:38] RECOVERY - RAID on mw1228 is OK: OK: no RAID installed [15:01:38] RECOVERY - nutcracker process on mw1228 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:01:58] RECOVERY - HHVM processes on mw1228 is OK: PROCS OK: 6 processes with command name hhvm [15:01:59] RECOVERY - DPKG on mw1228 is OK: All packages OK [15:01:59] RECOVERY - salt-minion processes on mw1228 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:02:10] RECOVERY - Check size of conntrack table on mw1228 is OK: OK: nf_conntrack is 0 % full [15:02:38] RECOVERY - configured eth on mw1228 is OK: OK - interfaces up [15:02:40] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures [15:02:49] RECOVERY - Disk space on mw1228 is OK: DISK OK [15:02:59] RECOVERY - nutcracker port on mw1228 is OK: TCP OK - 0.000 second response time on port 11212 [15:08:04] did everything just die? 
https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes
[15:08:38] bblack, mark ^
[15:08:53] yurik: like what die?
[15:08:58] aude, https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes
[15:09:03] all status codes went to 0
[15:09:26] but it seems that wiki is still ok, so my guess that the reporting stuff died somehow
[15:09:35] aggregation died
[15:09:37] ?
[15:11:04] i guess so
[15:13:38] PROBLEM - HHVM processes on mw1228 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[15:14:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[15:15:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[15:23:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[15:38:56] (PS1) Reedy: Remove wgArticlePath from InitialiseSettings as it's in CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/260242
[15:41:30] !log reedy@tin Purged l10n cache for 1.27.0-wmf.7
[15:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:42:06] !log mw1228 reporting readonly fs
[15:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:43:54] operations: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1893636 (Reedy) NEW
[15:45:46] operations: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1893651 (Reedy) ``` The authenticity of host 'mw1228.eqiad.wmnet ()' can't be established. ECDSA key fingerprint is SHA256:rdCU2vs6Jctc96R4kDXnIdrhl0DaKizk8ctz0vTcr2M. Are you sure you wa...
[15:46:54] operations: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1893652 (Reedy) ``` [14:50:59] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:06] --> govg (~govg@unaffiliated/govg) has joined #wikimedia-operations [14...
[15:50:20] !log reedy@tin Purged l10n cache for 1.27.0-wmf.6 (hanging due to mw1228 issue)
[15:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:50:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:50:46] Wonder what versions of MW can be deleted
[15:51:08] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:51:34] Reedy: Everything beyond 1.5 was bloat... so... *hides*
[15:51:53] hoo: I mean from tin :P
[15:53:50] !log reedy@tin Synchronized README: noop (duration: 00m 32s)
[15:53:53] Didn't we have that (maybe not 100% serious) page which asked to downgrade Wikipedia to 1.5 (or something along these lines)?
[15:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:54:24] I think we've had a few
[15:54:25] :D
[16:15:09] PROBLEM - Disk space on restbase1003 is CRITICAL: DISK CRITICAL - free space: /var 110978 MB (3% inode=99%)
[16:21:35] operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1893687 (Southparkfan) mw1061, mw1083, mw1161 and mw1169 are showing (normal) activity again, so that's good. mw1118, mw1141 and mw1196 still seem idle, and mw1228 is idle per a (likely) broken disk (T122005) s...
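The exchange above ("all status codes went to 0") is really about telling a dead metrics pipeline apart from a real outage: the wikis still respond, so the aggregation side is the suspect. A sketch of the kind of sanity check that pulls the raw series from Graphite's render API and looks for a flat zero/None tail; the Graphite host and the metric name are assumptions, not the exact series behind that Grafana dashboard.

```python
import json
from urllib.request import urlopen

def recent_values(metric, graphite="https://graphite.wikimedia.org", minutes=15):
    """Fetch the last few datapoints of a Graphite series as a flat list of values."""
    url = f"{graphite}/render?target={metric}&from=-{minutes}min&format=json"
    with urlopen(url, timeout=10) as resp:
        series = json.load(resp)
    return [value for value, _ts in series[0]["datapoints"]] if series else []

# Hypothetical metric name standing in for the aggregate client status code series.
values = recent_values("varnish.clients.status_codes.5xx.sum")
if values and all(v in (None, 0) for v in values):
    print("series is flat at zero/None: suspect the metrics pipeline, not the site")
```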
[16:26:49] PROBLEM - Disk space on restbase1003 is CRITICAL: DISK CRITICAL - free space: /var 111502 MB (3% inode=99%)
[16:54:02] (CR) Hoo man: [C: -1] "No longer relevant (code should be developed in operations/dumps/dcat now)" [puppet] - https://gerrit.wikimedia.org/r/251492 (https://phabricator.wikimedia.org/T117533) (owner: Lokal Profil)
[16:54:15] (CR) Hoo man: [C: -1] "No longer relevant (code should be developed in operations/dumps/dcat now)" [puppet] - https://gerrit.wikimedia.org/r/251493 (owner: Lokal Profil)
[17:00:44] (PS3) Hoo man: snapshot: mv wikidatadumps classes to autoloader layout [puppet] - https://gerrit.wikimedia.org/r/260186 (owner: Dzahn)
[17:10:48] PROBLEM - HTTPS on magnesium is CRITICAL: SSL CRITICAL - Certificate rt.wikimedia.org valid until 2016-01-09 09:48:57 +0000 (expires in 19 days)
[17:11:38] !log depool mw1228, reported ro fs
[17:11:41] Reedy: ^
[17:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:16:04] (PS1) Hoo man: snapshot: Deploy DCAT from operations/dumps/dcat [puppet] - https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932)
[17:16:41] (CR) Hoo man: "Note: I reviewed the changes between the version deployed before and the master of operations/dumps/dcat." [puppet] - https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932) (owner: Hoo man)
[17:24:34] godog: Thanks!
[17:24:45] I filed a task about it as ssh is down, so that should be good for now
[17:25:07] (PS5) Andrew Bogott: WIP: Set up special dhcp behavior for bare-metal boxes [puppet] - https://gerrit.wikimedia.org/r/259787
[17:25:58] godog: Wonder if it should go from the dsh lists for now too
[17:32:08] operations: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1893735 (Reedy) godog has now depooled it. Still needs investigation
[17:33:24] Reedy: yeah good idea
[17:33:46] are the dsh lists made dynamically now?
[17:34:01] quick glance and I couldn't see it in the puppet repo
[17:34:48] yeah it is in mediawiki-installation isn't it?
[17:35:02] yeah, that's the list
[17:36:13] (PS1) Filippo Giunchedi: scap: mw1228 reported ro fs [puppet] - https://gerrit.wikimedia.org/r/260251 (https://phabricator.wikimedia.org/T122005)
[17:36:30] (CR) Filippo Giunchedi: [C: 2 V: 2] scap: mw1228 reported ro fs [puppet] - https://gerrit.wikimedia.org/r/260251 (https://phabricator.wikimedia.org/T122005) (owner: Filippo Giunchedi)
[17:36:58] {{done}}
[17:47:38] thanks! :D
[17:47:40] !log reedy@tin Purged l10n cache for 1.27.0-wmf.6
[17:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:59:59] (PS5) Andrew Bogott: WIP: nova-network: have dnsmasq advertise the network host as a tftp server [puppet] - https://gerrit.wikimedia.org/r/259788
[18:00:01] (PS6) Andrew Bogott: Set up special dhcp behavior for bare-metal boxes [puppet] - https://gerrit.wikimedia.org/r/259787
[18:00:03] (PS5) Andrew Bogott: Insert dns entries for labs bare-metal systems. [puppet] - https://gerrit.wikimedia.org/r/260037
[18:13:18] (PS6) Andrew Bogott: WIP: nova-network: have dnsmasq advertise the network host as a tftp server [puppet] - https://gerrit.wikimedia.org/r/259788
[18:13:20] (PS7) Andrew Bogott: Set up special dhcp behavior for bare-metal boxes [puppet] - https://gerrit.wikimedia.org/r/259787
[18:13:22] (PS6) Andrew Bogott: Insert dns entries for labs bare-metal systems. [puppet] - https://gerrit.wikimedia.org/r/260037
[18:16:29] PROBLEM - mediawiki-installation DSH group on mw1228 is CRITICAL: Host mw1228 is not in mediawiki-installation dsh group
[18:20:38] hm, did I break jenkins?
[18:31:21] !log restarting stuck Jenkins
[18:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:34:25] godog: we broke icinga-wm :P
[18:43:00] (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/259787 (owner: Andrew Bogott)
[18:46:14] (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/260037 (owner: Andrew Bogott)
[18:46:55] !log graceful restart of zuul as per https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Restart
[18:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:48:52] (CR) Andrew Bogott: [C: 2] Set up special dhcp behavior for bare-metal boxes [puppet] - https://gerrit.wikimedia.org/r/259787 (owner: Andrew Bogott)
[18:53:04] (PS1) Andrew Bogott: Revert "Set up special dhcp behavior for bare-metal boxes" [puppet] - https://gerrit.wikimedia.org/r/260252
[18:53:30] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: puppet fail
[18:53:39] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: puppet fail
[18:53:48] PROBLEM - puppet last run on restbase1002 is CRITICAL: CRITICAL: puppet fail
[18:53:59] PROBLEM - puppet last run on db2001 is CRITICAL: CRITICAL: puppet fail
[18:54:10] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: puppet fail
[18:54:28] PROBLEM - puppet last run on wtp1013 is CRITICAL: CRITICAL: puppet fail
[18:54:29] PROBLEM - puppet last run on wtp2009 is CRITICAL: CRITICAL: puppet fail
[18:54:30] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: puppet fail
[18:54:38] (CR) Andrew Bogott: [C: 2] Revert "Set up special dhcp behavior for bare-metal boxes" [puppet] - https://gerrit.wikimedia.org/r/260252 (owner: Andrew Bogott)
[18:54:40] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: puppet fail
[18:54:49] PROBLEM - puppet last run on wtp1002 is CRITICAL: CRITICAL: puppet fail
[18:54:49] PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: puppet fail
[18:54:50] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: puppet fail
[18:54:58] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: puppet fail
[18:54:59] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: puppet fail
[18:55:09] PROBLEM - puppet last run on mw2027 is CRITICAL: CRITICAL: puppet fail
[18:55:09] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: puppet fail
[18:55:10] PROBLEM - puppet last run on mw2068 is CRITICAL: CRITICAL: puppet fail
[18:55:30] PROBLEM - puppet last run on db1010 is CRITICAL: CRITICAL: puppet fail
[18:55:38] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: puppet fail
[18:55:38] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: puppet fail
[18:55:39] PROBLEM - puppet last run on magnesium is CRITICAL: CRITICAL: puppet fail
[18:55:49] PROBLEM -
puppet last run on ms-be1004 is CRITICAL: CRITICAL: puppet fail [18:55:55] hm? [18:55:58] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: puppet fail [18:56:00] PROBLEM - puppet last run on mw2026 is CRITICAL: CRITICAL: puppet fail [18:56:19] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: puppet fail [18:56:28] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: puppet fail [18:56:29] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: puppet fail [18:56:31] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: puppet fail [18:56:31] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: puppet fail [18:56:38] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: puppet fail [18:56:39] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: puppet fail [18:56:48] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [18:56:49] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: puppet fail [18:56:58] PROBLEM - puppet last run on graphite1002 is CRITICAL: CRITICAL: puppet fail [18:56:59] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: puppet fail [18:57:08] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: puppet fail [18:57:08] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: puppet fail [18:57:18] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: puppet fail [18:57:18] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [18:57:18] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [18:57:19] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: puppet fail [18:57:19] PROBLEM - puppet last run on wdqs1002 is CRITICAL: CRITICAL: puppet fail [18:57:28] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: puppet fail [18:57:39] PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: puppet fail [18:57:39] PROBLEM - puppet last run on mw2057 is CRITICAL: CRITICAL: puppet fail [18:57:41] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: puppet fail [18:57:41] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: puppet fail [18:57:48] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [18:57:49] PROBLEM - puppet last run on mw1093 is CRITICAL: CRITICAL: puppet fail [18:57:58] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: puppet fail [18:57:58] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: puppet fail [18:57:58] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: puppet fail [18:57:59] PROBLEM - puppet last run on es2005 is CRITICAL: CRITICAL: puppet fail [18:57:59] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: puppet fail [18:58:00] PROBLEM - puppet last run on ganeti2003 is CRITICAL: CRITICAL: puppet fail [18:58:00] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: puppet fail [18:58:08] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: puppet fail [18:58:08] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: puppet fail [18:58:10] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: puppet fail [18:58:10] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: puppet fail [18:58:18] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: puppet fail [18:58:18] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: puppet fail [18:58:19] PROBLEM - puppet last run on mc1018 is CRITICAL: CRITICAL: puppet fail [18:58:29] PROBLEM - puppet last run on mw1220 is 
CRITICAL: CRITICAL: puppet fail [18:58:30] PROBLEM - puppet last run on mc2001 is CRITICAL: CRITICAL: puppet fail [18:58:31] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: puppet fail [18:58:31] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: puppet fail [18:58:31] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: puppet fail [18:58:31] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: puppet fail [18:58:31] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: puppet fail [18:58:32] PROBLEM - puppet last run on elastic1012 is CRITICAL: CRITICAL: puppet fail [18:58:32] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: puppet fail [18:58:32] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:58:38] PROBLEM - puppet last run on mw1090 is CRITICAL: CRITICAL: puppet fail [18:58:39] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: puppet fail [18:58:39] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: puppet fail [18:58:39] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: puppet fail [18:58:59] PROBLEM - puppet last run on mc1002 is CRITICAL: CRITICAL: puppet fail [18:58:59] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: puppet fail [18:58:59] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: puppet fail [18:59:08] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: puppet fail [18:59:18] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail [18:59:19] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [18:59:29] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: puppet fail [18:59:39] PROBLEM - puppet last run on sca1002 is CRITICAL: CRITICAL: puppet fail [18:59:39] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: puppet fail [18:59:39] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: puppet fail [18:59:49] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: puppet fail [18:59:49] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: puppet fail [18:59:59] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: puppet fail [18:59:59] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: puppet fail [19:00:00] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: puppet fail [19:00:08] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: puppet fail [19:00:09] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: puppet fail [19:00:18] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail [19:00:19] PROBLEM - puppet last run on wtp1010 is CRITICAL: CRITICAL: puppet fail [19:00:19] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: puppet fail [19:00:28] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: puppet fail [19:00:29] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: puppet fail [19:00:38] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: puppet fail [19:00:39] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: puppet fail [19:01:00] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [19:01:10] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail [19:01:10] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: puppet fail [19:01:10] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: puppet fail [19:01:18] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: puppet fail [19:01:19] 
PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: puppet fail [19:01:28] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [19:01:29] Has apache died on the puppetmaster again? [19:01:59] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: puppet fail [19:03:08] Reedy: could be, although config-master.wm.o still seems up [19:03:14] Reedy: whatever it is seems to have passed [19:03:40] Reedy: oh, I know what it is — I merged a syntax error in hiera and then reverted. [19:03:43] Jenkins didn’t catch it, oddly [19:04:18] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:17:39] RECOVERY - puppet last run on restbase1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:19:58] RECOVERY - puppet last run on db2001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:20:38] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:20:40] RECOVERY - puppet last run on wtp1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:20:40] RECOVERY - puppet last run on wdqs1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:20:49] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:21:19] RECOVERY - puppet last run on db1010 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:21:28] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [19:21:29] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:22:08] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:22:18] RECOVERY - puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:22:19] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:22:19] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:22:20] RECOVERY - puppet last run on wtp2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:22:48] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [19:22:49] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:22:58] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:22:59] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:08] RECOVERY - puppet last run on mw2027 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [19:23:09] RECOVERY - puppet last run on mw2068 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [19:23:09] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:09] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:10] RECOVERY - puppet last 
run on wdqs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:19] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:28] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:29] RECOVERY - puppet last run on magnesium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:29] RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:23:29] RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:23:38] RECOVERY - puppet last run on mw2057 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [19:23:40] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:49] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:23:49] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:23:50] RECOVERY - puppet last run on es2005 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:23:58] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:23:59] RECOVERY - puppet last run on mw2026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:59] RECOVERY - puppet last run on ganeti2003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:23:59] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:08] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:24:09] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:24:09] RECOVERY - puppet last run on mc1018 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:24:19] RECOVERY - puppet last run on mc2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:24:19] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:24:20] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [19:24:28] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:24:28] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:29] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:38] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:40] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:24:40] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:49] RECOVERY - puppet last run on mc1002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures 
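The burst of "puppet fail" alerts and the recoveries around this point are explained a few lines up at [19:03:40]: a hiera change with a syntax error was merged and then reverted, and Jenkins did not flag it. A parse error of that kind can be caught with a plain YAML load before merge; a minimal sketch assuming PyYAML is available, with an illustrative hieradata glob rather than the actual CI job.

```python
import glob
import sys

import yaml  # PyYAML, assumed to be installed

def yaml_errors(paths):
    """Try to parse each file, collecting (path, first line of the error) for failures."""
    bad = []
    for path in paths:
        try:
            with open(path) as fh:
                yaml.safe_load(fh)
        except yaml.YAMLError as err:
            bad.append((path, str(err).splitlines()[0]))
    return bad

errors = yaml_errors(glob.glob("hieradata/**/*.yaml", recursive=True))
for path, message in errors:
    print(f"{path}: {message}")
sys.exit(1 if errors else 0)
```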
[19:24:49] RECOVERY - puppet last run on graphite1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:24:58] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:58] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:24:59] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:25:09] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:09] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:10] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:25:10] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:25:10] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:19] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:25:28] RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:25:39] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:25:40] RECOVERY - puppet last run on mw1093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:48] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [19:25:49] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:49] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:49] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:50] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [19:25:59] RECOVERY - puppet last run on rdb2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:59] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:00] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:00] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [19:26:09] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:18] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [19:26:18] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:26:19] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:19] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:20] RECOVERY - puppet last run on elastic1012 is OK: OK: Puppet is currently enabled, last run 1 
minute ago with 0 failures [19:26:28] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:26:28] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:28] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:26:28] RECOVERY - puppet last run on mw1090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:38] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:26:38] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:26:49] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [19:26:50] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:26:50] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:59] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:27:09] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:27:29] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:27:38] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:27:38] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:27:40] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:27:49] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:27:59] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:28:00] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:28:09] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:28:19] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:28:19] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:28:29] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:28:38] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:28:49] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [19:28:50] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:28:58] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [19:29:00] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [19:29:49] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last 
run 1 minute ago with 0 failures
[19:59:17] (PS7) Andrew Bogott: WIP: nova-network: have dnsmasq advertise the network host as a tftp server [puppet] - https://gerrit.wikimedia.org/r/259788
[19:59:19] (PS7) Andrew Bogott: Insert dns entries for labs bare-metal systems. [puppet] - https://gerrit.wikimedia.org/r/260037
[19:59:21] (PS1) Andrew Bogott: Set up special dhcp behavior for bare-metal boxes [puppet] - https://gerrit.wikimedia.org/r/260253
[20:01:11] (PS8) Andrew Bogott: WIP: nova-network: have dnsmasq advertise the network host as a tftp server [puppet] - https://gerrit.wikimedia.org/r/259788
[20:01:13] (PS2) Andrew Bogott: Set up special dhcp behavior for bare-metal boxes [puppet] - https://gerrit.wikimedia.org/r/260253
[20:01:15] (PS8) Andrew Bogott: Insert dns entries for labs bare-metal systems. [puppet] - https://gerrit.wikimedia.org/r/260037
[20:01:17] (CR) jenkins-bot: [V: -1] Set up special dhcp behavior for bare-metal boxes [puppet] - https://gerrit.wikimedia.org/r/260253 (owner: Andrew Bogott)
[20:01:44] (CR) jenkins-bot: [V: -1] WIP: nova-network: have dnsmasq advertise the network host as a tftp server [puppet] - https://gerrit.wikimedia.org/r/259788 (owner: Andrew Bogott)
[20:03:02] (CR) Andrew Bogott: [C: 2] Set up special dhcp behavior for bare-metal boxes [puppet] - https://gerrit.wikimedia.org/r/260253 (owner: Andrew Bogott)
[20:25:59] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[20:29:33] (CR) Alex Monk: [C: 1] "seems fine to me on deployment-bastion" [mediawiki-config] - https://gerrit.wikimedia.org/r/260242 (owner: Reedy)
[20:30:24] (CR) Reedy: "In theory, it should be removed from CommonSettings, maybe. But there'll be a reason it's in both (it didn't work in one?)" [mediawiki-config] - https://gerrit.wikimedia.org/r/260242 (owner: Reedy)
[20:34:39] RECOVERY - RAID on dataset1001 is OK: OK: optimal, 3 logical, 36 physical
[21:19:59] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down!
[21:20:00] PROBLEM - PyBal backends health check on lvs1008 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1001.eqiad.wmnet because of too many down!
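The two PyBal alerts just above show its safety valve: a backend that fails its health check is only depooled if enough of the pool would still be up afterwards, otherwise PyBal keeps it pooled and reports "too many down". A small sketch of that depool-threshold idea; the pool size, the count already down and the 0.5 ratio are illustrative, not PyBal's actual configuration or code.

```python
def can_depool(pool_size, currently_down, depool_threshold=0.5):
    """Allow a depool only if at least depool_threshold of the pool would remain up."""
    up_after = pool_size - currently_down - 1   # one more backend would be taken out
    return up_after >= pool_size * depool_threshold

# ocg_8000 is a small pool; with, say, 3 backends and 1 already down,
# removing another would leave only 1 of 3 up, below a 0.5 threshold.
print(can_depool(pool_size=3, currently_down=1))  # False -> "too many down"
```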
[21:23:49] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [21:23:58] RECOVERY - PyBal backends health check on lvs1008 is OK: PYBAL OK - All pools are healthy [21:56:40] PROBLEM - Disk space on restbase1003 is CRITICAL: DISK CRITICAL - free space: /var 111411 MB (3% inode=99%) [22:00:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:05:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:10:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:15:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:16:15] frack [22:16:35] codfw [22:20:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:25:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:30:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:35:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:40:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:45:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:50:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [22:55:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:00:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:05:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:10:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:15:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:20:10] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:24:25] !log Katie and Jeff paged about bellatrix [23:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:07] keeping an eye on the restbase node's disk space [23:25:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:25:41] 6operations, 10fundraising-tech-ops: bellatrix raid predictive failure - https://phabricator.wikimedia.org/T122026#1894061 (10Reedy) 3NEW [23:28:18] PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: puppet fail [23:30:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 
16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:35:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:40:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:45:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:50:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:55:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] [23:55:39] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
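The repeating bellatrix alert above carries the whole story in its status string: the P420i logical drive is still OK, but physical drive 1I:1:12 reports "Predictive Failure", so the array keeps serving while a disk asks to be replaced. A sketch that pulls the failing components out of output shaped like that check's; the parsing is based only on the alert text shown here, not on the actual check_raid plugin.

```python
import re

def predictive_failures(hpsa_status):
    """Extract component names flagged as 'Predictive Failure' from an HPSA summary string."""
    inside = re.search(r"HPSA \[(.*)\]", hpsa_status)
    if not inside:
        return []
    parts = [p.strip() for p in inside.group(1).split(",")]
    return [p.rsplit(":", 1)[0].strip() for p in parts if p.endswith("Predictive Failure")]

status = ("CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, "
          "phy_1I:1:12: Predictive Failure]")
print(predictive_failures(status))  # ['phy_1I:1:12']
```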