[00:07:16] (03PS2) 10Tim Starling: Move idiosyncratic gdbinit to /home/ori [puppet] - 10https://gerrit.wikimedia.org/r/176307 [00:08:37] (03CR) 10Tim Starling: "Updated commit message since the previous commit message was elliptical." [puppet] - 10https://gerrit.wikimedia.org/r/176307 (owner: 10Tim Starling) [00:25:23] (03CR) 10Ori.livneh: [C: 031] "You'll need to either amend the change to ensure => absent the global file, or you'll need to clean it up manually. You can clean it up ma" [puppet] - 10https://gerrit.wikimedia.org/r/176307 (owner: 10Tim Starling) [00:49:52] (03PS2) 10Tim Starling: xhprof production profiling hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174372 [00:50:24] (03CR) 10Tim Starling: [C: 032] xhprof production profiling hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174372 (owner: 10Tim Starling) [00:50:35] (03Merged) 10jenkins-bot: xhprof production profiling hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174372 (owner: 10Tim Starling) [00:51:54] !log tstarling Synchronized wmf-config/StartProfiler.php: (no message) (duration: 00m 05s) [00:52:00] Logged the message, Master [00:54:37] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: /srv 18587 MB (3% inode=97%): [00:55:56] PROBLEM - HTTPS on antimony is CRITICAL: SSL_CERT CRITICAL svn.wikimedia.org: certificate will expire on Jan 31 10:53:05 2015 GMT [00:57:03] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 578 seconds [00:58:52] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 677 seconds [01:03:18] PROBLEM - HHVM busy threads on mw1232 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:04:37] PROBLEM - HHVM busy threads on mw1222 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2] [01:05:37] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:06:59] PROBLEM - HHVM busy threads on mw1235 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2] [01:07:06] PROBLEM - HHVM busy threads on mw1233 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2] [01:07:06] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [01:08:06] PROBLEM - HHVM busy threads on mw1227 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [115.2] [01:08:56] RECOVERY - HHVM busy threads on mw1232 is OK: OK: Less than 1.00% above the threshold [76.8] [01:10:05] RECOVERY - HHVM busy threads on mw1235 is OK: OK: Less than 1.00% above the threshold [76.8] [01:10:05] RECOVERY - HHVM busy threads on mw1233 is OK: OK: Less than 1.00% above the threshold [76.8] [01:10:16] RECOVERY - HHVM busy threads on mw1222 is OK: OK: Less than 1.00% above the threshold [76.8] [01:10:17] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds [01:11:24] RECOVERY - HHVM busy threads on mw1230 is OK: OK: Less than 1.00% above the threshold [76.8] [01:11:26] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [01:12:45] RECOVERY - HHVM busy threads on mw1223 is OK: OK: Less than 1.00% above the threshold [76.8] [01:13:57] RECOVERY - HHVM busy threads on mw1227 is OK: OK: Less than 1.00% above the threshold [76.8] [01:32:18] (03PS1) 10Ori.livneh: Provision HHVM source tree in /usr/src instead of /usr/local/src [puppet] - 10https://gerrit.wikimedia.org/r/176624 [01:33:48] 
(03CR) 10Ori.livneh: "Follow-up change: Id1d40d5cb" [puppet] - 10https://gerrit.wikimedia.org/r/176307 (owner: 10Tim Starling) [01:49:18] (03CR) 10Aude: [C: 031] Extra language names configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610 (owner: 10Dereckson) [02:10:30] !log l10nupdate Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 00m 01s) [02:10:34] !log LocalisationUpdate completed (1.25wmf9) at 2014-12-01 02:10:33+00:00 [02:10:40] Logged the message, Master [02:10:42] Logged the message, Master [02:17:53] !log l10nupdate Synchronized php-1.25wmf10/cache/l10n: (no message) (duration: 00m 03s) [02:17:56] Logged the message, Master [02:17:56] !log LocalisationUpdate completed (1.25wmf10) at 2014-12-01 02:17:56+00:00 [02:17:59] Logged the message, Master [03:34:57] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Dec 1 03:34:57 UTC 2014 (duration 34m 56s) [03:35:04] Logged the message, Master [03:52:14] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2] [03:54:54] RECOVERY - HHVM busy threads on mw1230 is OK: OK: Less than 1.00% above the threshold [76.8] [04:06:04] (03CR) 10KartikMistry: [C: 031] Extra language names configuration for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176610 (owner: 10Dereckson) [04:24:57] (03PS6) 10KartikMistry: Add ContentTranslation in wikishared DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175979 [04:26:52] (03Abandoned) 10Ori.livneh: Add site-list.json for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174885 (owner: 10Ori.livneh) [04:36:14] _joe_: can we make sure all hhvm servers are running the same package today? i see that 38 are running 3.3.0+dfsg1-1+wm4 and another 21 are running 3.3.0+dfsg1-1+wm3.1 [04:44:23] !log tstarling Synchronized php-1.25wmf9/includes/parser/MWTidy.php: change previously pulled but scap was apparently not run (duration: 00m 06s) [04:44:25] Logged the message, Master [04:46:26] !log tstarling Synchronized php-1.25wmf10/includes/parser/MWTidy.php: change previously pulled but scap was apparently not run (duration: 00m 05s) [04:46:28] Logged the message, Master [05:28:33] (03Abandoned) 10Tim Landscheidt: Set up redirects for toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [05:31:36] (03Abandoned) 10Tim Landscheidt: Script to test Toolserver redirects [software] - 10https://gerrit.wikimedia.org/r/108467 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [05:43:23] <_joe_> ori: yep [05:43:41] <_joe_> the older ones are running wm3.1 probably, [06:34:10] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [06:34:39] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:40] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:59] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:02] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [06:35:38] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures [06:43:41] PROBLEM - puppet last run on db1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:01] PROBLEM - puppet last run on mw1037 is CRITICAL: CRITICAL: Puppet has 1 failures [06:47:34] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [06:47:48] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:02] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:05] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:14] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:49:03] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:53:35] RECOVERY - puppet last run on db1006 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:57:52] RECOVERY - puppet last run on mw1037 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:57] (03PS1) 10Legoktm: extdist: Add composer location to config [puppet] - 10https://gerrit.wikimedia.org/r/176631 [07:17:54] (03CR) 10Legoktm: "I26b1a2710d25a90a71d9bebd86e3c447793d8567 is the labs/tools/extdist change to use this." [puppet] - 10https://gerrit.wikimedia.org/r/176631 (owner: 10Legoktm) [09:14:57] (03PS3) 10Giuseppe Lavagetto: hiera: role-based backend, role keyword [puppet] - 10https://gerrit.wikimedia.org/r/176334 [09:22:44] (03CR) 10Hashar: [C: 031] Remove -hhvm suffix from beta multiversion config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173486 (owner: 10Reedy) [09:30:56] <_joe_> !log upgrading hhvm to the latest version across the cluster [09:31:05] Logged the message, Master [09:55:47] <_joe_> !log reimaging mw1033-mw1040 to HHVM, depooling from the main pool now [09:55:51] Logged the message, Master [10:07:58] PROBLEM - Host mw1233 is DOWN: PING CRITICAL - Packet loss = 100% [10:08:29] PROBLEM - Host mw1234 is DOWN: PING CRITICAL - Packet loss = 100% [10:08:32] RECOVERY - Host mw1233 is UP: PING WARNING - Packet loss = 50%, RTA = 1.04 ms [10:10:13] RECOVERY - Host mw1234 is UP: PING OK - Packet loss = 0%, RTA = 2.01 ms [10:10:19] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 1 failures [10:11:59] PROBLEM - nutcracker process on mw1233 is CRITICAL: Connection refused by host [10:12:09] <_joe_> and I scheduled downtime... 
[10:12:24] PROBLEM - check configured eth on mw1233 is CRITICAL: Connection refused by host [10:12:39] PROBLEM - HHVM processes on mw1233 is CRITICAL: Connection refused by host [10:12:39] PROBLEM - nutcracker port on mw1233 is CRITICAL: Connection refused by host [10:14:29] PROBLEM - check if salt-minion is running on mw1233 is CRITICAL: Connection refused by host [10:14:32] PROBLEM - DPKG on mw1233 is CRITICAL: Connection refused by host [10:15:22] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 1 failures [10:16:10] PROBLEM - Disk space on mw1233 is CRITICAL: Connection refused by host [10:16:19] PROBLEM - RAID on mw1233 is CRITICAL: Connection refused by host [10:16:20] PROBLEM - check if dhclient is running on mw1233 is CRITICAL: Connection refused by host [10:16:41] PROBLEM - check if salt-minion is running on mw1234 is CRITICAL: Connection refused by host [10:16:42] PROBLEM - Apache HTTP on mw1233 is CRITICAL: Connection refused [10:16:45] PROBLEM - Disk space on mw1234 is CRITICAL: Connection refused by host [10:16:46] PROBLEM - HHVM processes on mw1235 is CRITICAL: Connection refused by host [10:16:52] PROBLEM - Host mw1236 is DOWN: PING CRITICAL - Packet loss = 100% [10:16:53] PROBLEM - puppet last run on mw1233 is CRITICAL: Connection refused by host [10:17:00] PROBLEM - HHVM rendering on mw1233 is CRITICAL: Connection refused [10:17:12] PROBLEM - HHVM rendering on mw1234 is CRITICAL: Connection refused [10:17:12] RECOVERY - Host mw1236 is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [10:17:12] PROBLEM - SSH on mw1233 is CRITICAL: Connection refused [10:17:20] PROBLEM - Apache HTTP on mw1234 is CRITICAL: Connection refused [10:17:52] PROBLEM - check configured eth on mw1234 is CRITICAL: Connection refused by host [10:17:53] PROBLEM - nutcracker process on mw1234 is CRITICAL: Connection refused by host [10:17:53] PROBLEM - DPKG on mw1234 is CRITICAL: Connection refused by host [10:18:18] PROBLEM - SSH on mw1234 is CRITICAL: Connection refused [10:18:25] PROBLEM - RAID on mw1234 is CRITICAL: Connection refused by host [10:18:25] PROBLEM - nutcracker port on mw1234 is CRITICAL: Connection refused by host [10:18:25] PROBLEM - HHVM processes on mw1234 is CRITICAL: Connection refused by host [10:18:46] PROBLEM - puppet last run on mw1234 is CRITICAL: Connection refused by host [10:19:17] PROBLEM - check if dhclient is running on mw1234 is CRITICAL: Connection refused by host [10:19:55] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [10:20:08] PROBLEM - HHVM rendering on mw1235 is CRITICAL: Connection refused [10:20:08] RECOVERY - SSH on mw1233 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [10:20:18] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 1 failures [10:20:35] PROBLEM - HHVM processes on mw1236 is CRITICAL: Connection refused by host [10:20:36] PROBLEM - RAID on mw1235 is CRITICAL: Connection refused by host [10:20:49] PROBLEM - nutcracker port on mw1235 is CRITICAL: Connection refused by host [10:20:49] PROBLEM - check configured eth on mw1235 is CRITICAL: Connection refused by host [10:21:06] PROBLEM - check if salt-minion is running on mw1236 is CRITICAL: Connection refused by host [10:21:11] PROBLEM - SSH on mw1235 is CRITICAL: Connection refused [10:21:11] PROBLEM - puppet last run on mw1235 is CRITICAL: Connection refused by host [10:21:19] PROBLEM - HHVM busy threads on mw1227 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold 
[115.2] [10:22:22] PROBLEM - HHVM busy threads on mw1224 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [10:23:00] PROBLEM - HHVM queue size on mw1224 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [80.0] [10:23:00] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [115.2] [10:23:12] PROBLEM - HHVM busy threads on mw1222 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [115.2] [10:23:53] PROBLEM - HHVM queue size on mw1222 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [80.0] [10:24:04] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 1.00% above the threshold [76.8] [10:25:19] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 1 failures [10:25:44] RECOVERY - HHVM queue size on mw1224 is OK: OK: Less than 1.00% above the threshold [10.0] [10:25:45] RECOVERY - HHVM busy threads on mw1227 is OK: OK: Less than 1.00% above the threshold [76.8] [10:26:25] RECOVERY - HHVM queue size on mw1222 is OK: OK: Less than 1.00% above the threshold [10.0] [10:28:35] RECOVERY - HHVM busy threads on mw1223 is OK: OK: Less than 1.00% above the threshold [76.8] [10:28:54] RECOVERY - HHVM busy threads on mw1222 is OK: OK: Less than 1.00% above the threshold [76.8] [10:30:21] RECOVERY - HHVM busy threads on mw1224 is OK: OK: Less than 1.00% above the threshold [76.8] [10:30:25] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 121 seconds ago with 0 failures [10:46:07] PROBLEM - DPKG on mw1233 is CRITICAL: Connection refused by host [10:46:07] PROBLEM - puppet last run on mw1233 is CRITICAL: Connection refused by host [10:46:46] PROBLEM - Disk space on mw1233 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:49:33] RECOVERY - Disk space on mw1233 is OK: DISK OK [10:54:25] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 112 failures [10:56:36] PROBLEM - Host mw1233 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:49] _joe_: next time schedule a downtime :D [10:57:20] RECOVERY - DPKG on mw1233 is OK: All packages OK [10:57:23] <_joe_> matanya: I did... 
[10:57:26] RECOVERY - Host mw1233 is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [10:57:46] mw1033 != mw1233 [10:57:54] <_joe_> for both [10:57:57] * matanya is stupid [11:04:31] PROBLEM - Apache HTTP on mw1233 is CRITICAL: Connection refused [11:05:38] PROBLEM - HHVM rendering on mw1233 is CRITICAL: Connection refused [11:06:19] PROBLEM - Apache HTTP on mw1236 is CRITICAL: Connection refused [11:07:19] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Puppet has 112 failures [11:07:49] PROBLEM - HHVM rendering on mw1234 is CRITICAL: Connection refused [11:08:20] PROBLEM - HHVM rendering on mw1235 is CRITICAL: Connection refused [11:08:54] PROBLEM - HHVM rendering on mw1236 is CRITICAL: Connection refused [11:10:08] PROBLEM - Apache HTTP on mw1234 is CRITICAL: Connection refused [11:10:33] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Puppet has 8 failures [11:10:38] PROBLEM - Apache HTTP on mw1235 is CRITICAL: Connection refused [11:10:52] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 112 failures [11:11:39] RECOVERY - HHVM rendering on mw1236 is OK: HTTP OK: HTTP/1.1 200 OK - 72551 bytes in 4.503 second response time [11:12:09] RECOVERY - Apache HTTP on mw1236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time [11:12:59] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.701 second response time [11:13:10] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: /srv 18603 MB (3% inode=97%): [11:13:10] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:13:10] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.634 second response time [11:13:22] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:13:42] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.637 second response time [11:13:43] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 72551 bytes in 1.873 second response time [11:13:48] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:13:58] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 72551 bytes in 1.838 second response time [11:14:18] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 72551 bytes in 1.870 second response time [11:14:29] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:26:20] PROBLEM - check configured eth on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:01] PROBLEM - check if dhclient is running on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:20] PROBLEM - check if salt-minion is running on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:59] PROBLEM - nutcracker port on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:10] PROBLEM - nutcracker process on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:33] PROBLEM - DPKG on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:33] PROBLEM - puppet last run on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:28:37] PROBLEM - Disk space on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:29:20] PROBLEM - HHVM processes on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:30:10] PROBLEM - RAID on mw1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:31:34] RECOVERY - Disk space on mw1034 is OK: DISK OK [11:31:55] PROBLEM - puppet last run on mw1035 is CRITICAL: CRITICAL: Puppet has 112 failures [11:31:56] RECOVERY - check configured eth on mw1034 is OK: NRPE: Unable to read output [11:32:05] RECOVERY - HHVM processes on mw1034 is OK: PROCS OK: 1 process with command name hhvm [11:32:35] RECOVERY - check if dhclient is running on mw1034 is OK: PROCS OK: 0 processes with command name dhclient [11:32:55] RECOVERY - check if salt-minion is running on mw1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:33:08] RECOVERY - RAID on mw1034 is OK: OK: no RAID installed [11:33:37] RECOVERY - nutcracker port on mw1034 is OK: TCP OK - 0.000 second response time on port 11212 [11:33:38] RECOVERY - nutcracker process on mw1034 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:33:57] RECOVERY - DPKG on mw1034 is OK: All packages OK [11:39:37] PROBLEM - puppet last run on mw1034 is CRITICAL: CRITICAL: Puppet has 111 failures [11:45:28] PROBLEM - HHVM rendering on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:16] RECOVERY - puppet last run on mw1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:48:06] RECOVERY - HHVM rendering on mw1034 is OK: HTTP OK: HTTP/1.1 200 OK - 72514 bytes in 0.312 second response time [11:50:46] PROBLEM - puppet last run on mw1033 is CRITICAL: CRITICAL: Puppet has 112 failures [11:51:55] PROBLEM - HHVM rendering on mw1035 is CRITICAL: Connection refused [11:57:36] RECOVERY - HHVM rendering on mw1035 is OK: HTTP OK: HTTP/1.1 200 OK - 72515 bytes in 2.988 second response time [11:59:19] RECOVERY - puppet last run on mw1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:00:27] RECOVERY - puppet last run on mw1035 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:17:25] (03CR) 10Yuvipanda: [C: 032] extdist: Add composer location to config [puppet] - 10https://gerrit.wikimedia.org/r/176631 (owner: 10Legoktm) [12:18:05] legoktm: ^ merged [12:18:43] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures [12:19:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [12:30:04] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:39:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [12:56:00] PROBLEM - nutcracker process on mw1039 is CRITICAL: Connection refused by host [12:57:11] (03PS1) 10Giuseppe Lavagetto: monitoring: refine alarms on HHVM [puppet] - 10https://gerrit.wikimedia.org/r/176653 [12:57:29] <_joe_> what part of "scheduled downtime" you don't understand, icinga? 
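A quick aside on the recurring "scheduled downtime" pain above: downtime is normally pushed into Icinga through its external command file, so a by-hand version looks roughly like the sketch below. The command-file path, host name and duration are assumptions, not taken from this log, and as _joe_ explains further down, a reimage that regenerates the host's checks (via naggen) can silently drop a downtime that was already scheduled.

    # Rough sketch: schedule two hours of downtime for a host and all of its
    # services via Icinga's external command pipe (path is an assumption).
    now=$(date +%s)
    end=$((now + 2*3600))
    cmdfile=/var/lib/icinga/rw/icinga.cmd

    printf '[%s] SCHEDULE_HOST_DOWNTIME;mw1233;%s;%s;1;0;7200;joe;reimage\n' \
        "$now" "$now" "$end" > "$cmdfile"
    printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;mw1233;%s;%s;1;0;7200;joe;reimage\n' \
        "$now" "$now" "$end" > "$cmdfile"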
[12:58:03] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring: refine alarms on HHVM [puppet] - 10https://gerrit.wikimedia.org/r/176653 (owner: 10Giuseppe Lavagetto) [12:59:13] PROBLEM - puppet last run on mw1039 is CRITICAL: Connection refused by host [12:59:25] PROBLEM - check if dhclient is running on mw1039 is CRITICAL: Connection refused by host [13:00:07] PROBLEM - Apache HTTP on mw1039 is CRITICAL: Connection refused [13:00:25] PROBLEM - DPKG on mw1039 is CRITICAL: Connection refused by host [13:00:34] PROBLEM - SSH on mw1039 is CRITICAL: Connection refused [13:00:35] PROBLEM - Disk space on mw1039 is CRITICAL: Connection refused by host [13:01:17] PROBLEM - RAID on mw1039 is CRITICAL: Connection refused by host [13:01:27] PROBLEM - check if salt-minion is running on mw1039 is CRITICAL: Connection refused by host [13:01:29] PROBLEM - check configured eth on mw1039 is CRITICAL: Connection refused by host [13:01:49] PROBLEM - nutcracker port on mw1039 is CRITICAL: Connection refused by host [13:09:09] RECOVERY - SSH on mw1039 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [13:17:33] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 200 OK - 11783 bytes in 0.014 second response time [13:20:56] RECOVERY - check if dhclient is running on mw1039 is OK: PROCS OK: 0 processes with command name dhclient [13:21:15] RECOVERY - nutcracker port on mw1039 is OK: TCP OK - 0.000 second response time on port 11212 [13:21:15] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Puppet has 1 failures [13:21:34] RECOVERY - nutcracker process on mw1039 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:22:41] RECOVERY - Disk space on mw1039 is OK: DISK OK [13:24:04] RECOVERY - RAID on mw1039 is OK: OK: no RAID installed [13:24:35] RECOVERY - check configured eth on mw1039 is OK: NRPE: Unable to read output [13:24:52] RECOVERY - check if salt-minion is running on mw1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:25:06] RECOVERY - DPKG on mw1039 is OK: All packages OK [13:30:07] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 112 failures [13:32:30] (03PS1) 10Faidon Liambotis: Unbreak misc::statistics on <= precise systems [puppet] - 10https://gerrit.wikimedia.org/r/176656 [13:33:51] (03CR) 10Faidon Liambotis: [C: 032] Unbreak misc::statistics on <= precise systems [puppet] - 10https://gerrit.wikimedia.org/r/176656 (owner: 10Faidon Liambotis) [13:36:56] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [13:37:37] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [13:37:42] _joe_: since you're in a fixing HHVM alerts mood :), a bunch of "PROCS WARNING: 2 processes with command name 'hhvm'" alerts [13:37:58] <_joe_> paravoid: yes it's next on the list [13:38:07] also ocg is apparently full [13:38:08] again [13:38:10] <_joe_> when I wrote that I didn't expect we shelled out so often [13:38:26] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:38:28] <_joe_> paravoid: it was at 98% this morning and I forgot to check [13:38:36] <_joe_> only this time it's the /srv dir [13:38:39] 97%, 99% and 100% [13:38:49] also mw1190: eth0 has different negotiated speed than requested [13:39:03] and mw1039 with a bunch of errors [13:39:20] <_joe_> paravoid: that's me reimaging and scheduled downtime 
disappearing magically [13:39:33] 1039 you mean [13:39:35] <_joe_> yes [13:39:38] 1190 is a broken cable probably [13:40:32] <_joe_> yes [13:40:46] <_joe_> ok I'll take a look at ocg [13:40:49] <_joe_> gee [13:40:59] <_joe_> they probably changed the cache retention policy [13:41:41] godog: ms-be2014 runs puppet and isn't very happy about bcache [13:41:49] puppet isn't that is [13:44:00] <_joe_> !log removing cache files from ocg1001, when they're older than 3 days [13:44:06] Logged the message, Master [13:44:09] RECOVERY - check if salt-minion is running on stat1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:46:56] <_joe_> !log removing the same files from ocg1002,3 as well [13:46:58] Logged the message, Master [13:48:44] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:52:42] RECOVERY - Disk space on ocg1001 is OK: DISK OK [13:53:01] (03PS1) 10Faidon Liambotis: hhvm: remove check_procs check [puppet] - 10https://gerrit.wikimedia.org/r/176659 [13:53:04] RECOVERY - Disk space on ocg1003 is OK: DISK OK [13:53:11] _joe_: do you mind if I merge this? [13:53:26] <_joe_> not at all :) [13:54:03] <_joe_> It's quite useless now in fact [13:54:05] RECOVERY - Disk space on ocg1002 is OK: DISK OK [13:54:21] I can make it 1: [13:54:39] I guess that's better [13:54:50] I'll remove the -w 1:1 part [13:55:52] (03PS2) 10Faidon Liambotis: hhvm: remove check_procs' WARNING state [puppet] - 10https://gerrit.wikimedia.org/r/176659 [13:56:54] (03CR) 10Faidon Liambotis: [C: 032] hhvm: remove check_procs' WARNING state [puppet] - 10https://gerrit.wikimedia.org/r/176659 (owner: 10Faidon Liambotis) [14:11:24] (03PS1) 10Faidon Liambotis: ldap: fix LDAP's monitoring::service CN matching [puppet] - 10https://gerrit.wikimedia.org/r/176662 [14:11:48] alerting for 67 days, acknowledged with comment "foo bar baz" [14:12:58] (03CR) 10Faidon Liambotis: [C: 032] ldap: fix LDAP's monitoring::service CN matching [puppet] - 10https://gerrit.wikimedia.org/r/176662 (owner: 10Faidon Liambotis) [14:13:59] <_joe_> the comment was to acknowledge it was being ignored [14:16:20] <_joe_> !log repooling mw1036-mw1040 [14:16:25] Logged the message, Master [14:22:01] RECOVERY - Certificate expiration on labcontrol2001 is OK: SSL_CERT OK - X.509 certificate for ldap-codfw.wikimedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Sep 20 19:36:03 2015 GMT (expires in 293 days) [14:22:26] there we go :) [14:26:07] <_joe_> !log depooling mw1041-1046 [14:26:12] Logged the message, Master [14:27:50] SSL_CERT CRITICAL ldap-eqiad.wikimedia.org: invalid CN ('ldap-eqiad.wikimedia.org' does not match 'ldap-codfw.wikimedia.org') [14:27:57] interesting [14:32:45] (03PS1) 10Faidon Liambotis: ldap: neptunium is eqiad, virt1000 is no more. 
[puppet] - 10https://gerrit.wikimedia.org/r/176665 [14:33:31] paravoid: indeed it isn't amused, the obvious choice is having an additional parameter to have caching on/off, anyways yes it is on my radar [14:34:01] (03PS2) 10Faidon Liambotis: ldap: neptunium is eqiad, virt1000 is no more [puppet] - 10https://gerrit.wikimedia.org/r/176665 [14:37:59] (03CR) 10Faidon Liambotis: [C: 032] ldap: neptunium is eqiad, virt1000 is no more [puppet] - 10https://gerrit.wikimedia.org/r/176665 (owner: 10Faidon Liambotis) [14:48:51] RECOVERY - Certificate expiration on neptunium is OK: SSL_CERT OK - X.509 certificate for ldap-eqiad.wikimedia.org from GlobalSign Organization Validation CA - SHA256 - G2 valid until Sep 20 19:41:02 2015 GMT (expires in 293 days) [14:49:15] aaand there we go [14:49:44] hi manybubbles [14:50:23] paravoid: hi! [14:50:28] how are things? [14:50:49] good :) [14:50:56] had fun? [14:52:01] on my holiday? yes! lots of time with family [14:52:12] yeah :) [14:52:39] things looked exciting yesterday but otherwise pretty good while we were all out [14:53:28] <^d> g'morning paravoid, manybubbles [14:53:35] ^d: morning! [14:58:07] linked invitations are viral. you get one and you ignore it for a few days and then you are like "I should accept that" and when you do linkedin is like "do you know these 100 people?" and you are like, "yeah, mostly" and then you click "I know this person" until you get bored. its like a chain letter. [15:05:25] PROBLEM - Disk space on mw1041 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:25] PROBLEM - HHVM processes on mw1040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:05:56] <_joe_> grrr [15:06:41] PROBLEM - HHVM processes on mw1041 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:01] PROBLEM - puppet last run on mw1043 is CRITICAL: CRITICAL: Puppet has 112 failures [15:07:26] PROBLEM - RAID on mw1040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:42] PROBLEM - puppet last run on mw1044 is CRITICAL: CRITICAL: Puppet has 112 failures [15:07:42] PROBLEM - RAID on mw1041 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:42] PROBLEM - check configured eth on mw1040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:08:09] PROBLEM - check if dhclient is running on mw1040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:08:12] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: Puppet has 112 failures [15:08:24] RECOVERY - Disk space on mw1041 is OK: DISK OK [15:08:24] RECOVERY - HHVM processes on mw1040 is OK: PROCS OK: 1 process with command name hhvm [15:08:44] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 8 failures [15:08:54] PROBLEM - DPKG on mw1046 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:08:56] _joe_: ? [15:09:00] what are all these? 
[15:09:11] <_joe_> servers being reimaged, with scheduled downtime [15:09:16] oh [15:09:18] <_joe_> but reimaging cleans them from nagios [15:09:20] heh [15:09:23] RECOVERY - HHVM processes on mw1041 is OK: PROCS OK: 1 process with command name hhvm [15:09:26] <_joe_> so, sometimes the downtime is lost [15:09:27] <_joe_> :/ [15:09:43] <_joe_> so either we find a clever way not to make naggen add those [15:09:49] <_joe_> and I think we might have [15:10:03] <_joe_> or it's going to be like this for every server we reimage [15:10:03] PROBLEM - DPKG on mw1041 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:10:13] RECOVERY - RAID on mw1040 is OK: OK: no RAID installed [15:10:26] naggen is adding these only after the services on the host are provisioned, no? [15:10:34] RECOVERY - RAID on mw1041 is OK: OK: no RAID installed [15:10:37] RECOVERY - check configured eth on mw1040 is OK: NRPE: Unable to read output [15:10:49] RECOVERY - check if dhclient is running on mw1040 is OK: PROCS OK: 0 processes with command name dhclient [15:11:29] <_joe_> paravoid: well, that's true but not for everything it works [15:11:33] RECOVERY - DPKG on mw1046 is OK: All packages OK [15:11:34] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 112 failures [15:12:04] <_joe_> for example, the apache alarm depend on apache, but apache doesn't get started after the first run, as scap runs on the second one [15:12:27] <_joe_> (the first failing because we need to accept the salt password, so installing scap usually fails) [15:12:53] RECOVERY - DPKG on mw1041 is OK: All packages OK [15:13:30] <_joe_> (also, the first 2 puppet runs on an appserver take ~ 45 minutes, it's usually much less on other servers) [15:17:54] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: Puppet has 7 failures [15:18:36] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:37] PROBLEM - HHVM rendering on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:47] RECOVERY - HHVM rendering on mw1041 is OK: HTTP OK: HTTP/1.1 200 OK - 73135 bytes in 0.303 second response time [15:23:45] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:24:15] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:30] <_joe_> !log repooling mw1041-mw1046 [15:24:35] Logged the message, Master [15:24:55] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:25:27] (03PS1) 10Andrew Bogott: -- DRAFT -- [puppet] - 10https://gerrit.wikimedia.org/r/176670 [15:25:29] (03PS1) 10Andrew Bogott: -- DRAFT -- [puppet] - 10https://gerrit.wikimedia.org/r/176671 [15:25:46] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:25:46] <_joe_> andrewbogott: gerrit review -D :P [15:26:22] _joe_: I never remember that until it's too late :) [15:26:29] <_joe_> !log depooling mw1047-mw1052 [15:26:32] Logged the message, Master [15:28:39] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:11] RECOVERY - puppet last run on mw1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:33:10] (03PS5) 10BBlack: Remove old protoproxy / ssl[13]00x config / star certs [puppet] - 10https://gerrit.wikimedia.org/r/175466 [15:35:25] RECOVERY - puppet 
last run on mw1043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:49:31] !log restarted logstash on logstash1001; log2udp events were not being processed [15:49:35] Logged the message, Master [15:51:02] manybubbles, ^d, marktraceur: Who wants to SWAT this morning? [15:51:17] anomie: i'd like to pass this morning if that is ok [15:51:25] * anomie would like to pass too [15:53:11] Not it. [15:53:15] I have to deal with car drama. [15:53:19] * cscott is never it [15:57:03] <^d> I guess me. Was hoping by keeping my mouth shut... [15:58:00] <_joe_> cscott: hey [15:58:18] <_joe_> I had to manually purge files older than 3 days from ocg100* [15:58:28] <_joe_> they had their /srv partition full [16:00:04] manybubbles, anomie, ^d, marktraceur, Glaisher: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141201T1600). [16:01:17] <^d> Glaisher: you about? [16:01:23] _joe_: orly? [16:01:36] ^d: yes, I am [16:01:44] <^d> Okay, let's do this :) [16:01:44] _joe_: let me take a look at icinga, that shouldn't happen [16:01:50] (03CR) 10Chad: [C: 032] Add 'move-subpages' right to "closer" and "filemover" groups at ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176272 (owner: 10Glaisher) [16:01:52] (03CR) 10Chad: [C: 032] Restore default configuration for ruwikisource bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176287 (owner: 10Glaisher) [16:01:54] (03CR) 10Chad: [C: 032] Modify abusefilter configuration for metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176504 (owner: 10Glaisher) [16:01:57] wait what.. jouncebot pings me too now.. ^^ [16:02:08] (03Merged) 10jenkins-bot: Add 'move-subpages' right to "closer" and "filemover" groups at ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176272 (owner: 10Glaisher) [16:02:09] <^d> Maybe? [16:02:13] <^d> Hmm [16:02:17] (03Merged) 10jenkins-bot: Restore default configuration for ruwikisource bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176287 (owner: 10Glaisher) [16:02:25] (03Merged) 10jenkins-bot: Modify abusefilter configuration for metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176504 (owner: 10Glaisher) [16:04:14] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 05s) [16:04:17] Logged the message, Master [16:04:32] !log demon Synchronized wmf-config/abusefilter.php: (no message) (duration: 00m 05s) [16:04:34] Logged the message, Master [16:05:10] <^d> Glaisher: All done [16:05:12] _joe_: what hapened on sunday with OCG? looks like the redis job queue was completely wiped out [16:05:25] looking at https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=PDF+servers+eqiad&h=&tab=m&vn=&hide-hf=false&m=ocg_job_status_queue&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [16:05:44] jouncebot pings every {{ircnick}} template in the deploy window (or at least is meant to) [16:05:47] ^d: All live.. 
Thanks :) [16:06:11] PROBLEM - DPKG on mw1052 is CRITICAL: Timeout while attempting connection [16:06:11] <^d> Glaisher: yw [16:06:46] PROBLEM - Disk space on mw1052 is CRITICAL: Timeout while attempting connection [16:07:18] _joe_: and https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=PDF+servers+eqiad&h=&tab=m&vn=&hide-hf=false&m=ocg_data_filesystem_utilization&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name seems to indicate that something broke in the GC three weeks ago, i'll have to look at that [16:07:41] PROBLEM - RAID on mw1052 is CRITICAL: Connection refused by host [16:08:13] PROBLEM - check configured eth on mw1052 is CRITICAL: Connection refused by host [16:08:20] PROBLEM - check if dhclient is running on mw1052 is CRITICAL: Connection refused by host [16:08:40] PROBLEM - check if salt-minion is running on mw1052 is CRITICAL: Connection refused by host [16:09:13] PROBLEM - nutcracker port on mw1052 is CRITICAL: Connection refused by host [16:09:21] PROBLEM - nutcracker process on mw1052 is CRITICAL: Connection refused by host [16:09:41] PROBLEM - puppet last run on mw1052 is CRITICAL: Connection refused by host [16:09:41] PROBLEM - puppet last run on mw1048 is CRITICAL: CRITICAL: Puppet has 112 failures [16:09:58] <_joe_> looks like the downtime finished [16:10:15] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: Puppet has 112 failures [16:10:22] PROBLEM - Apache HTTP on mw1052 is CRITICAL: Connection refused [16:13:21] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 200 OK - 11783 bytes in 0.006 second response time [16:14:43] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet has 112 failures [16:16:13] PROBLEM - HHVM rendering on mw1047 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 687 bytes in 5.249 second response time [16:17:08] (03PS1) 10RobH: setting mgmt ip for server heze [dns] - 10https://gerrit.wikimedia.org/r/176674 [16:17:43] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:17:46] (03CR) 10RobH: [C: 032] setting mgmt ip for server heze [dns] - 10https://gerrit.wikimedia.org/r/176674 (owner: 10RobH) [16:19:12] RECOVERY - RAID on mw1052 is OK: OK: no RAID installed [16:19:39] RECOVERY - check configured eth on mw1052 is OK: NRPE: Unable to read output [16:19:42] RECOVERY - check if dhclient is running on mw1052 is OK: PROCS OK: 0 processes with command name dhclient [16:20:02] RECOVERY - check if salt-minion is running on mw1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:20:59] RECOVERY - nutcracker port on mw1052 is OK: TCP OK - 0.000 second response time on port 11212 [16:21:58] RECOVERY - DPKG on mw1052 is OK: All packages OK [16:21:58] RECOVERY - nutcracker process on mw1052 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:22:46] ^d: Still SWATting? [16:22:51] <^d> No, all done [16:22:59] RECOVERY - Disk space on mw1052 is OK: DISK OK [16:23:07] ok, I'm going to backport https://gerrit.wikimedia.org/r/#/c/176673/ [16:24:30] <_joe_> ^d: ouch I was still reimaging servers, did you have any failures? [16:24:37] RECOVERY - puppet last run on mw1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:24:41] <_joe_> I'm not repooling them until tomorrow then [16:25:01] <^d> Oh hmm, 1050-52? [16:25:37] <^d> 48? 
[16:26:21] <_joe_> 47-52 [16:26:46] <_joe_> and 50-52 are still being reimaged [16:26:48] <_joe_> brb [16:27:29] RECOVERY - HHVM rendering on mw1047 is OK: HTTP OK: HTTP/1.1 200 OK - 73196 bytes in 5.151 second response time [16:27:37] PROBLEM - puppet last run on mw1050 is CRITICAL: CRITICAL: Puppet has 112 failures [16:27:54] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 112 failures [16:28:11] PROBLEM - puppet last run on mw1051 is CRITICAL: CRITICAL: Puppet has 112 failures [16:29:20] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:32:57] (03PS1) 10Chad: Don't load OpenSeachXml if it's in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176679 [16:33:44] (03PS2) 10Chad: Don't load OpenSeachXml if it's in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176679 [16:33:46] Who has live hacks on 1.25wmf10 on tin? Grr. [16:34:28] (03CR) 10BryanDavis: [C: 031] Don't load OpenSeachXml if it's in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176679 (owner: 10Chad) [16:35:32] !log anomie Synchronized php-1.25wmf10/extensions/SyntaxHighlight_GeSHi/geshi/geshi.php: SWAT: Fix highly recursive number highlighting regex in GeSHi (duration: 00m 10s) [16:35:33] anomie: ^ test please [16:35:36] Logged the message, Master [16:35:40] anomie: Looks good [16:35:51] <^d> anomie: mwtidy? my guess would be tim [16:37:01] (03CR) 10Chad: [C: 032] Don't load OpenSeachXml if it's in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176679 (owner: 10Chad) [16:37:14] (03Merged) 10jenkins-bot: Don't load OpenSeachXml if it's in core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176679 (owner: 10Chad) [16:37:15] !log anomie Synchronized php-1.25wmf9/extensions/SyntaxHighlight_GeSHi/geshi/geshi.php: SWAT: Fix highly recursive number highlighting regex in GeSHi (duration: 00m 07s) [16:37:15] anomie: ^ Test please [16:37:17] Logged the message, Master [16:37:21] <^d> looks good! [16:37:24] anomie: Good [16:38:16] !log demon Synchronized wmf-config/CommonSettings.php: opensearchxml conditional include (duration: 00m 06s) [16:38:17] Logged the message, Master [16:38:26] <^d> bd808: fix is in prod, should be in beta shortly. [16:38:29] <^d> or already. [16:38:30] <^d> dunno [16:38:36] thx ^d [16:38:39] <^d> yw [16:39:24] (03PS4) 10BryanDavis: Use hiera to configure udp2log endpoint for ::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/176191 [16:40:12] (03CR) 10BBlack: [C: 032] Remove old protoproxy / ssl[13]00x config / star certs [puppet] - 10https://gerrit.wikimedia.org/r/175466 (owner: 10BBlack) [16:40:56] bblack: \o/ \o/ \o/ [16:41:07] how are we on CPU post-SSL by the way? [16:41:47] we seem fine [16:42:15] the only thing I hate a little bit about the new situation is that we're not source-hashing for all the obvious reasons, so our rate of renegotiations for clients has surely gone up. [16:42:31] oh? [16:42:42] I hadn't realized [16:43:01] it was disabled for ulsfo even before I started working on it (well disabled in the sense that it's not set to "sh" in the LVS config) [16:43:16] I assumed that was intentional because of the issues with ipvs sh and downed/removed nodes, etc [16:44:32] arguably either way is an acceptable state to be in, they just have different suboptimal tradeoffs. if the renegotiations aren't killing us or making the client experience awful though (and that seems to be the case), I'd rather have it this way till sh is fixed though. 
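To make the trade-off bblack and paravoid are discussing concrete: at the IPVS level the difference is just the scheduler chosen for the :443 virtual service. The commands below are illustrative only; the VIP is a placeholder, and in production the scheduler is set through puppet/PyBal rather than by hand (the switch itself lands later in this log as https://gerrit.wikimedia.org/r/176692).

    # Weighted round-robin: connections from one client can land on different
    # backends, so TLS sessions cannot be resumed and full handshakes go up.
    ipvsadm -E -t 198.51.100.10:443 -s wrr

    # Source hashing: a given client IP sticks to one backend, keeping session
    # resumption cheap, at the cost of uneven load and the known problems
    # when a hashed-to backend is depooled.
    ipvsadm -E -t 198.51.100.10:443 -s sh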
[16:46:55] the most pronounced way to see the CPU effects is in the monthly graph on esams varnishes for cpu/load: e.g. http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Text+caches+esams&m=cpu_report&s=by+name&mc=2&g=load_report [16:47:40] aha [16:47:53] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:48:13] yeah we'll need more servers if we ever want to go ssl for everyone :) [16:48:48] well that and fixing sh will help, and maybe switching up ciphersuite to pick more efficient choices, etc [16:49:36] but in the overall, I think things fit together nicely this way. varnish is mostly about memory and i/o, and SSL is mostly about CPU. [16:49:46] <_joe_> bblack: the chiphersuite is pretty well chosen tbh [16:49:53] <_joe_> I worked on that quite a bit [16:50:16] <_joe_> and well, it surely needs to be refreshed everytime something happens in browserland [16:50:17] RECOVERY - puppet last run on mw1050 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:50:26] <_joe_> and I can have missed a few things [16:50:34] <_joe_> but in general, it should be good [16:50:50] RECOVERY - puppet last run on mw1051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:50:51] for good clients, yes. I think for some of the ancient clients, we're currently making some security tradeoffs on perf? [16:51:02] I donno I'd have to go stare at it again to refresh my brain [16:51:32] <_joe_> bblack: surely, yes [16:51:45] <_joe_> the focus was "lightest chipher with PFS" [16:51:57] <_joe_> there are a couple of emails I sent at the time [16:52:43] <_joe_> and jzerebecki was the original patch author and can have some more insights for sure :) [16:53:56] yeah, he commented on my abandoned patch the other day as well: https://gerrit.wikimedia.org/r/#/c/170879/ [16:54:04] (re: PFS for some ancient clients) [16:56:33] PROBLEM - check if dhclient is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:59:27] RECOVERY - check if dhclient is running on rhenium is OK: PROCS OK: 0 processes with command name dhclient [17:02:33] !log created empty jessie-wikimedia repo on Carbon [17:02:37] Logged the message, Master [17:21:07] (03PS1) 10coren: Add labstore* to codfw DNS [dns] - 10https://gerrit.wikimedia.org/r/176685 [17:21:59] YuviPanda: ty :D [17:22:40] (03PS1) 10Vogone: Add new import sources to dewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176686 [17:23:33] (03PS1) 10coren: Add codfw labs support and labstores to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/176687 [17:26:25] bblack: yes not using non-ec dhe is a trade in favor of low latency and less cpu usage and against security (pfs). but we would have to drop java6 support if we want to generate 2k rsa keys with dhe. we would have to drop IE 8 / XP to disable non-pfs ciphers and/or to disable rc4. so the only ones from ssllabs list that would use non-EC DHE would be OpenSSL 0.9.8y and Android 2.3.7 (and BingPreview Jun 2014). [17:26:33] akosiaris: Can you do a quick check of both patches? ^^ I changed your comment in the zone file because it seemed obvious you wanted /24s though. [17:27:16] I just want to make sure I didn't misunderstand you. :-) [17:27:41] Coren: sure, gimme a sec [17:28:43] jzerebecki: even though I proposed changing that, I'm kinda thinking at this point, people who care about PFS should just upgrade their browser. 
And maybe we should start putting in some banner type stuff via mediawiki: if($using_SSL && $browser_sucks) display_banner_about_insecure_browser [17:28:57] PROBLEM - HHVM busy threads on mw1232 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [115.2] [17:31:21] basically what i said above would be equivalent to https://wiki.mozilla.org/Security/Server_Side_TLS#Intermediate_compatibility_.28default.29 [17:31:46] RECOVERY - HHVM busy threads on mw1232 is OK: OK: Less than 30.00% above the threshold [76.8] [17:33:02] another random related thought: it would be interesting (but I'm sure not currently implemented, right?) if nginx at the SSL termination point could do something like "if(chosen_cipher =~ ...) set-header: X-SSL-Ugly-Cipher", for mediawiki to later consume for that banner, instead of trying to parse user-agent to detect the condition. [17:33:06] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [115.2] [17:33:10] we could disable non-pfs, rc4 and not enable non-EC DHE then it is a trade off against old clients :) [17:33:59] (03CR) 10Alexandros Kosiaris: [C: 032] Add labstore* to codfw DNS [dns] - 10https://gerrit.wikimedia.org/r/176685 (owner: 10coren) [17:36:01] RECOVERY - HHVM busy threads on mw1229 is OK: OK: Less than 30.00% above the threshold [76.8] [17:36:06] akosiaris: Oh, don't +2 the dhcp one if you haven't noticed - I just saw I am dumb. :-) [17:36:50] (03PS2) 10coren: Add codfw labs support and labstores to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/176687 [17:37:34] * Coren is too used to typing 'eqiad' everywhere. :-) [17:38:19] (03CR) 10Alexandros Kosiaris: [C: 032] Add codfw labs support and labstores to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/176687 (owner: 10coren) [17:46:04] (03PS1) 10BBlack: Switch SSL loadbalancing to sh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/176692 [17:51:49] (03CR) 10Faidon Liambotis: [C: 031] Switch SSL loadbalancing to sh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/176692 (owner: 10BBlack) [17:56:57] (03PS1) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm [puppet] - 10https://gerrit.wikimedia.org/r/176693 [17:59:57] (03CR) 10BBlack: [C: 032] Switch SSL loadbalancing to sh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/176692 (owner: 10BBlack) [18:04:48] (03CR) 10Faidon Liambotis: "See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/176693 (owner: 10BryanDavis) [18:10:04] !log stopping pybal on primary eqiad LVSes to test 'sh' change for SSL (already restarted for change on backup LVSes) [18:10:06] Logged the message, Master [18:10:13] (03PS2) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm [puppet] - 10https://gerrit.wikimedia.org/r/176693 [18:10:33] paravoid: My rsyslog fu only extends to cut-n-paste, but pointers to a better syntax are welcome. [18:15:20] !log ditto on pybal 'sh' stuff for esams [18:15:25] Logged the message, Master [18:19:56] (03CR) 10EBernhardson: [C: 031] "afaik relevant blockers have all been closed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [18:21:41] !log eqiad+esams LVS back to normal, with new config for 'sh' for SSL [18:21:56] \o/ [18:22:05] Logged the message, Master [18:25:23] YuviPanda: hi! 
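On bblack's X-SSL-Ugly-Cipher idea from 17:33 above: nginx already exposes the negotiated cipher as $ssl_cipher, so something in that spirit is possible without new code. A rough sketch follows; it is not the production configuration, and the header name, cipher patterns and upstream are all assumptions.

    # Classify "ugly" ciphers at the SSL terminator and pass a hint upstream,
    # so MediaWiki could show an insecure-browser banner without parsing
    # User-Agent strings. Sketch only; patterns and names are assumptions.
    map $ssl_cipher $ugly_cipher {
        default      "";
        ~RC4         "1";
        ~DES-CBC3    "1";
    }

    server {
        listen 443 ssl;
        # certs, cipher list, etc. elided
        location / {
            proxy_set_header X-SSL-Ugly-Cipher $ugly_cipher;
            proxy_pass http://127.0.0.1:80;   # placeholder backend
        }
    }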
[18:25:50] !log ulsfo LVS updated for 'sh' for SSL as well [18:25:53] Logged the message, Master [18:30:43] (03PS3) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm [puppet] - 10https://gerrit.wikimedia.org/r/176693 [18:32:53] (03CR) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/176693 (owner: 10BryanDavis) [18:33:40] bd808: basically the ':programname, isequal, "hhvm" @foo' syntax can be rewitten as 'if $programname == "hhvm" then @foo' [18:33:48] google for rsyslog "new style" or something [18:33:55] the current config uses both, that's why I said it's confusing [18:34:13] the old style is deprecated but it's okay to use it [18:34:24] mixing both in the same file is not great though, but that's hardly your fault :) [18:34:33] paravoid: Yup. I think I found the reference cods [18:34:38] s/cods/docs/ [18:35:31] I don't consider this a blocker for you change [18:35:42] just saying, if you have a test rig all set up in labs... ;)) [18:36:24] bd808: btw, newer versions of rsyslog (including the one we run with HAT) have structured logging and some other cool stuff [18:36:29] an elasticsearch backend as well iir [18:36:31] iirc [18:36:39] not sure if it's logstash compatible at all though [18:37:06] Yeah I looked at that a bit. For now I think having logstash sit in between will actually be helpful. [18:37:26] We can use it to rewrite some events and ignore others [18:37:38] I'm not actually suggesting it, just fyi :) [18:37:46] I haven't researched the problem at all, whatever you think is best [18:38:07] *nod* I'm glad to have the input [18:39:23] (03PS4) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm [puppet] - 10https://gerrit.wikimedia.org/r/176693 [18:43:44] (03PS1) 10Dzahn: add bmansurov to researchers admin group [puppet] - 10https://gerrit.wikimedia.org/r/176701 [18:55:56] (03PS1) 10Aaron Schulz: Added some redis queue comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176706 [18:57:53] greg-g, any objects if i go now instead of 14:00 ? [18:58:10] yurikR: not now, max is doing something [18:58:18] MaxSem, ? [18:59:13] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures [18:59:41] yurikR, I have a security fix [19:00:22] MaxSem, could you ping me when done? [19:00:43] sure [19:12:31] joakino: hey! [19:14:44] yurikR, having problems with gerritcrap, go ahead [19:16:08] (03PS1) 10BryanDavis: Remove OpenSeachXml from extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176714 [19:16:30] ^d: ^ [19:18:13] MaxSem, something tells me that if i start +2ing branches, they might not merge either [19:24:32] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:25:55] Someone pls remind me which repo contains our Varnish config? [19:26:19] the repo of "bblack knows all" [19:26:57] awight: it's in ops/puppet in various places [19:28:02] bd808: thanks! 
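Picking up paravoid's rsyslog note from around 18:33: the two styles below do the same thing, and mixing them in one file is legal but (as he says) confusing. The logstash host and port are placeholders, not the real endpoint; a single @ forwards over UDP, @@ would use TCP.

    # Legacy property-based filter (the style the existing config uses):
    :programname, isequal, "hhvm"     @logstash.example.net:10514
    :programname, isequal, "apache2"  @logstash.example.net:10514

    # The same rules in the newer expression-based ("new style") syntax:
    if $programname == "hhvm" then @logstash.example.net:10514
    if $programname == "apache2" then @logstash.example.net:10514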
[19:29:12] yurikR, I've resolved my problems and deploying shortly [19:29:21] MaxSem, ok [19:41:18] !log Stashed Tim's uncommitted tidy-related changes on tin [19:41:23] Logged the message, Master [19:43:27] !log maxsem Synchronized php-1.25wmf9/extensions/Popups/: https://gerrit.wikimedia.org/r/#/c/176715/ (duration: 00m 05s) [19:43:29] Logged the message, Master [19:43:41] !log maxsem Synchronized php-1.25wmf10/extensions/Popups/: https://gerrit.wikimedia.org/r/#/c/176715/ (duration: 00m 06s) [19:43:43] Logged the message, Master [19:44:50] MaxSem: thanks for logging that [19:45:19] There's a joke in there somewhere [19:45:58] chasemp: it looks like I am not a member of https://phabricator.wikimedia.org/tag/operations/ -- I believe my name on phab is 'andrew' [19:46:19] cajoel_: found ticket, moving and taking it [19:46:27] robh: thanks [19:46:28] 8943 [19:46:32] yep [19:46:36] andrewbogott: https://phabricator.wikimedia.org/p/andrew/ ? [19:46:49] that's me! [19:46:53] done [19:47:04] thx [19:53:40] yurikR, I'm done [19:53:50] thx [19:55:00] (03CR) 10Yurik: [C: 032] Vary mdot webroot on Accept-Language, X-Subdomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175797 (owner: 10Dr0ptp4kt) [19:55:13] (03Merged) 10jenkins-bot: Vary mdot webroot on Accept-Language, X-Subdomain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175797 (owner: 10Dr0ptp4kt) [19:57:48] (03CR) 10Ottomata: [C: 031] add bmansurov to researchers admin group [puppet] - 10https://gerrit.wikimedia.org/r/176701 (owner: 10Dzahn) [19:58:37] (03CR) 10Dzahn: [C: 032] "has manager approval on ticket as well and old enough" [puppet] - 10https://gerrit.wikimedia.org/r/176701 (owner: 10Dzahn) [20:00:38] !log yurik Synchronized mobilelanding.php: https://gerrit.wikimedia.org/r/#/c/175797/ (duration: 00m 06s) [20:00:40] Logged the message, Master [20:04:49] !log yurik Synchronized php-1.25wmf9/extensions/ZeroBanner/: updatidng ZeroBanner to master (duration: 00m 08s) [20:04:52] Logged the message, Master [20:05:20] !log yurik Synchronized php-1.25wmf9/extensions/ZeroPortal/: updatidng ZeroPortal to master (duration: 00m 05s) [20:05:22] Logged the message, Master [20:06:33] !log yurik Synchronized php-1.25wmf10/extensions/ZeroBanner/: updatidng ZeroBanner to master (duration: 00m 06s) [20:06:36] Logged the message, Master [20:08:47] !log yurik Synchronized php-1.25wmf10/extensions/ZeroPortal/: updatidng ZeroPortal to master (duration: 00m 05s) [20:08:50] Logged the message, Master [20:23:15] hey greg-g / roblaAWAY : who is (on a technical production level lets say) responsible for tmh hosts? Wanna take a stab at an RT ticket about group access on these hosts but need a basic idea of the privs, hosts and members needed in it :) [20:23:24] cscott ^^ [20:23:55] show me the ticket and I'll take a gander [20:24:05] https://rt.wikimedia.org/Ticket/Display.html?id=8480 [20:26:02] * greg-g looks, but is also in an oh so fun Open Enrollment (medical benefits) meeting [20:29:33] JohnFLewis: ah, tmh. [20:29:43] yeah :p [20:30:17] JohnFLewis, greg-g: i'm only embroiled in this because I tried to do a mediawiki deploy as a newish account, and found that the tmh hosts only granted access to old farts. [20:30:22] but i think that's been fixed now. [20:30:58] Reedy: hey, can I bug you about https://phabricator.wikimedia.org/T76061 ? [20:31:26] cscott: for you or is there an actual solution in place for all? [20:31:29] I'm about to workaround with a manual deployment, but wondering if there's an ETA on the cache working correctly? 
[20:31:38] I know there was an AR for you to get tmh access [20:33:40] cscott: oh I see; inclusion of the deployment group. [20:33:55] JohnFLewis: I think what is needed is the same basic level of access that is needed for all MW deploy target hosts. That ticket came up from a scap failure. [20:35:30] bd808: that was my original thought. [20:37:39] Is anyone available to work on LocalisationUpdate fail? https://phabricator.wikimedia.org/T76061 [20:39:03] (03CR) 10Yuvipanda: [C: 031] Use hiera to configure udp2log endpoint for ::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/176191 (owner: 10BryanDavis) [20:40:39] bd808: zcat'ing l10nupdate logs is like a replay of scap [20:40:41] it's amusing [20:41:10] Reedy: thanks :) [20:41:47] Reedy: heh. The control chars end up in there I guess [20:42:24] but no l10nupdate due ot Permission denied (publickey). [20:43:58] bd808: Did you say you'd fixed something like that already? [20:44:23] (03CR) 10Reedy: [C: 031] apachesync - delete sync-apache script [puppet] - 10https://gerrit.wikimedia.org/r/175884 (owner: 10Dzahn) [20:44:41] Reedy: I fixed the paths for the commands on the target hosts. Similar but different [20:45:17] I guess that presumably means it's not loading the key into the agent [20:45:41] Or that those hosts are missing the authorized_keys part? [20:45:49] It worked for a while didn't it? [20:46:14] it's 100% fail [20:46:22] JohnFLewis, bd808: yes, it was a scap failure, and I believe the RT ticket to "grant me access" was resolved by granting the appropriate puppet group access (not me specifically) although I don't remember the exact details right now. [20:47:21] cscott: looking at site.pp; deployment has access so I assume that is what the AR ticket resulted in (right mutante?) [20:48:05] bd808: Not sure. though, SAL lies as it says it succeeded [20:48:13] I guess it does, as it doesn't count for sync-dir [20:48:29] JohnFLewis: https://gerrit.wikimedia.org/r/166109 is what the RT ticket resulted in. [20:49:05] Reedy: The authorized_keys file on mw1201 for l10nupdate looks right so I'd start poking from tin [20:49:37] Reedy: It may be the new shared agent stuff on tin getting in the way somehow [20:49:40] run l10nupdate manually with a verbose flag? [20:50:07] well, really, we need the verbose at the dsh call I guess [20:50:14] cscott: so now it is the matter of whether we want a tmh-admin group for these hosts or not [20:51:16] JohnFLewis: well, from my perspective I don't care about tmh at all. i was just following the steps in https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Deployment_requirements [20:52:08] Reedy: Bah. It's scap [20:52:19] :( [20:52:49] JohnFLewis: actually, i guess I was following https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Deployment_Requirements [20:53:06] Reedy: This is clobbering SSH_AUTH_SOCK -- https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap/cli.py#L171-L175 [20:53:49] Reedy: So that needs a patch to only replace the auth sock if the shared one is present and readable I think [20:56:56] bd808: os.path.isfile? [20:57:12] Reedy: The other way to fix it would be to change the permissions on the shared auth socket so that l10nupdate can read from it. In the long term that would be even better. [20:58:53] Reedy: os.path.exists. It could be a symlink in theory [21:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141201T2100). Please do the needful. 
[21:00:28] Reedy: Or more pythonically, open the file for reading and only change the env is that succeeds [21:00:33] if auth_sock is not None and os.path.exists(auth_sock): [21:00:40] heh [21:05:12] bd808: https://docs.python.org/release/2.6.6/library/os.html#os.access ? [21:06:55] Reedy: I think maybe just `with open(auth_sock, 'r'):` instead of the current test. [21:07:09] That might raise an exception ... [21:09:30] yeah, `with open()` will still raise the exception, so you need to wrap that in a try/except still [21:10:15] so really just a try: open(); setenv; except: ignore sorto of thing is needed [21:10:36] s/sorto of/sort of/ [21:12:56] jouncebot, greg-g, cmjohnson: parsoid's skipping deploy today, but i'm doing an ocg deploy during this window. [21:23:19] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Puppet has 1 failures [21:28:11] (03PS5) 10BryanDavis: logstash: Forward syslog events for apache2 + hhvm [puppet] - 10https://gerrit.wikimedia.org/r/176693 [21:31:22] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: Puppet has 1 failures [21:37:41] RECOVERY - puppet last run on amssq37 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:47:05] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:48:50] (03PS1) 10Ottomata: Include new class misc::statistics::packages::utilities on stat1002 and stat1003 [puppet] - 10https://gerrit.wikimedia.org/r/176800 [21:51:01] (03CR) 10BBlack: [C: 031] Change ru.wikinews.org to HTTPS only. [puppet] - 10https://gerrit.wikimedia.org/r/173078 (owner: 10JanZerebecki) [21:52:25] (03CR) 10Ottomata: [C: 032] Include new class misc::statistics::packages::utilities on stat1002 and stat1003 [puppet] - 10https://gerrit.wikimedia.org/r/176800 (owner: 10Ottomata) [21:58:01] (03CR) 10BryanDavis: "Testing in beta." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/176693 (owner: 10BryanDavis) [22:00:05] awight: Respected human, time to deploy WMF Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141201T2200). Please do the needful. [22:08:33] (03PS1) 10Vogone: Add new namespaces to dewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176808 [22:15:32] !log awight Synchronized php-1.25wmf9/extensions/DonationInterface: push DonationInterface translations (duration: 00m 07s) [22:15:34] Logged the message, Master [22:15:47] !log awight Synchronized php-1.25wmf10/extensions/DonationInterface: push DonationInterface translations (duration: 00m 09s) [22:15:49] Logged the message, Master [22:15:57] Reedy_: was https://rt.wikimedia.org/Ticket/Display.html?id=8334 done ? [22:16:13] !log awight Synchronized php-1.25wmf9/extensions/LandingCheck: push LandingCheck GeoIP fix (duration: 00m 06s) [22:16:16] Logged the message, Master [22:16:23] !log awight Synchronized php-1.25wmf10/extensions/LandingCheck: push LandingCheck GeoIP fix (duration: 00m 05s) [22:16:24] Logged the message, Master [22:16:44] !log awight Synchronized php-1.25wmf9/extensions/FundraiserLandingPage: push FundraiserLandingPage GeoIP fix (duration: 00m 06s) [22:16:46] Logged the message, Master [22:16:54] !log awight Synchronized php-1.25wmf10/extensions/FundraiserLandingPage: push FundraiserLandingPage GeoIP fix (duration: 00m 06s) [22:16:56] Logged the message, Master [22:17:03] Coren: can haz 1.5T on labs storage ? [22:17:46] mutante: Do you need it separate from /data/project? Otherwise, 1.5T isn't so much we'd hurt. 
[22:17:58] there is /media/wikimania2014 on terbium [22:18:02] /data/project works [22:18:04] it is 1.5T and has mp4 videos [22:18:09] matanya asks if we can copy it to labs [22:18:14] because then people can convert them there [22:18:28] "people" :) [22:18:42] heh, yea. to more useful format [22:19:07] unless Reedy_ already did it, and then just dup work [22:20:00] i also see .mov [22:20:25] @terbium:/media/wikimania2014/Barbican Hall# ls [22:20:25] ls: reading directory .: Input/output error [22:20:39] mutante: Should be straightforward enough - but avoid NFS for the copy. You should place it directly in /srv/project/tools/project/ [22:20:53] lol, again [22:21:06] Coren: ok, thanks! [22:21:43] Education II - Reform.mp4: ISO Media, MPEG v4 system, version 1 [22:21:46] Education I - Medicine.mov: ISO Media, Apple QuickTime movie [22:22:32] we should just have !xddc :) [22:22:50] xdcc [22:23:28] matanya: are we making a new project for videos? [22:23:38] already have one [22:23:51] what's the name [22:23:54] * matanya doesn't remind Coren again of his needs [22:24:02] mutante: video [22:26:54] matanya: What needs? [22:27:23] (03PS1) 10Milimetric: Add cron job that generates flow statistics [puppet] - 10https://gerrit.wikimedia.org/r/176810 [22:27:24] Coren: moar cpu for encoding [22:27:50] you said you are getting more hardware [22:29:07] maybe we should just setup Apache on terbium to let you download these files via http [22:29:33] download via http to labs ? [22:29:51] (03CR) 10Ottomata: [C: 032] Add cron job that generates flow statistics [puppet] - 10https://gerrit.wikimedia.org/r/176810 (owner: 10Milimetric) [22:29:53] yea [22:30:01] terbium also has people.wm and noc.wm [22:30:03] works for me [22:30:06] which are public webservers [22:30:10] matanya: We gots new CPUs but it'll take a while before it's available. :-) [22:30:19] just this path where the videos are is not a docroot [22:30:28] Coren: i'll wait :) [22:30:42] i dont think i should be able to scp straight from terbium into labs [22:32:53] !log awight Synchronized php-1.25wmf9/extensions/DonationInterface: push DonationInterface translations (duration: 00m 07s) [22:32:59] !log awight Synchronized php-1.25wmf10/extensions/DonationInterface: push DonationInterface translations (duration: 00m 06s) [22:33:01] Logged the message, Master [22:33:05] !log awight Synchronized php-1.25wmf9/extensions/LandingCheck: push LandingCheck GeoIP fix (duration: 00m 06s) [22:33:12] !log awight Synchronized php-1.25wmf10/extensions/LandingCheck: push LandingCheck GeoIP fix (duration: 00m 06s) [22:33:15] Logged the message, Master [22:33:18] !log awight Synchronized php-1.25wmf9/extensions/FundraiserLandingPage: push FundraiserLandingPage GeoIP fix (duration: 00m 06s) [22:33:25] !log awight Synchronized php-1.25wmf10/extensions/FundraiserLandingPage: push FundraiserLandingPage GeoIP fix (duration: 00m 06s) [22:34:41] !log updated OCG to version a06e7c186796a6ee5d5af81e93688520abdf2596 [22:34:43] Logged the message, Master [22:36:53] <^demon|pumpkinpi> I'm not getting css/js from bits. [22:37:47] <_joe_> ^demon|pumpkinpi: can you elaborate on that? [22:38:18] !log rescaled ipvs weights for text/mobile/upload/bits to 1 (there was no differential weighting), for better sh scheduler [22:38:22] Logged the message, Master [22:38:41] <^demon|pumpkinpi> _joe_: Getting unstyled pages on enwiki. [22:38:42] and I hope that's not causing css/js from bits issue above! [22:38:48] <^demon|pumpkinpi> Very slow, generally. [22:39:17] very slowly? 
[22:39:41] <_joe_> ^demon|pumpkinpi: are they timing out, giving 500, 404, what HTTP code? [22:40:16] <^demon|pumpkinpi> Hmm, back now. [22:40:35] <_joe_> maybe some local network issue? [22:40:41] <^demon|pumpkinpi> Could be. [22:40:47] Reedy_: Nemo_bis: is it possible that the l10n cache / LocalisationUpdates is actually obscuring newer translations provided in extension message files? [22:40:48] <_joe_> we don't have recorded any 503s AFAICS [22:42:15] <^demon|pumpkinpi> _joe_: RESOLVED DUPLICATE -> {T1234: Comcast Sucks} [22:42:22] <_joe_> eheh [22:42:54] <_joe_> I was so distracted by chasing issues that I forgot the hangout opened [22:42:59] <_joe_> :) [22:43:14] * _joe_ off [22:44:32] I can sure test https://gerrit.wikimedia.org/r/#/c/176750/, at least by observing effects such as l10update starting to work or not :) -- anyone up to deploy it? [22:45:29] (03CR) 10Aaron Schulz: [C: 032] Added some redis queue comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176706 (owner: 10Aaron Schulz) [22:45:44] (03Merged) 10jenkins-bot: Added some redis queue comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/176706 (owner: 10Aaron Schulz) [22:48:31] from enwiki: Database error A database query error has occurred. This may indicate a bug in the software. Function: AbuseFilterViewTestBatch::doTest Error: 2013 Lost connection to MySQL server during query (10.64.48.20) [22:48:35] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Puppet has 1 failures [22:49:40] jackmcbarn, seems to be an one-off [22:49:58] MaxSem: that's the second time i tried it. first time was a 504 [22:50:03] i'll try it again [22:51:07] !log aaron Synchronized wmf-config/jobqueue-eqiad.php: b13eaa3f6e287e7268951a2f7e3798f994a20b28; comment tweaks (duration: 00m 05s) [22:51:12] Logged the message, Master [22:52:14] _joe_: do you think you could look at https://gerrit.wikimedia.org/r/176202 for me? (apparmor fixes for OCG) [22:52:40] <_joe_> cscott-split: it's midnight here :) [22:54:01] _joe_: excuses, excuses ;) [22:57:32] (03PS1) 10Jgreen: make icinga contactgroup 'fundraising' send to fr-tech@ [puppet] - 10https://gerrit.wikimedia.org/r/176823 [23:00:43] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:12:24] (03CR) 10Jgreen: [C: 032 V: 031] make icinga contactgroup 'fundraising' send to fr-tech@ [puppet] - 10https://gerrit.wikimedia.org/r/176823 (owner: 10Jgreen) [23:12:39] !log terbium - running rsync in screen to copy wikimania videos to labstore1001 [23:12:42] Logged the message, Master [23:27:57] paravoid: ah! [23:28:06] i finally figured out the webrequest warning in icinga. [23:28:20] nsca-client 2.9.1 in trusty is broekn [23:28:27] at least when working with older nsca server [23:28:48] only one of the worker nodes hasn't been upgraded yet [23:28:58] which is why the check sometimes succeeds [23:29:04] if the job happens to run on the old worker node [23:29:17] it will successfully send the passive check to nsca on neon [23:29:18] otherwise: [23:29:28] nsca[24056]: Dropping packet with invalid CRC32 [23:29:36] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=670373 [23:29:56] Jeff_Green: ^ take note [23:30:08] since I'm pretty sure you use passive checks and nsca for frack stuff [23:30:36] ncsa is not a great idea in general [23:30:50] weren't many choices, really, we thought hard about that one. 
[23:31:11] the job is regularly scheudled, but there's no guaruntee about when it will run [23:31:20] we needed a passive check, and we needed it to be remote [23:32:03] pssh, maybe we shouldn't use icinga for these after all :/ qchris doesn't really look at these [23:32:14] i think anyway [23:32:19] pffffff [23:33:32] paravoid: hm, what should I do? try to upgrade neon? that seems like it might cause more headaches than I am looking for [23:34:02] is it swat time yet? [23:34:21] yeah, no i think that would just break any 2.7.2 nsca-clients out there [23:34:50] ottomata: interesting [23:35:01] paravoid: why not nsca? [23:36:04] it's a bit abandoned afaik [23:36:04] anyway [23:36:09] I'm gonna go crash :) [23:36:33] k, laters [23:36:33] have a good evening! [23:36:45] 1:30am *cough* [23:36:46] "evening" [23:36:50] :) [23:37:02] ha [23:37:07] * YuviPanda has switched to a SFish timezone already, I think [23:37:10] I wasn't thinking in clock terms [23:37:12] ottomata: graphite + check_graphite? :) [23:37:26] paravoid: join the club [23:37:30] doesn't afternoon start at like 8PM there? [23:37:31] its a boolean check, really YuviPanda [23:37:36] how would I do that in graphite? [23:37:42] ottomata: 0, 1! :) [23:37:45] ha [23:37:46] well [23:37:53] although yes, it's somewhat difficult to do properly. [23:37:56] without abusing graphite [23:37:56] it needs to be passive [23:37:59] that's the real problem [23:38:15] hmm, right. [23:38:17] i want to be alerted IF there hasn't been a good partition added to hive in 1.5 hours [23:38:39] usually that would mean either nsca or snmptrap [23:38:40] so for puppet staleness I track 'time since last run' [23:38:58] there is a hadoop job that checks this, and then uses send_nsca to alert icinga that all is well [23:39:18] YuviPanda: is that via nrpe? [23:39:18] right, so one thing might be to track 'time since last good partition' as a metric in graphite. [23:39:23] just have the hadoop check send to graphite [23:39:37] hm [23:39:39] and then check_graphite runs on icinga, queries graphite for metric, and alerts. [23:40:45] no NRPE [23:40:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [23:41:22] that's not me, is it? [23:41:39] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [23:41:47] who knows at what point virt1000 is ready to be included in scap [23:41:58] can i just add it to dsh groups already? [23:42:02] hm, aye, YuviPanda, spoose that could work [23:42:17] mutante: andrewbogott would know, IIRC it was an explicit decision. [23:42:24] ottomata: that's how all checks in labs happen atm. [23:42:53] mutante: I had some security concerns about giving lots of people login on virt1000 [23:43:05] But I think that was overruled. So it's probably fine to turn it on. [23:43:08] I'd expect it to just work [23:43:23] my specific change is just this https://gerrit.wikimedia.org/r/#/c/175889/ [23:43:30] but that will add it to scap hosts [23:43:37] so next deploy..it would be in [23:43:59] https://phabricator.wikimedia.org/T70751 [23:44:45] YuviPanda: does shinkin maybe have a better way of doing this? [23:44:47] I think we can add that patch during SWAT and then watch it run? [23:44:56] eh, wow, i see this quote "Wikitech is now running from a MediaWiki version that is synced from tin and using production configuration system." [23:45:01] from Sep 4 ? 
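[editor's note] A minimal sketch of the graphite-based alternative YuviPanda floats above: instead of send_nsca, the job that validates Hive partitions pushes a "seconds since the last good partition" gauge to graphite over the plaintext protocol, and icinga alerts via check_graphite once it crosses the 1.5-hour threshold — sidestepping the broken trusty nsca-client entirely. The metric name, hostname, and port below are illustrative assumptions, not production values:

    import socket
    import time

    # Illustrative endpoint and metric namespace; the real values would
    # come from puppet, not from this sketch.
    GRAPHITE_HOST = 'graphite.example.org'
    GRAPHITE_PORT = 2003  # plaintext ("line") protocol port
    METRIC = 'analytics.webrequest.seconds_since_last_good_partition'

    def report_partition_lag(last_good_partition_ts):
        """Push the age of the newest good partition as a graphite gauge."""
        now = int(time.time())
        lag = now - int(last_good_partition_ts)
        line = '%s %d %d\n' % (METRIC, lag, now)
        conn = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT),
                                        timeout=10)
        try:
            conn.sendall(line.encode('ascii'))
        finally:
            conn.close()

On the icinga side, a check_graphite service over that metric with a critical threshold around 5400 seconds (1.5 hours) gives the same "alert if no good partition lately" semantics as the passive nsca check, but as an active check that does not depend on nsca packet compatibility.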
[23:45:09] how did it without being in dsh ? [23:45:15] OO i gotta go! [23:45:39] ottomata: sadly nope. [23:46:11] laters all! [23:47:02] mutante: I run local syncs on virt1000 [23:47:04] periodically [23:47:14] andrewbogott: ok, thanks, so i see greg-g already put it on a "workboard". that sounds good [23:47:23] andrewbogott: gotcha! [23:47:50] mutante: check out the email thread 'wikitech vs. deployment' for context [23:47:50] jouncebot: next [23:47:51] In 0 hour(s) and 12 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141202T0000) [23:51:39] Jeff_Green: you forgot to merge on master [23:51:49] fr contact group