[00:04:24] New patchset: Lcarr; "adding in firewall config builder information" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6055 [00:04:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6055 [00:17:11] !log db12 is sooooo sloooooow, starting innobackupex from db1017 to db60 for new s1 slave [00:17:13] Logged the message, notpeter [00:19:11] PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: CRIT replication delay 235 seconds [00:19:38] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CRIT replication delay 261 seconds [00:21:32] New review: Lcarr; "incremental change, still need to try putting this on a few servers and seeing how it performs" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6055 [00:21:35] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6055 [00:33:19] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay 0 seconds [00:33:28] RECOVERY - MySQL Slave Delay on db1022 is OK: OK replication delay 0 seconds [00:44:23] /var/log/mw/fatal.log is 16G on fenari [00:44:29] Can we remove it and start again? [00:44:46] (ignoring the fact it should probably be logrotated) [00:46:02] remove/rename/whatever [00:46:33] It's owned by nobody/root so will need someone to do it please! :) [00:46:46] done Reedy [00:46:56] Thanks! 
[00:46:59] ok, on that note, heading off [00:47:06] Have a good weekend [00:47:10] thanks, you too [00:52:58] PROBLEM - Host mw38 is DOWN: PING CRITICAL - Packet loss = 100% [00:54:10] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:10] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:10] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:10] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:10] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:19] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:19] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:19] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:19] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.38:11000 (Connection timed out) [00:54:28] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:28] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:28] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:55] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:04] PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:04] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:04] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:13] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:13] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:13] PROBLEM - Apache HTTP on 
mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:13] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:22] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:22] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:22] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:22] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:22] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:23] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:23] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:24] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:24] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:31] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:31] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:31] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:31] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:31] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:40] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:40] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:40] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.700 second response time [00:56:52] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.820 second response time [00:57:06] wtf? 
[00:57:46] RECOVERY - Apache HTTP on mw4 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.971 second response time [00:57:46] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.087 second response time [00:58:04] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.032 second response time [00:58:13] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.629 second response time [00:58:13] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.104 second response time [00:58:22] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.875 second response time [00:58:31] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.805 second response time [00:59:07] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [00:59:16] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.548 second response time [00:59:35] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:59:35] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.417 second response time [01:01:04] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.860 second response time [01:01:04] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.922 second response time [01:01:22] RECOVERY - Apache HTTP on mw13 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.162 second response time [01:01:22] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.072 second response time [01:01:22] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.519 second response time [01:02:34] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.589 second response time 
[01:02:43] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.910 second response time [01:02:52] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.784 second response time [01:03:37] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.208 second response time [01:03:55] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.217 second response time [01:04:13] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.732 second response time [01:06:01] RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.246 second response time [01:06:03] New patchset: Reedy; "RT #1424 (Set up log rotation for wmerrors log)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6061 [01:06:18] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/6061 [01:06:37] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.810 second response time [01:06:46] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [01:07:04] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [01:07:04] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.239 second response time [01:07:04] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time [01:07:04] RECOVERY - Apache HTTP on mw3 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [01:07:04] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [01:07:05] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.101 second response time [01:07:31] RECOVERY - Apache HTTP on mw15 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [01:07:31] RECOVERY - Apache HTTP on mw16 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.018 second response time [01:07:31] RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.963 second response time [01:07:49] RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.878 second response time [01:09:24] New patchset: Reedy; "RT #1424 (Set up log rotation for wmerrors log)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6061 [01:09:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6061 [01:19:22] RECOVERY - Host mw38 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [01:20:16] PROBLEM - Host mw29 is DOWN: PING CRITICAL - Packet loss = 100% [01:21:41] !log powercycled mw38 [01:21:48] Logged the message, Master [01:21:53] and still trying to figure out what the hell is going on [01:22:40] PROBLEM - Apache HTTP on mw38 is CRITICAL: Connection refused [01:27:19] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [01:33:00] !log powercycled mw29 [01:33:03] Logged the message, Master [01:33:06] *sigh* [01:35:11] RECOVERY - Host mw29 is UP: PING OK - Packet loss = 0%, RTA = 1.81 ms [01:36:32] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [01:37:08] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:38] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [01:38:56] PROBLEM - Apache HTTP on mw29 is CRITICAL: Connection refused [01:41:38] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [01:42:41] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 213 seconds [01:49:44] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 3 seconds [01:57:14] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [01:57:32] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours [02:00:23] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [02:24:09] !log all mediawiki boxes that have uptimes affected by the bug are being rebooted at 8 minute intervals [02:24:13] Logged the message, Master [02:31:08] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.211:11000 (Connection refused) [02:32:29] RECOVERY - 
check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:37:44] PROBLEM - Host srv198 is DOWN: PING CRITICAL - Packet loss = 100% [02:39:59] RECOVERY - Host srv198 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [02:42:59] PROBLEM - Apache HTTP on srv198 is CRITICAL: Connection refused [02:46:53] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.203:11000 (Connection refused) [02:48:23] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:50:02] PROBLEM - Apache HTTP on srv203 is CRITICAL: Connection refused [02:54:23] PROBLEM - Host srv195 is DOWN: PING CRITICAL - Packet loss = 100% [02:55:35] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [02:55:53] RECOVERY - Host srv195 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [02:59:20] PROBLEM - Apache HTTP on srv195 is CRITICAL: Connection refused [03:06:50] PROBLEM - Apache HTTP on srv190 is CRITICAL: Connection refused [03:07:17] RECOVERY - Apache HTTP on srv198 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.012 second response time [03:10:35] PROBLEM - Host srv196 is DOWN: PING CRITICAL - Packet loss = 100% [03:11:47] RECOVERY - Host srv196 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [03:14:56] PROBLEM - Apache HTTP on srv196 is CRITICAL: Connection refused [03:21:50] PROBLEM - Apache HTTP on srv205 is CRITICAL: Connection refused [03:26:20] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [03:26:47] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [03:30:14] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.653 second response time [03:30:32] PROBLEM - Apache HTTP on srv240 is CRITICAL: Connection refused [03:36:33] PROBLEM - NTP on srv204 is CRITICAL: NTP CRITICAL: 
Offset unknown [03:37:18] RECOVERY - Apache HTTP on srv196 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [03:37:18] PROBLEM - Apache HTTP on srv204 is CRITICAL: Connection refused [03:39:24] RECOVERY - NTP on srv204 is OK: NTP OK: Offset 0.01462209225 secs [03:39:24] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [03:42:42] PROBLEM - Host srv202 is DOWN: PING CRITICAL - Packet loss = 100% [03:44:21] RECOVERY - Host srv202 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [03:44:48] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.23:11000 (timeout) [03:47:21] PROBLEM - Apache HTTP on srv202 is CRITICAL: Connection refused [03:47:30] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [03:47:48] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [03:52:09] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.47:11000 (timeout) [03:53:39] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [03:54:06] PROBLEM - Apache HTTP on srv209 is CRITICAL: Connection refused [03:58:36] PROBLEM - Host srv201 is DOWN: PING CRITICAL - Packet loss = 100% [04:00:24] RECOVERY - Host srv201 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [04:04:18] PROBLEM - Apache HTTP on srv201 is CRITICAL: Connection refused [04:05:21] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [04:06:42] PROBLEM - Host srv197 is DOWN: PING CRITICAL - Packet loss = 100% [04:07:00] RECOVERY - Apache HTTP on srv201 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [04:07:54] RECOVERY - mysqld processes on db60 is OK: PROCS OK: 1 process with command name mysqld [04:08:12] RECOVERY - Apache HTTP on 
srv202 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [04:08:21] RECOVERY - Host srv197 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [04:12:06] PROBLEM - Apache HTTP on srv197 is CRITICAL: Connection refused [04:13:54] PROBLEM - Host srv229 is DOWN: PING CRITICAL - Packet loss = 100% [04:14:21] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 2877 seconds [04:14:30] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 2852 seconds [04:16:00] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.34:11000 (timeout) [04:16:18] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [04:16:18] RECOVERY - Host srv229 is UP: PING OK - Packet loss = 0%, RTA = 1.97 ms [04:17:30] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [04:19:18] PROBLEM - Apache HTTP on srv229 is CRITICAL: Connection refused [04:29:45] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [04:30:12] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 1 seconds [04:40:06] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [08:36:31] PROBLEM - Host mw32 is DOWN: PING CRITICAL - Packet loss = 100% [08:38:01] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.32:11000 (Connection timed out) [08:38:55] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:39:04] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:41:55] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:42:40] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:44:46] RECOVERY - Apache HTTP on mw52 is OK: HTTP 
OK - HTTP/1.1 301 Moved Permanently - 2.383 second response time [08:44:55] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:44:55] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.482 second response time [08:45:04] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:45:31] RECOVERY - Apache HTTP on mw15 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.933 second response time [08:46:16] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.138 second response time [08:46:25] RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.162 second response time [08:49:16] RECOVERY - Apache HTTP on mw16 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.518 second response time [08:53:01] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [10:55:54] PROBLEM - Host mw46 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:56] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours [12:18:26] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:37:18] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:14:26] PROBLEM - Host mw33 is DOWN: PING CRITICAL - Packet loss = 100% [17:43:46] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:55] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:04] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:07] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.164 second response time [17:45:16] RECOVERY - Apache HTTP on mw15 is OK: HTTP OK - HTTP/1.1 301 
Moved Permanently - 0.026 second response time [17:45:16] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [18:34:49] PROBLEM - Host mw44 is DOWN: PING CRITICAL - Packet loss = 100% [18:53:52] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [20:43:30] PROBLEM - Host mw30 is DOWN: PING CRITICAL - Packet loss = 100% [20:45:09] PROBLEM - Host mw11 is DOWN: PING CRITICAL - Packet loss = 100% [20:47:51] i'm getting 503's on the mobile site for anything not in varnish [21:00:00] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 190 seconds [21:00:18] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 192 seconds [21:00:45] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 205 seconds [21:00:48] it means it is broken! [21:00:54] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 214 seconds [21:08:51] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 10 seconds [21:09:27] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [21:11:06] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [21:11:24] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [21:59:50] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours [22:44:05] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiktionary (24926) [22:44:50] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiktionary (24763) [22:53:54] !log Job queue logs on gdash seem to have stopped on the 26th... [22:53:56] Logged the message, Master [22:55:59] Or logging of everything it seems