[00:02:53] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[00:03:47] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[00:11:35] PROBLEM - Puppet freshness on srv245 is CRITICAL: Puppet has not run in the last 10 hours
[00:32:35] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[00:34:23] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[00:35:17] RECOVERY - MySQL disk space on neon is OK: DISK OK
[00:43:45] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[00:46:05] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[01:10:32] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:57:23] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 196 seconds
[01:58:25] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 239 seconds
[01:58:26] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 218 seconds
[01:58:52] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 232 seconds
[02:02:10] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 1 seconds
[02:02:28] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 1 seconds
[02:12:49] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 189 seconds
[02:13:34] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 197 seconds
[02:15:22] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (315333), Total (318470)
[02:17:28] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (306538), Total (313663)
[02:25:07] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[02:29:30] !log LocalisationUpdate completed (1.21wmf10) at Sun Feb 24 02:29:29 UTC 2013
[02:29:34] Logged the message, Master
[02:34:25] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds
[02:35:10] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[02:54:31] !log LocalisationUpdate completed (1.21wmf9) at Sun Feb 24 02:54:30 UTC 2013
[02:54:33] Logged the message, Master
[03:50:31] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 184 seconds
[03:56:13] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[03:58:46] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours
[03:59:49] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours
[04:21:43] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours
[04:27:43] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours
[04:38:49] PROBLEM - Puppet freshness on lardner is CRITICAL: Puppet has not run in the last 10 hours
[04:40:11] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 189 seconds
[04:40:37] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 187 seconds
[04:41:58] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[04:42:25] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
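The check_job_queue alerts above trip whenever any single wiki has more than 9,999 queued jobs. A minimal sketch of that kind of threshold check is below; the 9,999 cutoff and the message wording come from the alert text, while the function name and the idea of passing per-wiki counts in directly are illustrative assumptions, not the actual plugin running on neon/spence.

```python
# Illustrative sketch of a job-queue threshold check in the spirit of the
# check_job_queue alerts above; not the actual Nagios plugin. The per-wiki
# counts would normally come from MediaWiki; here they are passed in directly.

def check_job_queues(queue_sizes, threshold=9999):
    """Return (is_critical, message) for a dict of wiki -> job count."""
    offenders = {wiki: n for wiki, n in queue_sizes.items() if n > threshold}
    total = sum(queue_sizes.values())
    if offenders:
        detail = ", ".join(f"{wiki} ({n})" for wiki, n in sorted(offenders.items()))
        return True, (f"JOBQUEUE CRITICAL - the following wikis have more than "
                      f"{threshold:,} jobs: {detail}, Total ({total})")
    return False, "JOBQUEUE OK - all job queues below 10,000"

# Example mirroring the 02:15 alert (second wiki count is made up to fill the total):
print(check_job_queues({"commonswiki": 315333, "enwiki": 3137})[1])
```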
[04:47:49] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours
[04:47:49] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours
[04:55:28] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[04:55:55] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[05:10:46] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:49] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 207 seconds
[05:23:58] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 181 seconds
[05:24:07] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 219 seconds
[05:25:46] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 1 seconds
[05:32:17] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:35:17] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 189 seconds
[05:35:45] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.033 second response time on port 8123
[05:35:55] 9 seconds...
[05:36:02] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 195 seconds
[05:36:09] that's higher than past recoveries
[05:36:29] and is just below the imperical threshold for the intermittent search problems bug
[05:38:02] errrr, empirical
[05:38:03] *
[05:38:08] * jeremyb_ is sleepy?
[05:39:38] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[05:39:50] hrm
[05:40:14] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds
[05:40:14] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 185 seconds
[05:41:08] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.025 second response time on port 8123
[05:49:41] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:52:03] apergos: ^
[05:52:44] (twice in 15 mins and also the search1016 flap too)
[05:56:44] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.023 second response time on port 8123
[06:01:05] RECOVERY - MySQL disk space on neon is OK: DISK OK
[06:03:02] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[06:09:38] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 6 seconds
[06:10:50] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[06:11:45] pool4 (probably search1016) definitely needs a kick
[06:14:17] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[06:15:22] wooo
[06:21:47] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123
[06:22:41] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[06:24:12] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[06:28:55] !log restarted search1016 lucene
[06:28:57] Logged the message, Master
[06:30:49] good morning
[06:31:06] mostly not here of course (sunday am)
[06:32:25] yah
[06:34:21] * jeremyb_ runs away
[06:36:10] (all 3 of my test queries on commons that were returning empty every time and slowly now work fairly fast and not empty.)
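The "LVS Lucene" and "Lucene" checks above report plain TCP response times to port 8123, and the diagnosis here hinges on how long the connection took (9 seconds vs. 0.027). Below is a rough sketch of such a probe; the host and port are taken from the log, while the timeout and the 5-second warning threshold are assumptions for illustration, not the real check configuration.

```python
# Rough sketch of a TCP probe in the spirit of the "LVS Lucene" checks above:
# time how long a connection to the search port takes and flag slow or failed
# attempts. Thresholds are illustrative assumptions, not the real check_tcp setup.
import socket
import time

def probe_tcp(host, port=8123, timeout=10.0, warn_after=5.0):
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        return "CRITICAL", f"Connection failed: {exc}"
    state = "WARNING" if elapsed > warn_after else "OK"
    return state, f"TCP {state} - {elapsed:.3f} second response time on port {port}"

if __name__ == "__main__":
    print(*probe_tcp("search-pool4.svc.eqiad.wmnet"))
```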
[06:40:32] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:51:29] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.25 ms
[06:53:17] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[06:55:14] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds
[06:55:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds
[07:09:38] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[07:11:53] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[07:14:10] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[07:14:44] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[07:17:27] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[07:18:29] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[07:49:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:55:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.066 seconds
[07:56:53] PROBLEM - Puppet freshness on mw64 is CRITICAL: Puppet has not run in the last 10 hours
[07:56:53] PROBLEM - Puppet freshness on mw1039 is CRITICAL: Puppet has not run in the last 10 hours
[08:02:26] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds
[08:03:56] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 239 seconds
[08:24:29] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (487561), Total (488689)
[08:25:05] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (486030), Total (487924)
[08:26:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.886 seconds
[08:39:56] PROBLEM - Puppet freshness on db1009 is CRITICAL: Puppet has not run in the last 10 hours
[08:41:53] PROBLEM - Puppet freshness on mw1134 is CRITICAL: Puppet has not run in the last 10 hours
[09:18:28] RECOVERY - Puppet freshness on srv245 is OK: puppet ran at Sun Feb 24 09:18:13 UTC 2013
[09:23:25] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[09:24:01] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[09:54:37] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[09:54:37] RECOVERY - MySQL disk space on neon is OK: DISK OK
[10:08:43] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds
[10:09:37] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 218 seconds
[10:34:16] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[10:36:40] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[10:37:07] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[11:11:19] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:22:55] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds
[11:23:23] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds
[11:23:49] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:35:40] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.31 ms
[11:36:07] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[11:36:16] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[12:26:31] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[12:29:49] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 193 seconds
[12:30:24] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 221 seconds
[12:37:34] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds
[12:38:01] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds
[13:26:13] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:26:23] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:30:07] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[13:32:04] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds
[14:00:34] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours
[14:01:28] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours
[14:18:08] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[14:21:25] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[14:29:45] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours
[14:30:21] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 197 seconds
[14:30:48] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 207 seconds
[14:34:24] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[14:34:44] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50467
[14:35:45] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 22 seconds
[14:39:57] PROBLEM - Puppet freshness on lardner is CRITICAL: Puppet has not run in the last 10 hours
[14:48:57] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours
[14:48:57] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours
[15:08:27] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:08:54] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:03] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:15] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:17:09] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[15:17:45] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[15:19:15] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.055 seconds
[15:20:36] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.50 ms
[15:26:36] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[15:28:42] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
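The many "Puppet freshness" alerts above all apply the same rule: flag a host whose last Puppet run is more than 10 hours old. A minimal sketch of that staleness comparison follows; the 10-hour threshold and message text come from the alerts, while how the last-run timestamp is obtained is assumed for illustration (the real check is fed by Puppet report data).

```python
# Minimal sketch of a "Puppet freshness"-style check: compare the age of the
# last Puppet run against a 10-hour threshold. The threshold is the one quoted
# in the alerts above; where the timestamp comes from is an assumption.
import time

TEN_HOURS = 10 * 3600

def puppet_freshness(last_run_epoch, max_age=TEN_HOURS, now=None):
    """Return (status, message) given the epoch time of the last Puppet run."""
    now = time.time() if now is None else now
    age = now - last_run_epoch
    if age > max_age:
        return "CRITICAL", f"Puppet has not run in the last {max_age // 3600} hours"
    return "OK", f"puppet ran {age / 3600:.1f} hours ago"

# Example: a host whose last run was 11 hours ago trips the alert.
print(puppet_freshness(time.time() - 11 * 3600))
```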
[15:43:53] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (614061), Total (620125)
[15:53:11] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 181 seconds
[15:53:47] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 195 seconds
[16:06:24] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[16:06:59] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[16:07:35] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[16:07:44] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[16:13:29] New patchset: Krinkle; "checkoutMediaWiki: Remove redundant argument to git-submodule." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50608
[16:13:32] New review: Krinkle; "(1 comment)" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/14864
[16:13:46] New review: Krinkle; "(1 comment)" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/14864
[16:34:35] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:36:41] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:36:50] RECOVERY - MySQL disk space on neon is OK: DISK OK
[16:37:35] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[16:38:56] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:51:05] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.94 ms
[16:52:36] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:54:24] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[16:56:30] New patchset: Krinkle; "Remove dead Aliases "for 1.17 wikis" from bits conf." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50609
[16:57:00] New review: Krinkle; "These directories no longer exist in live-1.5." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/50609
[17:00:32] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[17:01:17] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[17:08:38] RECOVERY - Puppet freshness on mw1134 is OK: puppet ran at Sun Feb 24 17:08:21 UTC 2013
[17:27:37] New patchset: Krinkle; "Clean up old multiversion readme." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50611
[17:30:14] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[17:31:22] New patchset: Krinkle; "Clean up common/README." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50612
[17:31:44] RECOVERY - MySQL disk space on neon is OK: DISK OK
[17:33:10] New patchset: Krinkle; "Clean up common/README." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50612
[17:33:32] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Sun Feb 24 17:33:28 UTC 2013
[17:37:08] RECOVERY - Puppet freshness on mw64 is OK: puppet ran at Sun Feb 24 17:36:53 UTC 2013
[17:37:35] PROBLEM - SSH on lvs1002 is CRITICAL: Server answer:
[17:38:02] RECOVERY - Puppet freshness on mw1039 is OK: puppet ran at Sun Feb 24 17:37:56 UTC 2013
[17:41:11] RECOVERY - SSH on lvs1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[17:51:51] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:53:29] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:02:23] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours
[18:02:23] PROBLEM - Puppet freshness on mw1070 is CRITICAL: Puppet has not run in the last 10 hours
[18:05:53] * jeremyb_ sees some unresolved worrisome alerts. e.g. varnish bits. but idk where bits is served from these days
[18:06:03] and lvs1002/cp3003 were both bouncing
[18:07:20] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 9.057 seconds
[18:07:56] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:25:38] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:26:14] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:30:26] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours
[18:31:56] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:32:23] PROBLEM - Puppet freshness on mw1059 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:26] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:26] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:26] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours
[18:33:27] PROBLEM - Puppet freshness on mc1001 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:27] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:27] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours
[18:35:23] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours
[18:35:23] PROBLEM - Puppet freshness on mw1157 is CRITICAL: Puppet has not run in the last 10 hours
[18:35:23] PROBLEM - Puppet freshness on srv235 is CRITICAL: Puppet has not run in the last 10 hours
[18:37:20] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours
[18:37:38] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:40:47] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 184 seconds
[18:40:56] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 189 seconds
[18:41:05] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 9.125 seconds
[18:41:14] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:47:05] PROBLEM - Varnish HTTP bits on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:50:59] PROBLEM - SSH on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:51:53] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 8 seconds
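The flapping "Varnish HTTP bits" checks on niobium, arsenic and strontium above are HTTP probes: fetch a page from the bits cache, expect a 200 within a timeout, and report body size and elapsed time. A hedged sketch of that kind of probe using only the standard library is below; the URL and timeout are illustrative placeholders, not the real Nagios service definition.

```python
# Hedged sketch of an HTTP probe in the spirit of the "Varnish HTTP bits"
# checks above: expect HTTP 200 within a timeout and report body size and
# elapsed time. URL and timeout are illustrative placeholders.
import time
import urllib.request

def probe_http(url, timeout=10.0):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            elapsed = time.monotonic() - start
            if resp.status != 200:
                return "CRITICAL", f"HTTP {resp.status}"
            return "OK", f"HTTP OK HTTP/1.1 200 OK - {len(body)} bytes in {elapsed:.3f} seconds"
    except OSError as exc:
        return "CRITICAL", f"No data received from host ({exc})"

if __name__ == "__main__":
    print(*probe_http("http://127.0.0.1/"))  # placeholder URL
```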
[18:52:30] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[19:02:49] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[19:04:10] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[19:08:05] wooo, wikidata's job queue rollercoaster :)
[19:10:19] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 1 seconds
[19:10:28] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[19:15:52] PROBLEM - NTP on strontium is CRITICAL: NTP CRITICAL: No response from NTP server
[19:28:23] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Cant find record in page_restrictions on query. Default da
[19:35:39] hrm, no Tim-away/binasher/domas
[19:35:47] (for the repl error above)
[19:48:43] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (702571), plwiki (47506), Total (751563)
[19:48:54] jeremyb_, probably not deadly urgent as it's a pmtpa slave
[19:49:28] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (699362), plwiki (45112), Total (745335)
[19:49:44] MaxSem: yeah, i never know what's important there. at least toolserver is more often slaving off an arbitrary slave than off an intermediate master?
[19:50:32] and last time we had a replication error like that it was 3 boxen at once not just the one
[19:50:40] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 202 seconds
[19:51:13] (i doubt it's an emergency but i think it's at least worth looking into sometime today)
[19:51:16] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 220 seconds
[20:20:17] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[20:20:17] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 187 seconds
[20:20:44] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 2 seconds
[20:20:53] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 198 seconds
[20:23:53] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[20:24:29] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[20:35:53] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[21:08:44] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.053 seconds
[21:12:56] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:14:35] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[21:18:57] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[21:26:36] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[21:28:24] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[21:52:30] New patchset: Legoktm; "(bug 45079) Add P: as alias for Property namespace" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50709
[21:57:12] RECOVERY - MySQL disk space on neon is OK: DISK OK
[21:57:12] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
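The recurring replication-delay alerts and the broken SQL thread on db35 above (Slave_IO_Running: Yes, Slave_SQL_Running: No, "Cant find record in page_restrictions") would usually be triaged by reading SHOW SLAVE STATUS on the affected replica. A hedged sketch of that first, read-only step follows, using PyMySQL; the host and credentials are placeholders, and this is a generic diagnostic, not the team's actual runbook or any attempt at a fix.

```python
# Hedged sketch of first-pass triage for an alert like the db35 one above:
# read SHOW SLAVE STATUS and report thread states, lag and the last error.
# Host and credentials are placeholders; this only inspects, it changes nothing.
import pymysql

def slave_status(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
    finally:
        conn.close()
    if row is None:
        return "not configured as a replica"
    return (f"IO={row['Slave_IO_Running']} SQL={row['Slave_SQL_Running']} "
            f"lag={row['Seconds_Behind_Master']} last_error={row['Last_Error'] or 'none'}")

if __name__ == "__main__":
    print(slave_status("db35.example", "monitor", "secret"))  # placeholder credentials
```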
[22:10:06] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[22:22:05] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.53 ms
[22:28:05] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[22:34:50] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[22:35:45] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[22:57:38] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 210 seconds
[22:58:06] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 218 seconds
[23:06:20] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho
[23:07:14] RECOVERY - MySQL disk space on neon is OK: DISK OK
[23:51:49] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: HTTP CRITICAL - No data received from host
[23:51:58] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds