[00:03:00] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 241 seconds [00:03:27] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 270 seconds [00:04:48] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [00:05:24] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.876 seconds [00:05:24] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.883 seconds [00:09:18] PROBLEM - Lucene on search6 is CRITICAL: Connection timed out [00:09:18] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:18] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:12:00] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.631 seconds [00:12:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.625 seconds [00:13:12] RECOVERY - Lucene on search6 is OK: TCP OK - 8.992 second response time on port 8123 [00:14:51] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [00:15:54] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:15:54] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:19:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.672 seconds [00:19:57] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.699 seconds [00:21:18] PROBLEM - Lucene on search6 is CRITICAL: Connection timed out [00:25:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:39] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [00:25:39] so how do the varnish caches decide to send something to wap on ekrem ? [00:25:58] could we maybe have them send directly to mobile.wikipedia.org ? [00:25:59] they don't [00:26:02] bypassing the server ? [00:26:07] oh ? [00:26:10] varnish doesn't hit that server at all [00:26:15] RECOVERY - Lucene on search6 is OK: TCP OK - 0.002 second response time on port 8123 [00:26:28] so if i'm reading correctly, it looks like most of the requests are coming from 10.64.0.137 (cp1015) [00:31:48] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.865 seconds [00:31:48] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.863 seconds [00:32:04] ok, i only know about varnish re: that stuff. ekrem is a mysterious one off and i'm not sure what squd sends there.. Ryan_Lane? ^^ [00:32:59] also, maybe switch the 302 to a 301 ? another possibly interesting tidbit is that they're all using HTTP/1.0 [00:33:22] squid uses http/1.0 [00:33:34] it's hi-tech like that [00:33:53] ah [00:33:54] :) [00:35:11] !log changing search15 to run regular search-pool2 indexes instead of highlights [00:35:14] Logged the message, Master [00:36:20] preilly: do you know anything about en.wap.wikimedia.org? [00:38:02] binasher: nope [00:38:52] maybe we should leave it down.. [00:40:46] probably should [00:40:56] LeslieCarr: ^^ [00:40:59] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:02] heh [00:41:04] good timing [00:41:07] I'm about to start deleting broken thumbs. 
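The 302-versus-301 suggestion above is easy to sanity-check from a shell: a HEAD request shows what ekrem currently hands back for a WAP hostname. The response lines below are illustrative rather than captured output, but they match the behaviour described later in the log (en.wap 302s to en.mobile).

    # what does ekrem return for the WAP front page right now?
    curl -sI http://en.wap.wikipedia.org/ | head -n 3
    # expected shape of the answer (illustrative):
    #   HTTP/1.1 302 Found
    #   Location: http://en.mobile.wikipedia.org/
    # a 301 would let squid/varnish and well-behaved clients cache the redirect,
    # so repeat visitors would skip the slow round trip through ekrem entirely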
[00:41:08] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:13] robla: there's depoly testing stuff going on, right? [00:41:17] well it's up right now - just rather broken [00:41:25] any chance I should hold off because of it? [00:41:31] or is just warning that I'm doing it sufficient? [00:41:41] so i only see one main referrer and that is tracfone [00:41:47] a little bit of yahoo mobile [00:41:53] maplebed: from swift alone? [00:41:54] LeslieCarr: while looking at logs, did you get an idea of what kind of requests are hitting it? othat htan from squid [00:42:07] what's a full url? [00:42:07] AaronSchulz: from swift and from squid. [00:42:15] maplebed: oh, right, of course [00:42:18] "http://us.m2.yahoo.com/w/ygo-onesearch?__redir=1&submit=oneSearch&.tsrc=attosus&first=1&p=wikipedia&bzc=55806" [00:42:25] or most commonly "http://m3.tracfone.com/search" [00:42:30] wap.wikipedia.org:80 10.64.0.137 - - [15/Feb/2012:20:59:39 +0000] "GET / HTTP/1.0" 302 550 "http://m3.tracfone.com/search" "LG-LG231C[TF268435460901625418000000019036500542] UP.Browser/6.2.3.8 (GUI) MMP/2.0 [00:42:44] maplebed: I think it should be fine. [00:42:47] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (24029) [00:42:47] I'd go ahead and watch the graphs [00:42:50] thanks robla. [00:42:54] http://m3.tracfone.com/mobilewebsites [00:43:03] the only bad effect I can forsee is ms5 getting overloaded if I accidentally purge too much too fast. [00:43:05] http://m3.tracfone.com/redir/wikipedia [00:43:09] ok, so en.wap just redirects to en.mobile [00:43:10] I'll watch ms5 and kill if I see issues. [00:43:25] maplebed: before proceeding though... [00:43:28] hehe i saw preilly's request from there :) [00:43:39] I get http://en.mobile.wikipedia.org/ [00:43:48] yeah, then it 302's to en.mobile [00:43:54] maplebed: let's move the conversation to #wikimedia-tech [00:43:57] did it take like 8-10 seconds to finally get there ? [00:44:13] yes [00:44:21] yeah, that's all ekrem delay :( [00:44:43] but i figured if we can push that redirect up to the squids ? [00:44:53] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.856 seconds [00:45:02] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.530 seconds [00:45:19] it would be better to move the dns to point to the mobile varnish cluster [00:45:27] and have varnish do the redirect [00:45:30] totally [00:45:42] LeslieCarr: want to take a shot at that? 
templates/varnish/mobile-frontend.inc.vcl.erb for the varnish bit [00:45:59] cool [00:48:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:49:05] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:52:32] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [00:52:32] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [00:53:39] ^ that's racist nagios [00:55:46] !log rebooting prototype.wikimedia.org [00:55:48] Logged the message, Master [00:56:53] New patchset: Lcarr; "Adding mobile wap to redirect to new mobile site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2620 [00:57:47] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 204 seconds [00:57:47] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 206 seconds [00:58:15] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2606 [00:58:15] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2606 [01:00:47] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.547 seconds [01:04:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:05] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.260 seconds [01:13:30] New patchset: Lcarr; "Adding mobile wap to redirect to new mobile site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2620 [01:14:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:15:13] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2620 [01:15:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2620 [01:18:11] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.373 seconds [01:19:50] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.946 seconds [01:22:14] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:24:20] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:56] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.831 seconds [01:25:32] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:27:56] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:28:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:32:28] binasher: so do we need to restart varnish on the machines or should it pick it up ? [01:32:37] it should pick it up [01:32:42] cool [01:34:14] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.483 seconds [01:34:32] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.402 seconds [01:37:23] PROBLEM - Host formey is DOWN: CRITICAL - Host Unreachable (208.80.152.147) [01:37:59] hrm, so queries are still going through, how would i check which hosts the lvs forwards these on to ? (there's no "wap" section on the list on noc ) binasher ? 
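For the DNS half of the plan discussed above, the per-language WAP names have to stop resolving through wikipedia-lb and point at the eqiad mobile varnish cluster instead. Wikimedia's zones are generated from pdns templates with language-list expansion (hence the formatting worry), so the real change lives in a template, but the intended end state per language is roughly the following BIND-style sketch; whether the template expands these to CNAMEs or to A records is a detail of the generator, not the point.

    ; before: each $lang.wap.wikipedia.org resolved via the general-purpose LVS name
    ; en.wap   IN CNAME wikipedia-lb.wikimedia.org.
    ;
    ; after: send WAP traffic straight to the mobile cluster, which will serve the redirect
    en.wap     IN CNAME mobile-lb.eqiad.wikimedia.org.
    de.wap     IN CNAME mobile-lb.eqiad.wikimedia.org.
    ; ...one record per language in the language list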
[01:38:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:26] RECOVERY - Host formey is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [01:38:35] LeslieCarr: after varnish gets updated, next thing is to change dns [01:38:35] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:40:17] okay, so it currently points to wikipedia-lb.wikimedia.org [01:44:57] mobile-lb.eqiad.wikimedia.org is what it should be instead [01:45:41] i'm iffy on the formatting in the pdns templates for the language A record expansion [01:47:26] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.538 seconds [01:50:35] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [01:50:44] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [01:51:38] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:52:50] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 196 seconds [01:54:11] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.382 seconds [01:55:32] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds [01:55:59] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 613s [01:56:06] !log 1.19 schema migraitons now running on enwiki slaves [01:56:08] Logged the message, Master [01:56:17] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 628s [02:00:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:01:14] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.744 seconds [02:01:59] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 216 seconds [02:02:08] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 223 seconds [02:03:38] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.884 seconds [02:05:08] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:13:23] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:14:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:16:00] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [02:20:05] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [02:21:53] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [02:22:16] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2619 [02:23:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.256 seconds [02:27:38] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:58] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [02:39:47] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.441 seconds [02:43:32] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:44:26] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:35] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:49:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.916 seconds [02:53:44] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:56:17] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.652 seconds [03:04:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:59] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 5.572 seconds [03:09:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:32] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.927 seconds [03:14:26] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:30] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.433 seconds [03:20:17] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.360 seconds [03:25:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:38:42] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.886 seconds [03:38:42] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds [03:38:51] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 0 seconds [03:41:15] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 195 seconds [03:42:09] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 196 seconds [03:42:45] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:45:45] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.149 seconds [03:46:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.472 seconds [03:49:48] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:50:51] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:52:21] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [03:56:24] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [03:56:33] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
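The "new logrotate for search nodes" patchset above is just a rotation policy for the Lucene daemon's log. The path and retention numbers below are assumptions for illustration, not the contents of the actual change:

    # /etc/logrotate.d/lucene-search (illustrative)
    /a/search/log/log {
        daily
        rotate 7
        compress
        delaycompress
        missingok
        notifempty
        copytruncate    # the search daemon keeps its log file handle open, so truncate in place
    }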
[03:57:45] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [03:58:21] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [04:00:09] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.563 seconds [04:03:18] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.258 seconds [04:04:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:07:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:14:59] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.341 seconds [04:17:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.385 seconds [04:20:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:21:08] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:17] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [04:46:20] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:53] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [04:49:11] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.296 seconds [04:51:17] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [04:52:56] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [04:53:32] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [04:54:35] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:56] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:56:05] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:57:08] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [04:57:17] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [05:00:35] PROBLEM - MySQL Slave Delay on db38 is CRITICAL: CRIT replication delay 196 seconds [05:01:20] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 241 seconds [05:06:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.554 seconds [05:11:06] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:15] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.704 seconds [05:26:33] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:27:18] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:29:06] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [05:53:15] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.856 seconds [05:57:18] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:06:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.843 seconds [06:09:12] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:09:39] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds [06:09:57] RECOVERY - MySQL Slave Delay on db38 is OK: OK replication delay 0 seconds [06:09:57] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.733 seconds [06:10:24] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [06:13:42] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:14:00] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:21:30] PROBLEM - MySQL Replication Heartbeat on db52 is CRITICAL: CRIT replication delay 213 seconds [06:21:39] PROBLEM - MySQL Slave Delay on db52 is CRITICAL: CRIT replication delay 221 seconds [06:38:48] RECOVERY - MySQL Slave Delay on db52 is OK: OK replication delay 0 seconds [06:38:48] RECOVERY - MySQL Replication Heartbeat on db52 is OK: OK replication delay 0 seconds [06:42:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 245 seconds [06:43:00] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 254 seconds [06:49:00] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:12] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [06:54:09] New review: Hashar; "Thanks Leslie!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2606 [07:01:09] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:03:42] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [07:36:18] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [07:36:18] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [07:53:33] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [08:14:37] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:15:49] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [10:40:42] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [10:44:27] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [10:49:07] New patchset: Catrope; "Define BINDIR in purge-checkuser" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2621 [10:49:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2621 [10:58:18] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2621 [10:58:18] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2621 [11:02:54] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:04:06] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [11:21:18] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 120 MB (1% inode=62%): /var/lib/ureadahead/debugfs 120 MB (1% inode=62%): [11:25:30] RECOVERY - Disk space on srv224 is OK: DISK OK [11:47:17] !log Moved udpmcast unicast-to-multicast HTCP relay from lily to hooft [11:47:20] Logged the message, Master [12:14:15] !log Shutdown lily for decommissioning [12:14:17] Logged the message, Master [12:18:09] PROBLEM - Host lily is DOWN: PING CRITICAL - Packet loss = 100% [13:04:25] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 189 MB (2% inode=62%): /var/lib/ureadahead/debugfs 189 MB (2% inode=62%): [13:09:58] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 194 MB (2% inode=62%): /var/lib/ureadahead/debugfs 194 MB (2% inode=62%): [13:11:19] RECOVERY - Disk space on srv223 is OK: DISK OK [13:37:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2616 [13:37:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2616 [13:37:38] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2619 [13:37:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [13:53:36] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [13:57:39] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [13:59:36] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [14:25:33] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:39] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [14:30:57] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [14:33:39] PROBLEM - SSH on search1003 is CRITICAL: Connection refused [14:33:57] PROBLEM - RAID on search1003 is CRITICAL: Connection refused by host [14:33:57] PROBLEM - DPKG on search1003 is CRITICAL: Connection refused by host [14:34:24] PROBLEM - Disk space on search1003 is CRITICAL: Connection refused by host [14:37:42] PROBLEM - Lucene on search1003 is CRITICAL: Connection refused [14:43:15] PROBLEM - Disk space on mw40 is CRITICAL: DISK CRITICAL - free space: /tmp 60 MB (3% inode=87%): [14:44:18] RECOVERY - SSH on search1003 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:50:04] RECOVERY - Disk space on mw40 is OK: DISK OK [14:52:19] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [15:02:48] New patchset: Pyoungmeister; "usinga more up to date db list, by roan's suggestion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2622 [15:03:23] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2622 [15:03:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2622 [15:03:52] PROBLEM - NTP on search1003 is CRITICAL: NTP CRITICAL: No response from NTP server [15:05:23] merged in a change by tim. going to assume that's ok [15:46:55] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:48:16] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [16:18:11] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:35] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 31.01 ms [16:26:26] PROBLEM - DPKG on search1001 is CRITICAL: Connection refused by host [16:26:35] PROBLEM - Disk space on search1001 is CRITICAL: Connection refused by host [16:27:02] PROBLEM - RAID on search1001 is CRITICAL: Connection refused by host [16:27:32] nagios is funny: Max concurrent service checks (512) has been reached [16:27:47] PROBLEM - SSH on search1001 is CRITICAL: Connection refused [16:30:38] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [16:33:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:35] RECOVERY - SSH on search1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:35:44] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.68473669565 (gt 8.0) [16:37:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.755 seconds [16:45:29] RECOVERY - RAID on search1001 is OK: OK: no RAID installed [16:46:14] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.93967701754 [16:46:14] RECOVERY - DPKG on search1001 is OK: All packages OK [16:46:14] RECOVERY - Disk space on search1001 is OK: DISK OK [16:47:59] robh: can you give me status of sq31, 35, 38 i believe they all have the PCI Training error. [16:48:30] checking sq31 [16:48:51] sq31 is in the installer [16:49:36] thx...ms-be2 is in that rack.. i have the power but if I am pulling those off...i want to keep my power balanced ;] [16:50:57] cmjohnson1: why do you ask, they have error leds? [16:51:18] I am going to drop a ticket to reinstall sq32, the others had the com2 timeout, so resetting their drac to check them out [16:51:35] yes they do [16:53:20] !log rebooting sq35 & sq38, serial console blank [16:53:22] Logged the message, RobH [16:53:40] cmjohnson1: lets see how they come back up =] [16:54:09] cmjohnson1: sq35 confirmed, pci training error [16:54:15] sorry, thats 38 [16:54:28] cmjohnson1: sq35 is refusing to serial console, you may have to connect cart to it [16:54:33] but its offline at the moment, confirmed [16:54:56] k...i will take a look at it [16:55:05] sq38 confirmed pci training error [16:55:11] thats the hdd controller exploding right? [16:55:18] if i recall [16:55:19] yes [16:55:32] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.031 second response time on port 8123 [16:55:47] very common problem with those 1950's [16:56:08] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2438* [16:56:25] cmjohnson1: So it looks like sq31, 32, 35, and 38 are decom [16:56:41] so go ahead and pull and wipe all three of them [16:57:05] well, cannot wipe in the servers, as they are borked, but you know what i mean [16:57:23] okay...plz put it in a ticket...yep.. 
i have a backup plan for those [16:58:22] !rt 2473 [16:58:22] https://rt.wikimedia.org/Ticket/Display.html?id=2473 [16:58:30] cmjohnson1: ^ [17:08:53] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 172 MB (2% inode=62%): /var/lib/ureadahead/debugfs 172 MB (2% inode=62%): [17:08:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:11] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 268 MB (3% inode=62%): /var/lib/ureadahead/debugfs 268 MB (3% inode=62%): [17:12:47] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 272 MB (3% inode=62%): /var/lib/ureadahead/debugfs 272 MB (3% inode=62%): [17:15:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.952 seconds [17:15:29] RECOVERY - Disk space on srv221 is OK: DISK OK [17:48:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.255 seconds [18:08:19] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:08:59] uhhh, how is 400 == OK? [18:11:01] nagios-wm: ping! [18:12:19] seems to have been like that a while: http://nagios.wikimedia.org/nagios/cgi-bin/history.cgi?host=stafford&service=Puppetmaster+HTTPS [18:13:43] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 30.87 ms [18:16:52] PROBLEM - Disk space on search1002 is CRITICAL: Connection refused by host [18:17:01] PROBLEM - RAID on search1002 is CRITICAL: Connection refused by host [18:17:19] PROBLEM - SSH on search1002 is CRITICAL: Connection refused [18:17:28] PROBLEM - DPKG on search1002 is CRITICAL: Connection refused by host [18:20:46] PROBLEM - Lucene on search1002 is CRITICAL: Connection refused [18:22:22] Ryan_Lane: lemme know when you're ready to "fix" dns [18:22:27] heh [18:22:33] ok. so, what are we actually doing again? [18:23:23] we need to switch wap's dns from pointing to wikipedia-lb.wikimedia.org and repoint it to mobile-lb.eqiad.wikimedia.org [18:23:31] and it uses the language list [18:23:42] so, what's the actual wap address? [18:24:03] $lang.wap.wikipedia.org [18:24:07] ok [18:26:37] RECOVERY - SSH on search1002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:26:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:50] when you're started on it i'd love to see what you're doing so i can break it next time :) [18:32:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.469 seconds [18:33:40] !log updating $lang.wap.wikipedia.org dns to point to mobile-lb.eqiad.wikimedia.org [18:33:42] Logged the message, Mistress of the network gear. [18:37:28] !log reverting $lang.wap.wikipedia.org dns changes [18:37:30] Logged the message, Mistress of the network gear. [18:39:41] New patchset: Ottomata; "Buncha mini changes + hackiness to parse a few things. This really needs more work" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2623 [18:42:43] New review: Diederik; "Ok." 
[analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2623 [18:44:55] PROBLEM - NTP on search1002 is CRITICAL: NTP CRITICAL: No response from NTP server [18:48:50] hrm, so dobson seems to be stuck at Reloading zones in PowerDNS [18:49:03] i even tried restarting authdns-update [18:49:09] err [18:49:51] LeslieCarr: what vlan are the labstore hosts in? [18:50:30] the instances will need to have network access to it [18:50:57] they'll need to be able to reach the build host too [18:51:04] at least at first [18:52:01] they're in the normal internal vlan [18:52:24] so we should probably dual network them for now ? [18:52:37] 2nd one in the 103 vlan [18:55:23] well peachy, whatever pdns_control is wanting to talk to, ain't on the other end [18:55:40] strace -p 6949 [18:55:40] Process 6949 attached - interrupt to quit [18:55:40] recvfrom(3, [18:56:42] hrm [18:57:22] according to the intertubes, that's how it should be talking to running pdns servers … and the other pdns servers appear to be running... [18:58:04] well.. you see it doing anything? [18:59:44] oh, look at that - dns request to ns0 timed out [18:59:49] i'll restart pdns [19:02:18] well it's stopped :-D [19:02:37] !log restarted pdns on ns0 [19:02:39] Logged the message, Mistress of the network gear. [19:02:43] PROBLEM - Lighttpd HTTP on dataset2 is CRITICAL: Connection refused [19:02:43] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.020 seconds response time. www.wikipedia.org returns 208.80.154.225 [19:03:01] hrm,why didn't nagios tell us it was down [19:03:20] so the next q is whether the reload went [19:03:34] ah phooey [19:03:56] yay it's back to the old config [19:04:42] !log restarted lighty on dataset2, silly thing [19:04:44] Logged the message, Master [19:04:46] ryan_lane: did you have any more problems with labstore1? [19:04:54] back to the slightly broken config… :) [19:04:59] yeah :-D [19:05:00] oh. I need to build the raidset there [19:05:08] but hey it's serving! [19:05:08] instead of completely broken [19:05:15] hehe [19:05:16] yep :) [19:05:16] RECOVERY - Lighttpd HTTP on dataset2 is OK: HTTP OK HTTP/1.0 200 OK - 4906 bytes in 0.022 seconds [19:05:24] so how is athens right now ? [19:05:31] um [19:05:38] that's the 64 thousand dollar question [19:05:41] or is that [19:05:49] 93 billion euros... [19:06:22] the rumor mill really is flying these days, and every time we get close to a deal the target gets moved [19:06:31] so I really have no idea, neither does anyone else I guess [19:06:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:46] really? [19:06:51] * apergos glares at it [19:09:00] load average 63?? [19:09:11] dude, it's been going crazy, like spiking up and down [19:09:21] awesome :-/ [19:09:36] cmjohnson1: same problem [19:10:03] maybe 16 cpu's isnt enough - do any of the new machines have more ? [19:10:05] RobH: ? [19:10:14] ? [19:10:18] cmjohnson1: when I made the raid, it's saying degraded, and the disk goes between rebuild and missing [19:10:22] LeslieCarr: some do [19:10:38] like apergos enwiki xml snapshot machines [19:10:40] Dynamic lookup of $gid at /var/lib/git/operations/puppet/manifests/admins.pp:975 is deprecated. Support will be removed in Puppet 2.8. Use a fully-qualified variable name (e.g., $classname::variable) or parameterized classes. [19:10:41] but thats about it [19:10:43] same disk? 
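When authdns-update hangs at "Reloading zones in PowerDNS" and pdns_control sits in recvfrom() the way the strace above shows, the far end of the control socket usually isn't answering. Assuming the stock authoritative-server control commands, a few pdns_control probes narrow it down before reaching for a full restart:

    # is the daemon actually alive on its control socket?
    pdns_control ping
    # how long has it been up (did it silently crash and restart)?
    pdns_control uptime
    # ask it to pick up zone changes without a restart
    pdns_control rediscover   # scan the backend for newly added zones
    pdns_control reload       # reload data for zones it already knows about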
[19:11:11] cmjohnson1: 01:00:02: [19:11:12] yep [19:11:15] taking up a bunch of space in the lgo but it shouldn't make things so much slower [19:11:31] snapshot4 and snapshot-sometingorother in eqiad [19:11:38] okay...i will phone Dell and see if we can that disk replaced [19:11:43] those have 32 cores (4 8-core I think) [19:11:45] thabks [19:11:47] *thanks [19:12:25] hrm [19:12:34] with row c coming… [19:12:40] think we should move up ? [19:12:50] apergos: snapshot1001 [19:13:05] I feel like ever beefier puppet hosts is the wrong move [19:13:11] somehow we gotta split the workload up [19:13:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.594 seconds [19:13:43] we have a dedicated host for puppet in eqiad its just not setup [19:13:49] rubidium [19:13:54] that's the way to do it [19:13:56] but its a high performance misc server [19:14:07] so only 12 cpu cores. [19:14:10] uhh [19:14:13] only, heh. [19:14:26] well we'll see when it handles eqiad only, maybe that will be enough [19:14:36] honestly 12 cores for puppet? it's not like we have a gazillion hosts [19:16:15] oh [19:16:29] well yeah, if we split it up it should be okay [19:16:47] but puppet is written in ruby ;) [19:16:50] * apergos gripes some about [19:16:54] rats you are faster than me [19:16:57] ... ruby :-D [19:16:59] * AaronSchulz wonders which dbs the api uses for slaves [19:18:18] and it's not in the cool dbtree graph! [19:20:27] cmjohnson1: when do you want to do the wap installs ? [19:22:42] so, the dumb q, are the apis actually split off to separate dbs? cause... [19:23:42] I kinda think not, there's just a separate pool of app servers [19:24:15] New patchset: Lcarr; "Moving generic::tcptweaks to "standard" server class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2624 [19:26:04] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2624 [19:26:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2624 [19:35:23] if there's another stafford alarm that one is probably me :) [19:36:10] cool [19:36:19] (it's really past mym worktime anyways) [19:40:21] !log running /etc/network/if-up.d/initcwnd on the apaches [19:40:23] Logged the message, Mistress of the network gear. [19:40:44] yay [19:40:54] i love it when everything just works :) [19:41:15] :-) [19:46:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.942 seconds [20:03:16] lesliecarr: want to do it now? [20:10:49] PROBLEM - Disk space on mw15 is CRITICAL: DISK CRITICAL - free space: /tmp 18 MB (1% inode=87%): [20:25:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:16] RECOVERY - Disk space on mw15 is OK: DISK OK [20:30:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.918 seconds [20:33:55] PROBLEM - HTTP on singer is CRITICAL: Connection refused [20:34:07] looks like ganglia is broken [20:42:19] wtf [20:42:28] php was delete from nickel [20:42:29] rc libapache2-mod-php5 5.3.2-2wm1 server-side, HTML-embedded scripting languag [20:46:46] woosters: https://rt.wikimedia.org/Ticket/Display.html?id=2452 ... I can haz some help? 
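The deprecation warning quoted a little earlier ("Dynamic lookup of $gid at .../admins.pp:975 is deprecated") is puppet complaining about dynamic variable scoping, which disappears in 2.8. A minimal sketch of the two fixes the warning itself suggests, with the class and group names invented for illustration:

    # before: $gid is resolved dynamically from whatever scope happened to include this class
    class admins::restricted {
        group { 'deploy': gid => $gid }
    }

    # after, option 1: fully-qualified lookup of the variable where it is actually defined
    class admins::restricted {
        group { 'deploy': gid => $admins::baseaccount::gid }
    }

    # after, option 2: a parameterized class, so the caller passes the value explicitly
    class admins::restricted($gid) {
        group { 'deploy': gid => $gid }
    }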
[20:48:44] let me look into it and get back to u [20:48:47] hexmode [20:49:02] woosters: ty :) [20:49:27] woosters: if we could get it done before we roll out 1.19 further, that'd be a bonus :) [20:50:29] cmjohnson1: yo [20:50:33] binasher: ganglia broken ? [20:51:04] Lesliecarr: hi [20:51:23] sorery i was at lunch [20:51:30] puppet auto updating apache caused it to remove php [20:51:48] np...i missed u the first time....ms-be's are all set for you [20:51:56] ensure => latest strikes again [20:52:25] oh noes puppet [20:52:28] bad puppet [20:52:36] heh [20:52:45] wow. that's kind of crazy [20:52:51] is singer http meant to be down? [20:52:53] so binasher the dns change worked perfectly…. however all queries were then redirected to incubator.wikimedia.org [20:52:55] you know what we need? [20:53:03] a package management system [20:53:07] hahahaha [20:53:07] a puppet-stabbing knife? [20:53:08] fucking canonical [20:53:26] LeslieCarr: did you change the dns back? [20:53:50] i did [20:54:21] Warning: DocumentRoot [/srv/org/wikimedia/url/] does not exist [20:54:24] from singer [20:54:28] so, no http [20:54:34] maybe when we get the new labs ops person in I can spend some time to write an opensource package management system [20:54:52] LeslieCarr: looks like mediawiki / apache does that redirect [20:55:05] so you'll have to actually have varnish serve a redirect to mobile instead of doing the rewrite [20:55:16] yeah, that's apache2 being updated over there. [20:55:40] do the prod app servers also have apache2 on ensure latest? [20:56:28] Ryan_Lane: serious question from a non-debian-world user, is apt not a package-management system? [20:58:52] ok, i think the prod appservers don't get apache installed directly via puppet, so they should be safe [20:59:58] Jeff_Green: it is. [21:00:05] for a single system [21:00:07] I mean for for an entire network [21:00:13] *one for [21:00:35] ah i see [21:01:43] landscape, for instance is one of these [21:01:51] it's a canonical product, and is closed source [21:02:00] we could make huge improvements even just by working more intentionally with packaging [21:02:10] how so? [21:02:17] it's the thing we do the *worst* right now [21:02:18] binasher: so should i do a redirect doing something like this ? https://www.varnish-cache.org/trac/wiki/VCLExampleRedirectInVCL [21:02:23] we have no way of tracking it [21:02:25] Ryan_Lane: yeah, that's what I mean [21:02:48] using built-in dependency management for one [21:02:55] actually I wonder what is still on singer [21:02:59] well, for our own custom packages we do that [21:03:05] most of this crap in here is not really on this host any more [21:03:11] LeslieCarr: there should be a redirect in the vcl already, do it like that [21:03:14] though we've been getting away from packaging most things to using puppeyt [21:03:16] *puppet [21:03:28] we package software, then configure it with puppet. it's a saner model [21:03:47] Ryan_Lane: right but we also put dependencies in puppet [21:03:51] LeslieCarr: search for 66 [21:03:53] 666 [21:03:59] Jeff_Green: only kind of [21:04:13] dang. 
it still is [21:04:14] for example we could use meta packages for services [21:04:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:04:20] eww [21:04:22] no thanks [21:04:25] ah [21:04:39] we also have no automated build system [21:04:52] Ryan_Lane: yes, that's necessary [21:04:57] we've oriented our development model towards puppet and away from packaging [21:05:06] we want an automated build system. there's plans for it [21:05:13] but, we have to have time to get to it [21:05:15] I'm coming from a shop where we built an automated build system and used meta packages [21:05:18] yeah [21:05:23] I dislike meta packages [21:05:23] binasher: so can we do the \1 \2 bits with the redirect, to preserve the language ? [21:05:23] and it was a hell of a lot smoother and more consistent [21:05:33] understandable, and I'm not saying that's the only way [21:05:38] * Ryan_Lane nods [21:05:46] I'd love to have an automated build system, though [21:05:56] but we have a dependency tree for system software, and at the interface between system software and service software [21:06:18] what did url.wikimedia.org do? [21:06:20] yeah. our puppet problems don't usually stem from that, though [21:06:24] so we shove puppet in between those, and work around the glitches with pinning and whatnot [21:06:38] we use pinning because we use our repo poorlt [21:06:40] *poorly [21:06:47] yeah [21:06:58] for it to work really well you have to take control of the whole shebang [21:07:06] LeslieCarr: you could still rewrite the host header and then: error 666 "http://" + req.http.host + req.url; [21:07:19] none of this is an issue with how we're using puppet, though [21:07:29] it's a problem with no one taking the time to fix the apt issues :) [21:07:40] ah cool :) [21:07:45] good idea [21:08:04] Jeff_Green: the fundraiser is over, want to fix our apt issues? :D [21:08:28] Ryan_Lane: sure! [21:08:31] \o/ [21:09:10] i'm planning to do this for the payments cluster anyway, perhaps we can use it as a model and/or test bed [21:09:15] cool [21:09:19] um, folks... I am actually serious about this. singer apache is failing to restart because it wants the docroot /srv/org/wikimedia/url/ for url.wikimedia.org, and I have no idea what that service is or did, or what should be in there [21:09:21] LeslieCarr: am i plugging in the same static LAN info I already have? [21:09:21] I wanted to build the build system in labs [21:09:25] does anyone know? [21:09:38] oh [21:09:39] LeslieCarr: [21:09:43] Jeff_Green: and in labs, anyone can build a package and have it added to a project specific repo [21:09:50] the original varnish change was a fail [21:09:53] cmjohnson1: what's the static lan info you already have? :) [21:10:01] Message from VCC-compiler: [21:10:01] Syntax error at [21:10:01] ('mobile-frontend.inc.vcl' Line 90 Pos 106) [21:10:02] set req.http.host = regsub( req.http.host, "^([a-zA-Z0-9-]+)\.wap\.([a-zA-Z0-9-]+)\.org" \1.mobile.\2.org); [21:10:02] ---------------------------------------------------------------------------------------------------------#----------------- [21:10:02] Running VCC-compiler failed, exit 1 [21:10:20] then, we can code-review the package, and move it into a common labs repo [21:10:22] ah well that would also kill it… [21:10:23] hrm [21:10:27] Ryan_Lane: ok. I have a lot to learn about labs. 
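The "ensure => latest strikes again" incident above (apache2 auto-upgrading on nickel and taking libapache2-mod-php5 with it) is the classic argument for not letting puppet chase the newest package, and it ties into the pinning/repo discussion here. A sketch of the safer patterns, with the version string made up for illustration:

    # risky: every puppet run may upgrade apache2, and apt will happily remove
    # conflicting packages (here, the PHP module) to satisfy the new dependencies
    package { 'apache2': ensure => latest }

    # safer: install it once, upgrade deliberately
    package { 'apache2': ensure => present }

    # or pin an explicit, tested version (illustrative version number)
    package { 'apache2': ensure => '2.2.14-5ubuntu8.8' }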
[21:10:28] it would automatically get signed for labs [21:10:37] then we can choose to move it to production [21:10:40] oh forgot a comma [21:10:40] where we'd manually sign it [21:10:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.953 seconds [21:10:50] Jeff_Green: I can give you a quick overview [21:10:59] ok [21:11:09] labs is based on openstack [21:11:15] openstack has a concept of "tenancy" [21:11:35] openstack nova calls tenants projects. I'll only use the term project ;) [21:12:13] a project is basically a collection of resources that is separated, security wise, from other projects [21:12:24] people can be members of projects, and have roles in the projects [21:12:32] what does the separation look like? [21:12:38] network address space? tools scope? [21:12:39] New patchset: Lcarr; "Fixing varnish language" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2625 [21:12:46] binasher: ^^ [21:12:49] from the network perspective, it's firewall rules [21:12:55] deny by default [21:13:08] PROBLEM - Varnish HTTP mobile-frontend on cp1042 is CRITICAL: Connection refused [21:13:10] this is at the virtualization layer [21:13:14] ok [21:13:15] LeslieCarr: looks good, sorry i missed that reviewing it before [21:13:26] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2625 [21:13:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2625 [21:13:47] inside of projects, people in sysadmin role can create/delete instances (virtual machines) [21:14:12] netadmins can modify the security groups (firewall rules), and manage public IPs and public DNS [21:14:22] no prob, commas are the bane of life [21:14:46] Ryan_Lane: but netadmins modify within a project? [21:14:47] so, in labs, we are trying to work with the multi-tenancy concept as much as possible, because it's fairly powerful [21:14:57] yes. netadmin and sysadmin are per-project [21:15:11] New patchset: Diederik; "Added full support for ip address and ip range filtering Added full support for regular expression matching" [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2626 [21:15:24] so, we try to give people full control over everything in a project [21:15:39] meaning, they can make all changes we'd make inside of production [21:15:54] so, if we have an automated build system and custom repos, a project should too [21:15:58] same with puppet [21:16:09] we're planning a puppet branch per project [21:16:14] where people can bypass review [21:16:21] right [21:16:29] they'll merge into test, then we'll cherry-pick to production [21:16:32] !log stopping mysql and apache on searchidx2... not sure why they are there. also, going to clean up some packages... like the ubuntu version of mediawiki [21:16:35] Logged the message, and now dispaching a T1000 to your position to terminate you. [21:16:57] the key is that we should be able to code-review before project-specific things become labs-wide and production-wide [21:17:02] same with packaging [21:17:10] we were planning a git repo per package [21:17:18] makes sense [21:18:05] Ryan_Lane: question: will packages necessarily be .deb in this model, or will it be agnostic? [21:18:14] deb [21:18:45] I can't see a reason we'd need another [21:18:51] we only support ubuntu [21:19:01] depends what the goal is [21:19:13] but anyway, I was just curious [21:19:41] can we support various OS installs on the virtual machines? 
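Putting the pieces of the VCL change together — the host rewrite that was missing its comma, plus binasher's error-666 trick for actually emitting the redirect — the working pattern looks roughly like the sketch below (the file is templates/varnish/mobile-frontend.inc.vcl.erb; this is the shape of the fix, not the merged diff):

    sub vcl_recv {
        if (req.http.host ~ "^([a-zA-Z0-9-]+)\.wap\.([a-zA-Z0-9-]+)\.org$") {
            # note the comma and the quoted replacement that the first patchset dropped
            set req.http.host = regsub(req.http.host,
                "^([a-zA-Z0-9-]+)\.wap\.([a-zA-Z0-9-]+)\.org", "\1.mobile.\2.org");
            # hand the full target URL to vcl_error via a private status code
            error 666 "http://" + req.http.host + req.url;
        }
    }

    sub vcl_error {
        if (obj.status == 666) {
            set obj.http.Location = obj.response;
            set obj.status = 301;   # or 302, whichever matches the old ekrem behaviour
            return (deliver);
        }
    }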
[21:19:59] Is secure.wiikimedia.org supposed to be down? [21:20:11] i mean different ubuntu releases for example [21:20:37] Jeff_Green: yep [21:20:51] Jeff_Green: right now we have lucid, natty, and maverick [21:20:55] or [21:21:00] cool [21:21:00] no. oneiric [21:21:05] we'll have precise soon [21:21:14] we can have anything we want, really, thoguh [21:21:15] *though [21:21:30] so perhaps next week I can get started with one or two projects [21:21:44] I want to get a search cluster going [21:21:51] heh [21:21:52] and one for otrs [21:22:00] there's already a project for otrs [21:22:06] the otrs guy has access to it [21:22:11] I can add you in [21:23:00] btw for humors sake i was just complaining about lucene to a CL friend and he said "omg sphinx is the only way, we do 400 million queries a day on that thing" [21:23:17] * Ryan_Lane has no clue about search [21:23:25] I think we have a search project or two as wel [21:23:27] *well [21:23:35] !projects [21:23:35] https://labsconsole.wikimedia.org/wiki/Special:Ask/-5B-5BResource-20Type::project-5D-5D/-3FMember [21:23:38] * Jeff_Green spend the better part of today trying to understand it [21:23:43] <3 wm-bot [21:23:44] casey: no not really but no one can seem to tell me what should be in the files that are gone on singer now. or whether I could just disable the broken piece. or anything. so it's logged, and it's down. [21:24:04] casey: what's secure? [21:24:07] :D [21:24:08] back to the otrs guy--how did he end up with an account/project? [21:24:21] hm. I don't actually see a project [21:24:34] I guess he didn't get one yet [21:24:40] but he's supposed to be helping us out [21:24:55] yeah this is true, and I know we're planning to set him up [21:25:08] seems he doesn't have an account or project :D [21:25:18] I'll make one for you, and add you in [21:25:24] wait, don't! [21:25:27] ok [21:25:28] apergos: okay, as long as it's logged. :-) I clicked on a link to OTRS using the old URL (secure.wikimedia.org/otrs/index.pl) and saw it wasn't up. I just wanted to make sure that we didn't intentionally take it down for good without leaving up a redirect. Thanks! [21:25:34] it'll be good for me to get that exposure [21:25:38] sounds good [21:25:44] lemme make sure you have the right permissions to do so [21:25:54] we're still meeting tomorrow about all that right? [21:25:56] sure. [21:26:03] err the otrs volunteer thing i mean [21:26:04] wish I could actually do something useful about it [21:26:08] we are? [21:26:15] re. data privacy issues [21:26:17] am I in that meeting? [21:26:18] oh [21:26:19] that [21:26:20] yes [21:26:42] i wanted to get that answered up front for him [21:27:01] so we can map out how the upgrade project will go [21:27:23] true [21:27:33] oh which reminds me, we're going to need to stage the data migration because that's the thing that's going to hurt and cause downtime during the actual upgrade [21:27:34] I don't necessarily have a problem with private data in labs [21:28:01] as long as everyone in the project knows that they can't add other project members [21:28:09] and all cloudadmins also know this [21:28:21] right--but there's also the question of org policy on volunteers/contractors and private data [21:28:22] and that firewall rules are properly made [21:28:26] yeah [21:29:09] how closely can we make a vm behave like a db machine? 
[21:29:21] as close as you'd like [21:29:32] it's a vm, though [21:29:36] so performance will be shit [21:29:48] right, and that's an important part of the test [21:29:50] you'll also need to use /mnt rather than /a [21:30:01] we do have some db hardware that's supposed to be for labs [21:30:04] basically we're going to need to know how many hours we're killing OTRS or whatever [21:30:11] there's no rackspace for it right now though :( [21:30:33] ok [21:30:49] i'll need to study and weigh some options there, then [21:31:15] might be better to do the test tranforms on the otrs slave db to get the timings measured [21:31:39] well, if we can get the db hardware racked in time, you can use that [21:31:43] ok [21:32:01] eventually we need to make that multi-project too, but that's a project all in itself [21:32:17] for now we'll just make dbs manually [21:32:22] * Ryan_Lane shudders [21:33:15] make dbs on a mysql instance? [21:33:38] nah. on real hardware [21:33:53] just wait until people start asking for meaningful test dataset [21:33:58] er datasets [21:34:14] we already have a bunch of people requesting dbs on real hardware [21:34:26] yeah [21:40:36] woosters: another weird one. not sure how important: https://rt.wikimedia.org/Ticket/Display.html?id=2478 [21:40:37] [21:41:30] it's because we stuck varnish in front, and it isn't varying for mobile [21:42:00] binasher: ^^ [21:42:24] I'm not sure how we're doing so for the normal mobile site, but would it be possible for wordpress as well? [21:43:41] the wordpress mobile ext would have to set the write headers to vary between the two, if the urls aren't different [21:43:58] what's an example url to get the mobile version? [21:44:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:16] I don't think the url is different [21:44:38] I think it detects a mobile browser and sends out different content [21:47:13] binasher: where can i check to see if my varnish language failed ? [21:49:27] LeslieCarr: ok, i'm running a puppet -t on cp1042 [21:49:55] oh, i'm running as well :) if it overlaps feel free to kill mine [21:50:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.383 seconds [21:52:14] LeslieCarr: now missing a ) :( [21:53:06] binasher: what line ? [21:53:54] line 89 [21:54:24] its the line above the line you change in the last commit.. i was staring at the highlighted portion and not seeing it [21:55:15] ahha [21:55:39] New patchset: Lcarr; "Fixing ")" for wap redirection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2627 [21:56:18] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2627 [21:56:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2627 [21:59:05] binasher: i think it looks good now, trying the dns change again :) [21:59:29] RECOVERY - Varnish HTTP mobile-frontend on cp1042 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.062 seconds [22:00:19] LeslieCarr: yup it looks good - i sent a test en.wap.wikipedia.org request to varnish and got back a page instead of a redirect [22:00:58] cool :) [22:02:53] notpeter, i have a bit of time now, so here is this machine causing a problem? [22:03:05] LeslieCarr: thanks for your merge yesterday (was on a jenkins class IIRC) [22:03:11] yep [22:03:12] yw [22:17:18] rainman-sr: hi! 
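On the blog ticket (RT 2478) discussed above: with varnish now in front, desktop and mobile visitors share one cache object unless something varies the cache key. A sketch of the usual pattern, assuming a hypothetical X-Device header; the real fix depends on what headers the WordPress mobile plugin can emit:

    sub vcl_recv {
        # classify the client before the cache lookup (regex deliberately crude)
        if (req.http.User-Agent ~ "(?i)mobile|wap|symbian|up\.browser") {
            set req.http.X-Device = "mobile";
        } else {
            set req.http.X-Device = "desktop";
        }
    }

    # the backend (the WordPress plugin) must then answer with
    #   Vary: X-Device
    # so varnish stores separate desktop and mobile copies of each URL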
[22:17:25] hello [22:17:39] so, if you take a look at search1001 [22:17:46] (you should have an account [22:17:48] ) [22:18:29] does it look properly setup? [22:18:45] (it's currently serving up eswiki search, as that one is not sharded) [22:19:03] cannot log in, asks me for password [22:19:15] hrm, lemme check that out [22:20:02] ah, puppet has not put your key there yet. I shall do so by hand. one second [22:21:11] puppetd -t not working to put the key in notpeter ? [22:21:37] LeslieCarr: no. but that might be a position of the moon thing [22:21:37] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:22:11] rainman-sr: ok, give it a shot now [22:22:28] nop, still the same [22:22:34] does it have home mounted? [22:22:47] i access via my public ssh key [22:22:59] no. I'm trying to make search not reliant on nfs [22:23:27] !log restarting pdns ns2 [22:23:29] Logged the message, Master [22:23:30] well then i won't have access unless you copy my keys from /home/rainman/.ssh [22:23:38] that's what I just did [22:24:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:24:10] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.115 seconds response time. www.wikipedia.org returns 208.80.154.225 [22:24:12] hmm well still prompts me for password [22:24:37] and you can log into fenari ok? [22:26:34] !log adding search15 to search-pool2 lvs vip [22:26:36] Logged the message, Master [22:29:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.662 seconds [22:29:58] rainman-sr: arg. ok. sorry. try again, plz [22:30:02] should work now [22:30:10] ah works now [22:30:20] sorry about that [22:31:45] no problems [22:32:24] so you set indexer to searchidx2.ptmpta and it doesn't work? [22:32:31] can you do it now? [22:32:31] correct [22:32:38] yes [22:32:42] that is the case currently [22:33:00] and the error in question is in the log file [22:33:21] confs are in /a/search/conf at this poitn [22:33:24] *point [22:33:30] yep [22:33:36] and you restarted the deamon? [22:34:01] no. just did so [22:34:01] done [22:34:16] there we go [22:34:32] Error invoking remote method getIndexTimestamp() on host searchidx2.pmtpa.wmnet : Unknown host: searchidx2 [22:34:52] hmmm [22:35:14] !log db1035 died 2 days ago, attempting to power cycle [22:35:16] Logged the message, Master [22:35:25] New patchset: Bhartshorne; "trying Tim's suggestion of abandoning chunked-encoding for swift puts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2628 [22:35:34] also, same error if IP of indexer is used [22:36:20] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2628 [22:36:20] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2628 [22:38:14] notpeter, well that is strange. [22:38:18] can you put logging on debug? [22:38:37] New patchset: Hashar; "puppet local linter!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2629 [22:39:10] RECOVERY - Host db1035 is UP: PING OK - Packet loss = 0%, RTA = 27.08 ms [22:39:39] LeslieCarr: Isn't https://gerrit.wikimedia.org/r/#change,2629 something you already wrote? 
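On the earlier question of how the stafford check can report "HTTP OK HTTP/1.1 400 Bad Request": the most likely reason (an assumption, since the actual nagios command isn't shown in the log) is that the service uses check_http's -e/--expect option. With -e, the plugin only matches the given string against the response status line and skips the normal 3xx/4xx/5xx handling, so a bare 400 from the puppetmaster still counts as a successful probe:

    # illustrative check command, not the real definition from the nagios config
    /usr/lib/nagios/plugins/check_http -H stafford.pmtpa.wmnet -p 8140 --ssl -e 'HTTP/1.1'
    # -> HTTP OK HTTP/1.1 400 Bad Request - ...  (OK because the status line contains "HTTP/1.1")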
[22:40:04] yeah, it's in the base branch called puppet-lint [22:40:11] hashar: ---^^ [22:40:12] but it's pretty basic, if this is nicer, then yay [22:40:40] rainman-sr: gave you ownershit of configs [22:40:46] *ownership [22:41:16] can i also restart the deamon? [22:41:29] or can you do it [22:42:01] PROBLEM - mysqld processes on db1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:42:17] it's running as the lsearch user, so you might be able to? [22:42:17] otherise I can [22:42:20] is now restarted [22:42:36] ok thanks, i got what i needed [22:42:52] Calling getIndexTime([eswiki]) on searchidx2.pmtpa.wmnet [22:43:41] that means that on lucene end everything is fine [22:43:50] it seems there is some java problem with resolving that address [22:43:58] i.e. from java api [22:44:12] because after that debug message i'm just calling java api to get a remote RMI registry [22:45:49] actually to call RMI on that host [22:47:16] RECOVERY - mysqld processes on db1035 is OK: PROCS OK: 1 process with command name mysqld [22:48:38] hrm, alright. what do you think a reasonable workaround would be? [22:48:56] an /etc/hosts hack works, but might there be something better? [22:51:10] PROBLEM - MySQL Slave Running on db1035 is CRITICAL: CRIT replication Slave_IO_Running: No Slave_SQL_Running: No Last_Error: Rollback done for prepared transaction because its XID was not in the [22:51:55] rainman-sr: they are running slightly different versions of java. it seems unlikely, but could that be an issue? [22:53:20] notpeter, i doubt it [22:53:49] yeah. seems unlikely [22:55:22] well i guess googling around, maybe finding some kind of java app to test if the name resolving works properly [22:55:43] i had it before that java is a bit stupid in resolving names [22:55:49] sometimes it cannot even get hostname right [22:57:49] ok, cool, I can look into this more [22:58:12] my other big question is: what packages are needed for the indexer to work properly? [22:58:40] in addition to whatever would be present on any other search node [23:03:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:04:13] RECOVERY - MySQL Slave Running on db1035 is OK: OK replication [23:04:43] !log db1035 is fubar after crashing during schema migrations, running a hotbackup from db1019 [23:04:45] Logged the message, Master [23:06:55] PROBLEM - mysqld processes on db1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:08:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.934 seconds [23:09:14] notpeter, not sure offhand, i guess only curl [23:10:27] !log truncated 4 tables on db40 [23:10:28] Logged the message, Master [23:10:52] rainman-sr: anything for language support? [23:11:12] or, character support, I suppose [23:14:49] notpeter, nope, i don't think so. it needs access to mediawiki for message files to read localization, and to initialisesettings to read some global configs [23:42:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.844 seconds [23:54:19] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [23:57:19] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours
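For the "Unknown host: searchidx2" RMI failure discussed above, a tiny resolver test of the kind rainman-sr suggests shows what the JVM itself can and cannot resolve. Note that the error names the short hostname even though the config uses the FQDN, which hints the RMI stub handed back by the registry carries an unqualified name — that would also explain why connecting by IP fails and why the /etc/hosts hack works. The class below is just that throwaway test, not part of lucene-search, and the -D fix mentioned in the comment is an assumption rather than something confirmed in the log:

    import java.net.InetAddress;

    public class ResolveTest {
        public static void main(String[] args) throws Exception {
            // try both the name the daemon is configured with and the one in the error
            for (String host : new String[] { "searchidx2.pmtpa.wmnet", "searchidx2" }) {
                try {
                    System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
                } catch (Exception e) {
                    System.out.println(host + " -> " + e);
                }
            }
        }
    }
    // if the short name is the one that fails, adding it to /etc/hosts (the hack above) or having
    // the indexer advertise itself with -Djava.rmi.server.hostname=searchidx2.pmtpa.wmnet are the
    // usual workarounds.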