[00:03:00] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 241 seconds [00:03:27] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 270 seconds [00:04:48] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [00:05:24] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.876 seconds [00:05:24] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.883 seconds [00:09:18] PROBLEM - Lucene on search6 is CRITICAL: Connection timed out [00:09:18] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:18] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:12:00] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.631 seconds [00:12:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.625 seconds [00:13:12] RECOVERY - Lucene on search6 is OK: TCP OK - 8.992 second response time on port 8123 [00:14:51] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [00:15:54] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:15:54] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:19:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.672 seconds [00:19:57] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.699 seconds [00:21:18] PROBLEM - Lucene on search6 is CRITICAL: Connection timed out [00:25:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:39] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [00:25:39] so how do the varnish caches decide to send something to wap on ekrem ? [00:25:58] could we maybe have them send directly to mobile.wikipedia.org ? [00:25:59] they don't [00:26:02] bypassing the server ? [00:26:07] oh ? [00:26:10] varnish doesn't hit that server at all [00:26:15] RECOVERY - Lucene on search6 is OK: TCP OK - 0.002 second response time on port 8123 [00:26:28] so if i'm reading correctly, it looks like most of the requests are coming from 10.64.0.137 (cp1015) [00:31:48] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.865 seconds [00:31:48] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.863 seconds [00:32:04] ok, i only know about varnish re: that stuff. ekrem is a mysterious one off and i'm not sure what squd sends there.. Ryan_Lane? ^^ [00:32:59] also, maybe switch the 302 to a 301 ? another possibly interesting tidbit is that they're all using HTTP/1.0 [00:33:22] squid uses http/1.0 [00:33:34] it's hi-tech like that [00:33:53] ah [00:33:54] :) [00:35:11] !log changing search15 to run regular search-pool2 indexes instead of highlights [00:35:14] Logged the message, Master [00:36:20] preilly: do you know anything about en.wap.wikimedia.org? [00:38:02] binasher: nope [00:38:52] maybe we should leave it down.. [00:40:46] probably should [00:40:56] LeslieCarr: ^^ [00:40:59] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:02] heh [00:41:04] good timing [00:41:07] I'm about to start deleting broken thumbs. 
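The 302-versus-301 suggestion above is easy to sanity-check from a shell: a HEAD request shows what ekrem currently hands back for a WAP hostname. The response lines below are illustrative rather than captured output, but they match the behaviour described later in the log (en.wap 302s to en.mobile).

    # what does ekrem return for the WAP front page right now?
    curl -sI http://en.wap.wikipedia.org/ | head -n 3
    # expected shape of the answer (illustrative):
    #   HTTP/1.1 302 Found
    #   Location: http://en.mobile.wikipedia.org/
    # a 301 would let squid/varnish and well-behaved clients cache the redirect,
    # so repeat visitors would skip the slow round trip through ekrem entirely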
[00:41:08] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:13] robla: there's depoly testing stuff going on, right? [00:41:17] well it's up right now - just rather broken [00:41:25] any chance I should hold off because of it? [00:41:31] or is just warning that I'm doing it sufficient? [00:41:41] so i only see one main referrer and that is tracfone [00:41:47] a little bit of yahoo mobile [00:41:53] maplebed: from swift alone? [00:41:54] LeslieCarr: while looking at logs, did you get an idea of what kind of requests are hitting it? othat htan from squid [00:42:07] what's a full url? [00:42:07] AaronSchulz: from swift and from squid. [00:42:15] maplebed: oh, right, of course [00:42:18] "http://us.m2.yahoo.com/w/ygo-onesearch?__redir=1&submit=oneSearch&.tsrc=attosus&first=1&p=wikipedia&bzc=55806" [00:42:25] or most commonly "http://m3.tracfone.com/search" [00:42:30] wap.wikipedia.org:80 10.64.0.137 - - [15/Feb/2012:20:59:39 +0000] "GET / HTTP/1.0" 302 550 "http://m3.tracfone.com/search" "LG-LG231C[TF268435460901625418000000019036500542] UP.Browser/6.2.3.8 (GUI) MMP/2.0 [00:42:44] maplebed: I think it should be fine. [00:42:47] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (24029) [00:42:47] I'd go ahead and watch the graphs [00:42:50] thanks robla. [00:42:54] http://m3.tracfone.com/mobilewebsites [00:43:03] the only bad effect I can forsee is ms5 getting overloaded if I accidentally purge too much too fast. [00:43:05] http://m3.tracfone.com/redir/wikipedia [00:43:09] ok, so en.wap just redirects to en.mobile [00:43:10] I'll watch ms5 and kill if I see issues. [00:43:25] maplebed: before proceeding though... [00:43:28] hehe i saw preilly's request from there :) [00:43:39] I get http://en.mobile.wikipedia.org/ [00:43:48] yeah, then it 302's to en.mobile [00:43:54] maplebed: let's move the conversation to #wikimedia-tech [00:43:57] did it take like 8-10 seconds to finally get there ? [00:44:13] yes [00:44:21] yeah, that's all ekrem delay :( [00:44:43] but i figured if we can push that redirect up to the squids ? [00:44:53] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.856 seconds [00:45:02] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.530 seconds [00:45:19] it would be better to move the dns to point to the mobile varnish cluster [00:45:27] and have varnish do the redirect [00:45:30] totally [00:45:42] LeslieCarr: want to take a shot at that? 
templates/varnish/mobile-frontend.inc.vcl.erb for the varnish bit [00:45:59] cool [00:48:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:49:05] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:52:32] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [00:52:32] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [00:53:39] ^ that's racist nagios [00:55:46] !log rebooting prototype.wikimedia.org [00:55:48] Logged the message, Master [00:56:53] New patchset: Lcarr; "Adding mobile wap to redirect to new mobile site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2620 [00:57:47] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 204 seconds [00:57:47] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 206 seconds [00:58:15] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2606 [00:58:15] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2606 [01:00:47] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.547 seconds [01:04:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:05] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.260 seconds [01:13:30] New patchset: Lcarr; "Adding mobile wap to redirect to new mobile site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2620 [01:14:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:15:13] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2620 [01:15:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2620 [01:18:11] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.373 seconds [01:19:50] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.946 seconds [01:22:14] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:24:20] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:56] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.831 seconds [01:25:32] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:27:56] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:28:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:32:28] binasher: so do we need to restart varnish on the machines or should it pick it up ? [01:32:37] it should pick it up [01:32:42] cool [01:34:14] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.483 seconds [01:34:32] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.402 seconds [01:37:23] PROBLEM - Host formey is DOWN: CRITICAL - Host Unreachable (208.80.152.147) [01:37:59] hrm, so queries are still going through, how would i check which hosts the lvs forwards these on to ? (there's no "wap" section on the list on noc ) binasher ? 
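For the DNS half of the plan discussed above, the per-language WAP names have to stop resolving through wikipedia-lb and point at the eqiad mobile varnish cluster instead. Wikimedia's zones are generated from pdns templates with language-list expansion (hence the formatting worry), so the real change lives in a template, but the intended end state per language is roughly the following BIND-style sketch; whether the template expands these to CNAMEs or to A records is a detail of the generator, not the point.

    ; before: each $lang.wap.wikipedia.org resolved via the general-purpose LVS name
    ; en.wap   IN CNAME wikipedia-lb.wikimedia.org.
    ;
    ; after: send WAP traffic straight to the mobile cluster, which will serve the redirect
    en.wap     IN CNAME mobile-lb.eqiad.wikimedia.org.
    de.wap     IN CNAME mobile-lb.eqiad.wikimedia.org.
    ; ...one record per language in the language list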
[01:38:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:26] RECOVERY - Host formey is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [01:38:35] LeslieCarr: after varnish gets updated, next thing is to change dns [01:38:35] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:40:17] okay, so it currently points to wikipedia-lb.wikimedia.org [01:44:57] mobile-lb.eqiad.wikimedia.org is what it should be instead [01:45:41] i'm iffy on the formatting in the pdns templates for the language A record expansion [01:47:26] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.538 seconds [01:50:35] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [01:50:44] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [01:51:38] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:52:50] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 196 seconds [01:54:11] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.382 seconds [01:55:32] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds [01:55:59] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 613s [01:56:06] !log 1.19 schema migraitons now running on enwiki slaves [01:56:08] Logged the message, Master [01:56:17] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 628s [02:00:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:01:14] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.744 seconds [02:01:59] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 216 seconds [02:02:08] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 223 seconds [02:03:38] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.884 seconds [02:05:08] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:13:23] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:14:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:16:00] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [02:20:05] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [02:21:53] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [02:22:16] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2619 [02:23:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.256 seconds [02:27:38] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:58] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [02:39:47] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.441 seconds [02:43:32] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:44:26] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:35] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:49:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.916 seconds [02:53:44] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:56:17] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.652 seconds [03:04:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:59] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 5.572 seconds [03:09:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:32] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.927 seconds [03:14:26] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:30] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.433 seconds [03:20:17] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.360 seconds [03:25:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:38:42] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.886 seconds [03:38:42] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds [03:38:51] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 0 seconds [03:41:15] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 195 seconds [03:42:09] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 196 seconds [03:42:45] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:45:45] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.149 seconds [03:46:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.472 seconds [03:49:48] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:50:51] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:52:21] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [03:56:24] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [03:56:33] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
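The "new logrotate for search nodes" patchset above is just a rotation policy for the Lucene daemon's log. The path and retention numbers below are assumptions for illustration, not the contents of the actual change:

    # /etc/logrotate.d/lucene-search (illustrative)
    /a/search/log/log {
        daily
        rotate 7
        compress
        delaycompress
        missingok
        notifempty
        copytruncate    # the search daemon keeps its log file handle open, so truncate in place
    }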
[03:57:45] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [03:58:21] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [04:00:09] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.563 seconds [04:03:18] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.258 seconds [04:04:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:07:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:14:59] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.341 seconds [04:17:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.385 seconds [04:20:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:21:08] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:17] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [04:46:20] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:53] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [04:49:11] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.296 seconds [04:51:17] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [04:52:56] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [04:53:32] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [04:54:35] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:56] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:56:05] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:57:08] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [04:57:17] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [05:00:35] PROBLEM - MySQL Slave Delay on db38 is CRITICAL: CRIT replication delay 196 seconds [05:01:20] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 241 seconds [05:06:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.554 seconds [05:11:06] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:15] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.704 seconds [05:26:33] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:27:18] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:29:06] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [05:53:15] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.856 seconds [05:57:18] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:06:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.843 seconds [06:09:12] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:09:39] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds [06:09:57] RECOVERY - MySQL Slave Delay on db38 is OK: OK replication delay 0 seconds [06:09:57] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.733 seconds [06:10:24] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [06:13:42] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:14:00] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:21:30] PROBLEM - MySQL Replication Heartbeat on db52 is CRITICAL: CRIT replication delay 213 seconds [06:21:39] PROBLEM - MySQL Slave Delay on db52 is CRITICAL: CRIT replication delay 221 seconds [06:38:48] RECOVERY - MySQL Slave Delay on db52 is OK: OK replication delay 0 seconds [06:38:48] RECOVERY - MySQL Replication Heartbeat on db52 is OK: OK replication delay 0 seconds [06:42:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 245 seconds [06:43:00] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 254 seconds [06:49:00] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:12] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [06:54:09] New review: Hashar; "Thanks Leslie!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2606 [07:01:09] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:03:42] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [07:36:18] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [07:36:18] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [07:53:33] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [08:14:37] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:15:49] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [10:40:42] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [10:44:27] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [10:49:07] New patchset: Catrope; "Define BINDIR in purge-checkuser" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2621 [10:49:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2621 [10:58:18] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2621 [10:58:18] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2621 [11:02:54] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:04:06] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [11:21:18] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 120 MB (1% inode=62%): /var/lib/ureadahead/debugfs 120 MB (1% inode=62%): [11:25:30] RECOVERY - Disk space on srv224 is OK: DISK OK [11:47:17] !log Moved udpmcast unicast-to-multicast HTCP relay from lily to hooft [11:47:20] Logged the message, Master [12:14:15] !log Shutdown lily for decommissioning [12:14:17] Logged the message, Master [12:18:09] PROBLEM - Host lily is DOWN: PING CRITICAL - Packet loss = 100% [13:04:25] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 189 MB (2% inode=62%): /var/lib/ureadahead/debugfs 189 MB (2% inode=62%): [13:09:58] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 194 MB (2% inode=62%): /var/lib/ureadahead/debugfs 194 MB (2% inode=62%): [13:11:19] RECOVERY - Disk space on srv223 is OK: DISK OK [13:37:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2616 [13:37:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2616 [13:37:38] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2619 [13:37:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [13:53:36] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [13:57:39] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [13:59:36] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [14:25:33] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:39] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [14:30:57] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [14:33:39] PROBLEM - SSH on search1003 is CRITICAL: Connection refused [14:33:57] PROBLEM - RAID on search1003 is CRITICAL: Connection refused by host [14:33:57] PROBLEM - DPKG on search1003 is CRITICAL: Connection refused by host [14:34:24] PROBLEM - Disk space on search1003 is CRITICAL: Connection refused by host [14:37:42] PROBLEM - Lucene on search1003 is CRITICAL: Connection refused [14:43:15] PROBLEM - Disk space on mw40 is CRITICAL: DISK CRITICAL - free space: /tmp 60 MB (3% inode=87%): [14:44:18] RECOVERY - SSH on search1003 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:50:04] RECOVERY - Disk space on mw40 is OK: DISK OK [14:52:19] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [15:02:48] New patchset: Pyoungmeister; "usinga more up to date db list, by roan's suggestion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2622 [15:03:23] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2622 [15:03:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2622 [15:03:52] PROBLEM - NTP on search1003 is CRITICAL: NTP CRITICAL: No response from NTP server [15:05:23] merged in a change by tim. going to assume that's ok [15:46:55] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:48:16] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [16:18:11] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:35] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 31.01 ms [16:26:26] PROBLEM - DPKG on search1001 is CRITICAL: Connection refused by host [16:26:35] PROBLEM - Disk space on search1001 is CRITICAL: Connection refused by host [16:27:02] PROBLEM - RAID on search1001 is CRITICAL: Connection refused by host [16:27:32] nagios is funny: Max concurrent service checks (512) has been reached [16:27:47] PROBLEM - SSH on search1001 is CRITICAL: Connection refused [16:30:38] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [16:33:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:35] RECOVERY - SSH on search1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:35:44] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.68473669565 (gt 8.0) [16:37:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.755 seconds [16:45:29] RECOVERY - RAID on search1001 is OK: OK: no RAID installed [16:46:14] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.93967701754 [16:46:14] RECOVERY - DPKG on search1001 is OK: All packages OK [16:46:14] RECOVERY - Disk space on search1001 is OK: DISK OK [16:47:59] robh: can you give me status of sq31, 35, 38 i believe they all have the PCI Training error. [16:48:30] checking sq31 [16:48:51] sq31 is in the installer [16:49:36] thx...ms-be2 is in that rack.. i have the power but if I am pulling those off...i want to keep my power balanced ;] [16:50:57] cmjohnson1: why do you ask, they have error leds? [16:51:18] I am going to drop a ticket to reinstall sq32, the others had the com2 timeout, so resetting their drac to check them out [16:51:35] yes they do [16:53:20] !log rebooting sq35 & sq38, serial console blank [16:53:22] Logged the message, RobH [16:53:40] cmjohnson1: lets see how they come back up =] [16:54:09] cmjohnson1: sq35 confirmed, pci training error [16:54:15] sorry, thats 38 [16:54:28] cmjohnson1: sq35 is refusing to serial console, you may have to connect cart to it [16:54:33] but its offline at the moment, confirmed [16:54:56] k...i will take a look at it [16:55:05] sq38 confirmed pci training error [16:55:11] thats the hdd controller exploding right? [16:55:18] if i recall [16:55:19] yes [16:55:32] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.031 second response time on port 8123 [16:55:47] very common problem with those 1950's [16:56:08] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2438* [16:56:25] cmjohnson1: So it looks like sq31, 32, 35, and 38 are decom [16:56:41] so go ahead and pull and wipe all three of them [16:57:05] well, cannot wipe in the servers, as they are borked, but you know what i mean [16:57:23] okay...plz put it in a ticket...yep.. 
i have a backup plan for those [16:58:22] !rt 2473 [16:58:22] https://rt.wikimedia.org/Ticket/Display.html?id=2473 [16:58:30] cmjohnson1: ^ [17:08:53] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 172 MB (2% inode=62%): /var/lib/ureadahead/debugfs 172 MB (2% inode=62%): [17:08:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:11] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 268 MB (3% inode=62%): /var/lib/ureadahead/debugfs 268 MB (3% inode=62%): [17:12:47] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 272 MB (3% inode=62%): /var/lib/ureadahead/debugfs 272 MB (3% inode=62%): [17:15:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.952 seconds [17:15:29] RECOVERY - Disk space on srv221 is OK: DISK OK [17:48:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.255 seconds [18:08:19] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:08:59] uhhh, how is 400 == OK? [18:11:01] nagios-wm: ping! [18:12:19] seems to have been like that a while: http://nagios.wikimedia.org/nagios/cgi-bin/history.cgi?host=stafford&service=Puppetmaster+HTTPS [18:13:43] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 30.87 ms [18:16:52] PROBLEM - Disk space on search1002 is CRITICAL: Connection refused by host [18:17:01] PROBLEM - RAID on search1002 is CRITICAL: Connection refused by host [18:17:19] PROBLEM - SSH on search1002 is CRITICAL: Connection refused [18:17:28] PROBLEM - DPKG on search1002 is CRITICAL: Connection refused by host [18:20:46] PROBLEM - Lucene on search1002 is CRITICAL: Connection refused [18:22:22] Ryan_Lane: lemme know when you're ready to "fix" dns [18:22:27] heh [18:22:33] ok. so, what are we actually doing again? [18:23:23] we need to switch wap's dns from pointing to wikipedia-lb.wikimedia.org and repoint it to mobile-lb.eqiad.wikimedia.org [18:23:31] and it uses the language list [18:23:42] so, what's the actual wap address? [18:24:03] $lang.wap.wikipedia.org [18:24:07] ok [18:26:37] RECOVERY - SSH on search1002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:26:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:50] when you're started on it i'd love to see what you're doing so i can break it next time :) [18:32:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.469 seconds [18:33:40] !log updating $lang.wap.wikipedia.org dns to point to mobile-lb.eqiad.wikimedia.org [18:33:42] Logged the message, Mistress of the network gear. [18:37:28] !log reverting $lang.wap.wikipedia.org dns changes [18:37:30] Logged the message, Mistress of the network gear. [18:39:41] New patchset: Ottomata; "Buncha mini changes + hackiness to parse a few things. This really needs more work" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2623 [18:42:43] New review: Diederik; "Ok." 
[analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2623 [18:44:55] PROBLEM - NTP on search1002 is CRITICAL: NTP CRITICAL: No response from NTP server [18:48:50] hrm, so dobson seems to be stuck at Reloading zones in PowerDNS [18:49:03] i even tried restarting authdns-update [18:49:09] err [18:49:51] LeslieCarr: what vlan are the labstore hosts in? [18:50:30] the instances will need to have network access to it [18:50:57] they'll need to be able to reach the build host too [18:51:04] at least at first [18:52:01] they're in the normal internal vlan [18:52:24] so we should probably dual network them for now ? [18:52:37] 2nd one in the 103 vlan [18:55:23] well peachy, whatever pdns_control is wanting to talk to, ain't on the other end [18:55:40] strace -p 6949 [18:55:40] Process 6949 attached - interrupt to quit [18:55:40] recvfrom(3, [18:56:42] hrm [18:57:22] according to the intertubes, that's how it should be talking to running pdns servers … and the other pdns servers appear to be running... [18:58:04] well.. you see it doing anything? [18:59:44] oh, look at that - dns request to ns0 timed out [18:59:49] i'll restart pdns [19:02:18] well it's stopped :-D [19:02:37] !log restarted pdns on ns0 [19:02:39] Logged the message, Mistress of the network gear. [19:02:43] PROBLEM - Lighttpd HTTP on dataset2 is CRITICAL: Connection refused [19:02:43] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.020 seconds response time. www.wikipedia.org returns 208.80.154.225 [19:03:01] hrm,why didn't nagios tell us it was down [19:03:20] so the next q is whether the reload went [19:03:34] ah phooey [19:03:56] yay it's back to the old config [19:04:42] !log restarted lighty on dataset2, silly thing [19:04:44] Logged the message, Master [19:04:46] ryan_lane: did you have any more problems with labstore1? [19:04:54] back to the slightly broken config… :) [19:04:59] yeah :-D [19:05:00] oh. I need to build the raidset there [19:05:08] but hey it's serving! [19:05:08] instead of completely broken [19:05:15] hehe [19:05:16] yep :) [19:05:16] RECOVERY - Lighttpd HTTP on dataset2 is OK: HTTP OK HTTP/1.0 200 OK - 4906 bytes in 0.022 seconds [19:05:24] so how is athens right now ? [19:05:31] um [19:05:38] that's the 64 thousand dollar question [19:05:41] or is that [19:05:49] 93 billion euros... [19:06:22] the rumor mill really is flying these days, and every time we get close to a deal the target gets moved [19:06:31] so I really have no idea, neither does anyone else I guess [19:06:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:46] really? [19:06:51] * apergos glares at it [19:09:00] load average 63?? [19:09:11] dude, it's been going crazy, like spiking up and down [19:09:21] awesome :-/ [19:09:36] cmjohnson1: same problem [19:10:03] maybe 16 cpu's isnt enough - do any of the new machines have more ? [19:10:05] RobH: ? [19:10:14] ? [19:10:18] cmjohnson1: when I made the raid, it's saying degraded, and the disk goes between rebuild and missing [19:10:22] LeslieCarr: some do [19:10:38] like apergos enwiki xml snapshot machines [19:10:40] Dynamic lookup of $gid at /var/lib/git/operations/puppet/manifests/admins.pp:975 is deprecated. Support will be removed in Puppet 2.8. Use a fully-qualified variable name (e.g., $classname::variable) or parameterized classes. [19:10:41] but thats about it [19:10:43] same disk? 
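When authdns-update hangs at "Reloading zones in PowerDNS" and pdns_control sits in recvfrom() the way the strace above shows, the far end of the control socket usually isn't answering. Assuming the stock authoritative-server control commands, a few pdns_control probes narrow it down before reaching for a full restart:

    # is the daemon actually alive on its control socket?
    pdns_control ping
    # how long has it been up (did it silently crash and restart)?
    pdns_control uptime
    # ask it to pick up zone changes without a restart
    pdns_control rediscover   # scan the backend for newly added zones
    pdns_control reload       # reload data for zones it already knows about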
[19:11:11] cmjohnson1: 01:00:02: [19:11:12] yep [19:11:15] taking up a bunch of space in the lgo but it shouldn't make things so much slower [19:11:31] snapshot4 and snapshot-sometingorother in eqiad [19:11:38] okay...i will phone Dell and see if we can that disk replaced [19:11:43] those have 32 cores (4 8-core I think) [19:11:45] thabks [19:11:47] *thanks [19:12:25] hrm [19:12:34] with row c coming… [19:12:40] think we should move up ? [19:12:50] apergos: snapshot1001 [19:13:05] I feel like ever beefier puppet hosts is the wrong move [19:13:11] somehow we gotta split the workload up [19:13:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.594 seconds [19:13:43] we have a dedicated host for puppet in eqiad its just not setup [19:13:49] rubidium [19:13:54] that's the way to do it [19:13:56] but its a high performance misc server [19:14:07] so only 12 cpu cores. [19:14:10] uhh [19:14:13] only, heh. [19:14:26] well we'll see when it handles eqiad only, maybe that will be enough [19:14:36] honestly 12 cores for puppet? it's not like we have a gazillion hosts [19:16:15] oh [19:16:29] well yeah, if we split it up it should be okay [19:16:47] but puppet is written in ruby ;) [19:16:50] * apergos gripes some about [19:16:54] rats you are faster than me [19:16:57] ... ruby :-D [19:16:59] * AaronSchulz wonders which dbs the api uses for slaves [19:18:18] and it's not in the cool dbtree graph! [19:20:27] cmjohnson1: when do you want to do the wap installs ? [19:22:42] so, the dumb q, are the apis actually split off to separate dbs? cause... [19:23:42] I kinda think not, there's just a separate pool of app servers [19:24:15] New patchset: Lcarr; "Moving generic::tcptweaks to "standard" server class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2624 [19:26:04] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2624 [19:26:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2624 [19:35:23] if there's another stafford alarm that one is probably me :) [19:36:10] cool [19:36:19] (it's really past mym worktime anyways) [19:40:21] !log running /etc/network/if-up.d/initcwnd on the apaches [19:40:23] Logged the message, Mistress of the network gear. [19:40:44] yay [19:40:54] i love it when everything just works :) [19:41:15] :-) [19:46:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.942 seconds [20:03:16] lesliecarr: want to do it now? [20:10:49] PROBLEM - Disk space on mw15 is CRITICAL: DISK CRITICAL - free space: /tmp 18 MB (1% inode=87%): [20:25:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:16] RECOVERY - Disk space on mw15 is OK: DISK OK [20:30:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.918 seconds [20:33:55] PROBLEM - HTTP on singer is CRITICAL: Connection refused [20:34:07] looks like ganglia is broken [20:42:19] wtf [20:42:28] php was delete from nickel [20:42:29] rc libapache2-mod-php5 5.3.2-2wm1 server-side, HTML-embedded scripting languag [20:46:46] woosters: https://rt.wikimedia.org/Ticket/Display.html?id=2452 ... I can haz some help? 
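The deprecation warning quoted a little earlier ("Dynamic lookup of $gid at .../admins.pp:975 is deprecated") is puppet complaining about dynamic variable scoping, which disappears in 2.8. A minimal sketch of the two fixes the warning itself suggests, with the class and group names invented for illustration:

    # before: $gid is resolved dynamically from whatever scope happened to include this class
    class admins::restricted {
        group { 'deploy': gid => $gid }
    }

    # after, option 1: fully-qualified lookup of the variable where it is actually defined
    class admins::restricted {
        group { 'deploy': gid => $admins::baseaccount::gid }
    }

    # after, option 2: a parameterized class, so the caller passes the value explicitly
    class admins::restricted($gid) {
        group { 'deploy': gid => $gid }
    }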
[20:48:44] let me look into it and get back to u [20:48:47] hexmode [20:49:02] woosters: ty :) [20:49:27] woosters: if we could get it done before we roll out 1.19 further, that'd be a bonus :) [20:50:29] cmjohnson1: yo [20:50:33] binasher: ganglia broken ? [20:51:04] Lesliecarr: hi [20:51:23] sorery i was at lunch [20:51:30] puppet auto updating apache caused it to remove php [20:51:48] np...i missed u the first time....ms-be's are all set for you [20:51:56] ensure => latest strikes again [20:52:25] oh noes puppet [20:52:28] bad puppet [20:52:36] heh [20:52:45] wow. that's kind of crazy [20:52:51] is singer http meant to be down? [20:52:53] so binasher the dns change worked perfectly…. however all queries were then redirected to incubator.wikimedia.org [20:52:55] you know what we need? [20:53:03] a package management system [20:53:07] hahahaha [20:53:07] a puppet-stabbing knife? [20:53:08] fucking canonical [20:53:26] LeslieCarr: did you change the dns back? [20:53:50] i did [20:54:21] Warning: DocumentRoot [/srv/org/wikimedia/url/] does not exist [20:54:24] from singer [20:54:28] so, no http [20:54:34] maybe when we get the new labs ops person in I can spend some time to write an opensource package management system [20:54:52] LeslieCarr: looks like mediawiki / apache does that redirect [20:55:05] so you'll have to actually have varnish serve a redirect to mobile instead of doing the rewrite [20:55:16] yeah, that's apache2 being updated over there. [20:55:40] do the prod app servers also have apache2 on ensure latest? [20:56:28] Ryan_Lane: serious question from a non-debian-world user, is apt not a package-management system? [20:58:52] ok, i think the prod appservers don't get apache installed directly via puppet, so they should be safe [20:59:58] Jeff_Green: it is. [21:00:05] for a single system [21:00:07] I mean for for an entire network [21:00:13] *one for [21:00:35] ah i see [21:01:43] landscape, for instance is one of these [21:01:51] it's a canonical product, and is closed source [21:02:00] we could make huge improvements even just by working more intentionally with packaging [21:02:10] how so? [21:02:17] it's the thing we do the *worst* right now [21:02:18] binasher: so should i do a redirect doing something like this ? https://www.varnish-cache.org/trac/wiki/VCLExampleRedirectInVCL [21:02:23] we have no way of tracking it [21:02:25] Ryan_Lane: yeah, that's what I mean [21:02:48] using built-in dependency management for one [21:02:55] actually I wonder what is still on singer [21:02:59] well, for our own custom packages we do that [21:03:05] most of this crap in here is not really on this host any more [21:03:11] LeslieCarr: there should be a redirect in the vcl already, do it like that [21:03:14] though we've been getting away from packaging most things to using puppeyt [21:03:16] *puppet [21:03:28] we package software, then configure it with puppet. it's a saner model [21:03:47] Ryan_Lane: right but we also put dependencies in puppet [21:03:51] LeslieCarr: search for 66 [21:03:53] 666 [21:03:59] Jeff_Green: only kind of [21:04:13] dang. 
it still is [21:04:14] for example we could use meta packages for services [21:04:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:04:20] eww [21:04:22] no thanks [21:04:25] ah [21:04:39] we also have no automated build system [21:04:52] Ryan_Lane: yes, that's necessary [21:04:57] we've oriented our development model towards puppet and away from packaging [21:05:06] we want an automated build system. there's plans for it [21:05:13] but, we have to have time to get to it [21:05:15] I'm coming from a shop where we built an automated build system and used meta packages [21:05:18] yeah [21:05:23] I dislike meta packages [21:05:23] binasher: so can we do the \1 \2 bits with the redirect, to preserve the language ? [21:05:23] and it was a hell of a lot smoother and more consistent [21:05:33] understandable, and I'm not saying that's the only way [21:05:38] * Ryan_Lane nods [21:05:46] I'd love to have an automated build system, though [21:05:56] but we have a dependency tree for system software, and at the interface between system software and service software [21:06:18] what did url.wikimedia.org do? [21:06:20] yeah. our puppet problems don't usually stem from that, though [21:06:24] so we shove puppet in between those, and work around the glitches with pinning and whatnot [21:06:38] we use pinning because we use our repo poorlt [21:06:40] *poorly [21:06:47] yeah [21:06:58] for it to work really well you have to take control of the whole shebang [21:07:06] LeslieCarr: you could still rewrite the host header and then: error 666 "http://" + req.http.host + req.url; [21:07:19] none of this is an issue with how we're using puppet, though [21:07:29] it's a problem with no one taking the time to fix the apt issues :) [21:07:40] ah cool :) [21:07:45] good idea [21:08:04] Jeff_Green: the fundraiser is over, want to fix our apt issues? :D [21:08:28] Ryan_Lane: sure! [21:08:31] \o/ [21:09:10] i'm planning to do this for the payments cluster anyway, perhaps we can use it as a model and/or test bed [21:09:15] cool [21:09:19] um, folks... I am actually serious about this. singer apache is failing to restart because it wants the docroot /srv/org/wikimedia/url/ for url.wikimedia.org, and I have no idea what that service is or did, or what should be in there [21:09:21] LeslieCarr: am i plugging in the same static LAN info I already have? [21:09:21] I wanted to build the build system in labs [21:09:25] does anyone know? [21:09:38] oh [21:09:39] LeslieCarr: [21:09:43] Jeff_Green: and in labs, anyone can build a package and have it added to a project specific repo [21:09:50] the original varnish change was a fail [21:09:53] cmjohnson1: what's the static lan info you already have? :) [21:10:01] Message from VCC-compiler: [21:10:01] Syntax error at [21:10:01] ('mobile-frontend.inc.vcl' Line 90 Pos 106) [21:10:02] set req.http.host = regsub( req.http.host, "^([a-zA-Z0-9-]+)\.wap\.([a-zA-Z0-9-]+)\.org" \1.mobile.\2.org); [21:10:02] ---------------------------------------------------------------------------------------------------------#----------------- [21:10:02] Running VCC-compiler failed, exit 1 [21:10:20] then, we can code-review the package, and move it into a common labs repo [21:10:22] ah well that would also kill it… [21:10:23] hrm [21:10:27] Ryan_Lane: ok. I have a lot to learn about labs. 
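The "ensure => latest strikes again" incident above (apache2 auto-upgrading on nickel and taking libapache2-mod-php5 with it) is the classic argument for not letting puppet chase the newest package, and it ties into the pinning/repo discussion here. A sketch of the safer patterns, with the version string made up for illustration:

    # risky: every puppet run may upgrade apache2, and apt will happily remove
    # conflicting packages (here, the PHP module) to satisfy the new dependencies
    package { 'apache2': ensure => latest }

    # safer: install it once, upgrade deliberately
    package { 'apache2': ensure => present }

    # or pin an explicit, tested version (illustrative version number)
    package { 'apache2': ensure => '2.2.14-5ubuntu8.8' }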
[21:10:28] it would automatically get signed for labs [21:10:37] then we can choose to move it to production [21:10:40] oh forgot a comma [21:10:40] where we'd manually sign it [21:10:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.953 seconds [21:10:50] Jeff_Green: I can give you a quick overview [21:10:59] ok [21:11:09] labs is based on openstack [21:11:15] openstack has a concept of "tenancy" [21:11:35] openstack nova calls tenants projects. I'll only use the term project ;) [21:12:13] a project is basically a collection of resources that is separated, security wise, from other projects [21:12:24] people can be members of projects, and have roles in the projects [21:12:32] what does the separation look like? [21:12:38] network address space? tools scope? [21:12:39] New patchset: Lcarr; "Fixing varnish language" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2625 [21:12:46] binasher: ^^ [21:12:49] from the network perspective, it's firewall rules [21:12:55] deny by default [21:13:08] PROBLEM - Varnish HTTP mobile-frontend on cp1042 is CRITICAL: Connection refused [21:13:10] this is at the virtualization layer [21:13:14] ok [21:13:15] LeslieCarr: looks good, sorry i missed that reviewing it before [21:13:26] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2625 [21:13:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2625 [21:13:47] inside of projects, people in sysadmin role can create/delete instances (virtual machines) [21:14:12] netadmins can modify the security groups (firewall rules), and manage public IPs and public DNS [21:14:22] no prob, commas are the bane of life [21:14:46] Ryan_Lane: but netadmins modify within a project? [21:14:47] so, in labs, we are trying to work with the multi-tenancy concept as much as possible, because it's fairly powerful [21:14:57] yes. netadmin and sysadmin are per-project [21:15:11] New patchset: Diederik; "Added full support for ip address and ip range filtering Added full support for regular expression matching" [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2626 [21:15:24] so, we try to give people full control over everything in a project [21:15:39] meaning, they can make all changes we'd make inside of production [21:15:54] so, if we have an automated build system and custom repos, a project should too [21:15:58] same with puppet [21:16:09] we're planning a puppet branch per project [21:16:14] where people can bypass review [21:16:21] right [21:16:29] they'll merge into test, then we'll cherry-pick to production [21:16:32] !log stopping mysql and apache on searchidx2... not sure why they are there. also, going to clean up some packages... like the ubuntu version of mediawiki [21:16:35] Logged the message, and now dispaching a T1000 to your position to terminate you. [21:16:57] the key is that we should be able to code-review before project-specific things become labs-wide and production-wide [21:17:02] same with packaging [21:17:10] we were planning a git repo per package [21:17:18] makes sense [21:18:05] Ryan_Lane: question: will packages necessarily be .deb in this model, or will it be agnostic? [21:18:14] deb [21:18:45] I can't see a reason we'd need another [21:18:51] we only support ubuntu [21:19:01] depends what the goal is [21:19:13] but anyway, I was just curious [21:19:41] can we support various OS installs on the virtual machines? 
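Putting the pieces of the VCL change together — the host rewrite that was missing its comma, plus binasher's error-666 trick for actually emitting the redirect — the working pattern looks roughly like the sketch below (the file is templates/varnish/mobile-frontend.inc.vcl.erb; this is the shape of the fix, not the merged diff):

    sub vcl_recv {
        if (req.http.host ~ "^([a-zA-Z0-9-]+)\.wap\.([a-zA-Z0-9-]+)\.org$") {
            # note the comma and the quoted replacement that the first patchset dropped
            set req.http.host = regsub(req.http.host,
                "^([a-zA-Z0-9-]+)\.wap\.([a-zA-Z0-9-]+)\.org", "\1.mobile.\2.org");
            # hand the full target URL to vcl_error via a private status code
            error 666 "http://" + req.http.host + req.url;
        }
    }

    sub vcl_error {
        if (obj.status == 666) {
            set obj.http.Location = obj.response;
            set obj.status = 301;   # or 302, whichever matches the old ekrem behaviour
            return (deliver);
        }
    }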
[21:19:59] Is secure.wiikimedia.org supposed to be down? [21:20:11] i mean different ubuntu releases for example [21:20:37] Jeff_Green: yep [21:20:51] Jeff_Green: right now we have lucid, natty, and maverick [21:20:55] or [21:21:00] cool [21:21:00] no. oneiric [21:21:05] we'll have precise soon [21:21:14] we can have anything we want, really, thoguh [21:21:15] *though [21:21:30] so perhaps next week I can get started with one or two projects [21:21:44] I want to get a search cluster going [21:21:51] heh [21:21:52] and one for otrs [21:22:00] there's already a project for otrs [21:22:06] the otrs guy has access to it [21:22:11] I can add you in [21:23:00] btw for humors sake i was just complaining about lucene to a CL friend and he said "omg sphinx is the only way, we do 400 million queries a day on that thing" [21:23:17] * Ryan_Lane has no clue about search [21:23:25] I think we have a search project or two as wel [21:23:27] *well [21:23:35] !projects [21:23:35] https://labsconsole.wikimedia.org/wiki/Special:Ask/-5B-5BResource-20Type::project-5D-5D/-3FMember [21:23:38] * Jeff_Green spend the better part of today trying to understand it [21:23:43] <3 wm-bot [21:23:44] casey: no not really but no one can seem to tell me what should be in the files that are gone on singer now. or whether I could just disable the broken piece. or anything. so it's logged, and it's down. [21:24:04] casey: what's secure? [21:24:07] :D [21:24:08] back to the otrs guy--how did he end up with an account/project? [21:24:21] hm. I don't actually see a project [21:24:34] I guess he didn't get one yet [21:24:40] but he's supposed to be helping us out [21:24:55] yeah this is true, and I know we're planning to set him up [21:25:08] seems he doesn't have an account or project :D [21:25:18] I'll make one for you, and add you in [21:25:24] wait, don't! [21:25:27] ok [21:25:28] apergos: okay, as long as it's logged. :-) I clicked on a link to OTRS using the old URL (secure.wikimedia.org/otrs/index.pl) and saw it wasn't up. I just wanted to make sure that we didn't intentionally take it down for good without leaving up a redirect. Thanks! [21:25:34] it'll be good for me to get that exposure [21:25:38] sounds good [21:25:44] lemme make sure you have the right permissions to do so [21:25:54] we're still meeting tomorrow about all that right? [21:25:56] sure. [21:26:03] err the otrs volunteer thing i mean [21:26:04] wish I could actually do something useful about it [21:26:08] we are? [21:26:15] re. data privacy issues [21:26:17] am I in that meeting? [21:26:18] oh [21:26:19] that [21:26:20] yes [21:26:42] i wanted to get that answered up front for him [21:27:01] so we can map out how the upgrade project will go [21:27:23] true [21:27:33] oh which reminds me, we're going to need to stage the data migration because that's the thing that's going to hurt and cause downtime during the actual upgrade [21:27:34] I don't necessarily have a problem with private data in labs [21:28:01] as long as everyone in the project knows that they can't add other project members [21:28:09] and all cloudadmins also know this [21:28:21] right--but there's also the question of org policy on volunteers/contractors and private data [21:28:22] and that firewall rules are properly made [21:28:26] yeah [21:29:09] how closely can we make a vm behave like a db machine? 
[21:29:21] as close as you'd like [21:29:32] it's a vm, though [21:29:36] so performance will be shit [21:29:48] right, and that's an important part of the test [21:29:50] you'll also need to use /mnt rather than /a [21:30:01] we do have some db hardware that's supposed to be for labs [21:30:04] basically we're going to need to know how many hours we're killing OTRS or whatever [21:30:11] there's no rackspace for it right now though :( [21:30:33] ok [21:30:49] i'll need to study and weigh some options there, then [21:31:15] might be better to do the test tranforms on the otrs slave db to get the timings measured [21:31:39] well, if we can get the db hardware racked in time, you can use that [21:31:43] ok [21:32:01] eventually we need to make that multi-project too, but that's a project all in itself [21:32:17] for now we'll just make dbs manually [21:32:22] * Ryan_Lane shudders [21:33:15] make dbs on a mysql instance? [21:33:38] nah. on real hardware [21:33:53] just wait until people start asking for meaningful test dataset [21:33:58] er datasets [21:34:14] we already have a bunch of people requesting dbs on real hardware [21:34:26] yeah [21:40:36] woosters: another weird one. not sure how important: https://rt.wikimedia.org/Ticket/Display.html?id=2478 [21:40:37] [21:41:30] it's because we stuck varnish in front, and it isn't varying for mobile [21:42:00] binasher: ^^ [21:42:24] I'm not sure how we're doing so for the normal mobile site, but would it be possible for wordpress as well? [21:43:41] the wordpress mobile ext would have to set the write headers to vary between the two, if the urls aren't different [21:43:58] what's an example url to get the mobile version? [21:44:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:16] I don't think the url is different [21:44:38] I think it detects a mobile browser and sends out different content [21:47:13] binasher: where can i check to see if my varnish language failed ? [21:49:27] LeslieCarr: ok, i'm running a puppet -t on cp1042 [21:49:55] oh, i'm running as well :) if it overlaps feel free to kill mine [21:50:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.383 seconds [21:52:14] LeslieCarr: now missing a ) :( [21:53:06] binasher: what line ? [21:53:54] line 89 [21:54:24] its the line above the line you change in the last commit.. i was staring at the highlighted portion and not seeing it [21:55:15] ahha [21:55:39] New patchset: Lcarr; "Fixing ")" for wap redirection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2627 [21:56:18] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2627 [21:56:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2627 [21:59:05] binasher: i think it looks good now, trying the dns change again :) [21:59:29] RECOVERY - Varnish HTTP mobile-frontend on cp1042 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.062 seconds [22:00:19] LeslieCarr: yup it looks good - i sent a test en.wap.wikipedia.org request to varnish and got back a page instead of a redirect [22:00:58] cool :) [22:02:53] notpeter, i have a bit of time now, so here is this machine causing a problem? [22:03:05] LeslieCarr: thanks for your merge yesterday (was on a jenkins class IIRC) [22:03:11] yep [22:03:12] yw [22:17:18] rainman-sr: hi! 
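On the blog ticket (RT 2478) discussed above: with varnish now in front, desktop and mobile visitors share one cache object unless something varies the cache key. A sketch of the usual pattern, assuming a hypothetical X-Device header; the real fix depends on what headers the WordPress mobile plugin can emit:

    sub vcl_recv {
        # classify the client before the cache lookup (regex deliberately crude)
        if (req.http.User-Agent ~ "(?i)mobile|wap|symbian|up\.browser") {
            set req.http.X-Device = "mobile";
        } else {
            set req.http.X-Device = "desktop";
        }
    }

    # the backend (the WordPress plugin) must then answer with
    #   Vary: X-Device
    # so varnish stores separate desktop and mobile copies of each URL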
[22:17:25] hello [22:17:39] so, if you take a look at search1001 [22:17:46] (you should have an account [22:17:48] ) [22:18:29] does it look properly setup? [22:18:45] (it's currently serving up eswiki search, as that one is not sharded) [22:19:03] cannot log in, asks me for password [22:19:15] hrm, lemme check that out [22:20:02] ah, puppet has not put your key there yet. I shall do so by hand. one second [22:21:11] puppetd -t not working to put the key in notpeter ? [22:21:37] LeslieCarr: no. but that might be a position of the moon thing [22:21:37] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:22:11] rainman-sr: ok, give it a shot now [22:22:28] nop, still the same [22:22:34] does it have home mounted? [22:22:47] i access via my public ssh key [22:22:59] no. I'm trying to make search not reliant on nfs [22:23:27] !log restarting pdns ns2 [22:23:29] Logged the message, Master [22:23:30] well then i won't have access unless you copy my keys from /home/rainman/.ssh [22:23:38] that's what I just did [22:24:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:24:10] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.115 seconds response time. www.wikipedia.org returns 208.80.154.225 [22:24:12] hmm well still prompts me for password [22:24:37] and you can log into fenari ok? [22:26:34] !log adding search15 to search-pool2 lvs vip [22:26:36] Logged the message, Master [22:29:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.662 seconds [22:29:58] rainman-sr: arg. ok. sorry. try again, plz [22:30:02] should work now [22:30:10] ah works now [22:30:20] sorry about that [22:31:45] no problems [22:32:24] so you set indexer to searchidx2.ptmpta and it doesn't work? [22:32:31] can you do it now? [22:32:31] correct [22:32:38] yes [22:32:42] that is the case currently [22:33:00] and the error in question is in the log file [22:33:21] confs are in /a/search/conf at this poitn [22:33:24] *point [22:33:30] yep [22:33:36] and you restarted the deamon? [22:34:01] no. just did so [22:34:01] done [22:34:16] there we go [22:34:32] Error invoking remote method getIndexTimestamp() on host searchidx2.pmtpa.wmnet : Unknown host: searchidx2 [22:34:52] hmmm [22:35:14] !log db1035 died 2 days ago, attempting to power cycle [22:35:16] Logged the message, Master [22:35:25] New patchset: Bhartshorne; "trying Tim's suggestion of abandoning chunked-encoding for swift puts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2628 [22:35:34] also, same error if IP of indexer is used [22:36:20] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2628 [22:36:20] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2628 [22:38:14] notpeter, well that is strange. [22:38:18] can you put logging on debug? [22:38:37] New patchset: Hashar; "puppet local linter!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2629 [22:39:10] RECOVERY - Host db1035 is UP: PING OK - Packet loss = 0%, RTA = 27.08 ms [22:39:39] LeslieCarr: Isn't https://gerrit.wikimedia.org/r/#change,2629 something you already wrote? 
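On the earlier question of how the stafford check can report "HTTP OK HTTP/1.1 400 Bad Request": the most likely reason (an assumption, since the actual nagios command isn't shown in the log) is that the service uses check_http's -e/--expect option. With -e, the plugin only matches the given string against the response status line and skips the normal 3xx/4xx/5xx handling, so a bare 400 from the puppetmaster still counts as a successful probe:

    # illustrative check command, not the real definition from the nagios config
    /usr/lib/nagios/plugins/check_http -H stafford.pmtpa.wmnet -p 8140 --ssl -e 'HTTP/1.1'
    # -> HTTP OK HTTP/1.1 400 Bad Request - ...  (OK because the status line contains "HTTP/1.1")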
[22:40:04] yeah, it's in the base branch called puppet-lint [22:40:11] hashar: ---^^ [22:40:12] but it's pretty basic, if this is nicer, then yay [22:40:40] rainman-sr: gave you ownershit of configs [22:40:46] *ownership [22:41:16] can i also restart the deamon? [22:41:29] or can you do it [22:42:01] PROBLEM - mysqld processes on db1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:42:17] it's running as the lsearch user, so you might be able to? [22:42:17] otherise I can [22:42:20] is now restarted [22:42:36] ok thanks, i got what i needed [22:42:52] Calling getIndexTime([eswiki]) on searchidx2.pmtpa.wmnet [22:43:41] that means that on lucene end everything is fine [22:43:50] it seems there is some java problem with resolving that address [22:43:58] i.e. from java api [22:44:12] because after that debug message i'm just calling java api to get a remote RMI registry [22:45:49] actually to call RMI on that host [22:47:16] RECOVERY - mysqld processes on db1035 is OK: PROCS OK: 1 process with command name mysqld [22:48:38] hrm, alright. what do you think a reasonable workaround would be? [22:48:56] an /etc/hosts hack works, but might there be something better? [22:51:10] PROBLEM - MySQL Slave Running on db1035 is CRITICAL: CRIT replication Slave_IO_Running: No Slave_SQL_Running: No Last_Error: Rollback done for prepared transaction because its XID was not in the [22:51:55] rainman-sr: they are running slightly different versions of java. it seems unlikely, but could that be an issue? [22:53:20] notpeter, i doubt it [22:53:49] yeah. seems unlikely [22:55:22] well i guess googling around, maybe finding some kind of java app to test if the name resolving works properly [22:55:43] i had it before that java is a bit stupid in resolving names [22:55:49] sometimes it cannot even get hostname right [22:57:49] ok, cool, I can look into this more [22:58:12] my other big question is: what packages are needed for the indexer to work properly? [22:58:40] in addition to whatever would be present on any other search node [23:03:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:04:13] RECOVERY - MySQL Slave Running on db1035 is OK: OK replication [23:04:43] !log db1035 is fubar after crashing during schema migrations, running a hotbackup from db1019 [23:04:45] Logged the message, Master [23:06:55] PROBLEM - mysqld processes on db1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:08:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.934 seconds [23:09:14] notpeter, not sure offhand, i guess only curl [23:10:27] !log truncated 4 tables on db40 [23:10:28] Logged the message, Master [23:10:52] rainman-sr: anything for language support? [23:11:12] or, character support, I suppose [23:14:49] notpeter, nope, i don't think so. it needs access to mediawiki for message files to read localization, and to initialisesettings to read some global configs [23:42:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.844 seconds [23:54:19] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [23:57:19] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours
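For the "Unknown host: searchidx2" RMI failure discussed above, a tiny resolver test of the kind rainman-sr suggests shows what the JVM itself can and cannot resolve. Note that the error names the short hostname even though the config uses the FQDN, which hints the RMI stub handed back by the registry carries an unqualified name — that would also explain why connecting by IP fails and why the /etc/hosts hack works. The class below is just that throwaway test, not part of lucene-search, and the -D fix mentioned in the comment is an assumption rather than something confirmed in the log:

    import java.net.InetAddress;

    public class ResolveTest {
        public static void main(String[] args) throws Exception {
            // try both the name the daemon is configured with and the one in the error
            for (String host : new String[] { "searchidx2.pmtpa.wmnet", "searchidx2" }) {
                try {
                    System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
                } catch (Exception e) {
                    System.out.println(host + " -> " + e);
                }
            }
        }
    }
    // if the short name is the one that fails, adding it to /etc/hosts (the hack above) or having
    // the indexer advertise itself with -Djava.rmi.server.hostname=searchidx2.pmtpa.wmnet are the
    // usual workarounds.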