[00:01:06] Saibo: https://test2.wikipedia.org/wiki/Special:UnreviewedPages [00:01:50] hexmode: can I get editor/reviewer privileges on test2 also? [00:02:16] chrismcmahon: I can't give reviewer, but editor, sure [00:02:28] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:02:30] hexmode: I'll take what I can get, thanks :) [00:02:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:02:57] chrismcmahon: wiki user? [00:03:33] Reedy: gadgets should show up in preferences after -definitions is updated right? [00:03:37] hexmode: I'm using "ChrisMcMähon" as a disposable account [00:03:48] Sometimes needs piurg [00:03:50] purge [00:03:56] note the umlaut [00:03:57] lol @ ä [00:04:27] from the little-known German branch of the McMahon family. [00:04:41] you got it now [00:04:45] thanks [00:05:38] https://test2.wikipedia.org/wiki/Special:RecentChanges deletion log MediaWiki:Edittools shows up in blue although it is deleted and has no default content?! [00:05:55] scrap that.. [00:06:02] "" is default content ;) [00:07:35] Reedy: purged and still nothing.... going to trim -definition [00:07:52] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.678 seconds [00:11:06] Reedy: trimmed, purged, and... nothing :( [00:11:09] !log reedy synchronized php-1.19/extensions/Contest/Contest.php 'revert live hack' [00:11:12] Logged the message, Master [00:11:18] * hexmode is getting ready to file a bug [00:11:49] What are you trying to do? 
[00:12:19] edit MediaWiki:Gadgets-definition to get gadgets to show up in prefs [00:12:43] I saw this on beta but put it down to squid [00:13:33] * AaronSchulz reads swift docs [00:14:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:15:44] https://answers.launchpad.net/swift/+question/160673 [00:16:10] it's probably bs that rewrite.py is using chunked-transfer btw ;) [00:16:10] hexmode: could you run the revision report again? replag is only an hour [00:16:23] (which is less than our rev report lag) [00:17:02] just ran it... checking [00:17:14] maybe I screwed up the script [00:17:41] nope, much better [00:17:54] * Reedy tests locally [00:20:04] maplebed: seems that etag should work, worst case if it doesn't is that we check the response ETag of the put and delete if they don't match [00:20:32] (PUTs give etag in the response) [00:21:15] maplebed: btw, thumb-handler is rooted up :) [00:21:25] makes sense. [00:21:25] hexmode: works fine locally [00:21:34] so it's seemingly some cache issue [00:21:57] still, needs a bug report [00:22:00] I'll do it [00:22:02] yeah [00:22:16] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.671 seconds [00:22:17] but if it didn't work at all, it'd be specifically a gadget bug [00:22:46] meh, I should prolly change the svn one anyway [00:22:54] kind of the point, right? [00:23:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.185 seconds [00:24:49] first error: includePage is not defined - probably a long deprecated function ;) [00:24:50] hexmode: this feels like something Max fixed recently... [00:25:14] But maybe not for this [00:26:05] wtf? includeScript is not defined [00:26:06] Also, Special:Gadgets needs an "export all these gadgets" that includes MW:Gadget-definitions [00:26:10] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:26:16] New patchset: Bhartshorne; "adding etag awareness to abort failed puts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2598 [00:26:18] Saibo: again [00:26:20] ? [00:26:26] def WTF [00:26:26] AaronSchulz: https://gerrit.wikimedia.org/r/2598 [00:26:31] wait.. that is strange.. :D [00:26:54] I have a includescript in a script and I get an error on the error console. [00:27:00] AaronSchulz: is there any way we can test thumb_handler before throwing it on ms5? [00:27:09] hexmode: but: I have other includescripts in my monobook.js - they work [00:27:14] ..let me check.. [00:27:31] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:27:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:56] hexmode: oh.. it is not includeScript but importScript .. doh! ;) That happened when I tried to fix the "includePage" [00:27:56] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2598 [00:28:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:28:54] Saibo: sounds too much like https://bugzilla.wikimedia.org/show_bug.cgi?id=34147 [00:29:47] well.. yes, it seems "includePage" is not supported anymore. However, I still had it in a script - which failed.. [00:30:07] oh... ok [00:30:33] but that is "okay" - afaik that is deprecated for very long and easy to replace by importScript [00:30:54] maplebed: I want to read some swift code first [00:32:34] hexmode: found the bug [00:32:44] AaronSchulz: if we push your change to ms5, we can test this code on the eqiad cluster. [00:32:45] hexmode: it's not getting updated in memc when saving [00:32:48] seems to work ok locally [00:32:50] Reedy: :) [00:32:57] maplebed: ok [00:32:58] ... oh wait.. we can test it in tampa since swift is out of service. [00:32:59] so nevermind. 
[00:33:19] it would be funny if the etag was for each chunk, in which case everything would always fail [00:33:23] Reedy: should I still file the bug so you can have something to close? [00:33:56] yeah, sure [00:34:06] I might have to leave it to max [00:34:07] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:34:15] good, going to do that right now [00:34:22] was busy with gadget import [00:35:10] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:36:40] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.238 seconds [00:37:47] reedy, filed, adding maxsem as CC [00:38:05] AaronSchulz: deploying the new thumb_handler to ms5 now. [00:38:30] !log deployed new thumb_handler.php with ETag header added in to ms5 [00:38:33] Logged the message, Master [00:38:48] AaronSchulz: tcpdump confirms the headers are present. [00:39:03] yep, seem them in FF [00:40:24] * hexmode decides to write a gadget copier using the API [00:41:48] AaronSchulz: actually... I don't think it's working right. [00:41:57] orly? [00:42:02] * maplebed runs curl [00:42:42] hey whadya know, so they are. [00:42:44] \o/ [00:43:01] I see no appreciable change to ms5's load. [00:43:25] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:43:59] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2598 [00:44:00] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2598 [00:44:46] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.537 seconds [00:47:08] hexmode: if you are finished with your gadget copier ... WikEd would also be interesting ;) [00:47:38] Saibo: I'm a slow typist... 
I'm not done yet [00:47:43] :D [00:47:49] plus, things must be done the right way :P [00:48:02] * hexmode tries for quick-n-dirty in any case [00:49:13] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:50:43] New patchset: Bhartshorne; "typoed semicolon should be comma" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2599 [00:51:16] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2599 [00:51:16] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2599 [00:51:37] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.341 seconds [00:53:05] New patchset: Bhartshorne; "yay more typos boo no lint checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2600 [00:53:32] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2600 [00:53:33] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2600 [00:53:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.549 seconds [00:55:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:56:39] AaronSchulz: I've deployed the etag-aware stuff to swift (prod cluster); let's see what happens. [00:58:58] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:59:31] hexmode: okay, found already two of our old scripts which do not work (probably due to the screenscraping... ;-) ) - maybe I can fix them until the deployment [00:59:50] It would have been nice to have this test opportunity a bit earlier [01:00:11] but the beta installation had enough problems of its own ;) [01:00:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.275 seconds [01:02:16] oh.. only 1 script.. 
the other was failing due to the different name of the file namespace [01:02:22] :) [01:05:25] Saibo: earlier testing opportunities is something I will be looking into fairly soon. [01:05:46] maplebed: I'm trying to see if etags just give 422s or actually delete stuff [01:05:58] AaronSchulz: robla I can't recreate the broken thumbnail anymore! [01:06:07] chrismcmahon: nice [01:07:04] maplebed: excellent! [01:07:31] when I make a connection to swift and abort early, the thumb never appears in swift (but does appear in ms5) [01:07:38] chrismcmahon: fyi...we're messing with thumbnail generation here [01:08:07] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.740 seconds [01:08:15] maplebed: that's pretty much what we want. great! [01:08:27] so... make swift go live again? [01:08:31] :) [01:08:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:08:52] well, I'm trying to think of why not [01:09:06] existing bad images. [01:09:38] !log reedy synchronized php-1.19/extensions/Gadgets/Gadgets_body.php [01:09:40] Logged the message, Master [01:13:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:14:43] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.769 seconds [01:16:36] New patchset: Asher; "my fork of gdash from git://github.com/asher/gdash.git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601 [01:18:46] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:18:51] !log reedy synchronized php-1.19/extensions/Gadgets/Gadgets_body.php [01:18:53] Logged the message, Master [01:19:43] New patchset: Asher; "my fork of gdash from git://github.com/asher/gdash.git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601 [01:20:11] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2601 [01:20:12] Change merged: Asher; 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601 [01:20:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.528 seconds [01:21:28] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 606s [01:21:55] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 635s [01:22:31] PROBLEM - MySQL replication status on db1025 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 673s [01:24:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:25:22] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.128 seconds [01:28:31] PROBLEM - Puppet freshness on carbon is CRITICAL: Puppet has not run in the last 10 hours [01:28:40] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.021 seconds [01:29:16] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:39:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:42:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.487 seconds [01:42:37] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [01:42:55] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [01:46:13] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.094 seconds [01:46:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:35] RECOVERY - MySQL replication status on db1025 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [01:49:04] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.561 seconds [01:54:10] PROBLEM - Mobile WAP site on 
ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:54:37] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 609s [01:54:46] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 617s [01:56:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.534 seconds [01:57:10] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:00:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:02:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.873 seconds [02:04:40] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.704 seconds [02:08:43] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:14] TimStarling: I just sent out an update to wikitech-l about the thumbnail situation. [02:09:31] ok [02:09:37] I'm about to take off for tonight here, but I figured I'd check in now [02:10:12] I think 1.19 is ok to go. 
There's some code review and backporting that we should get to, but I guess there's no drop-dead emergenies [02:10:20] emergencies I mean [02:10:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:10:31] that's good [02:10:41] I'm about to reply about this HTCP purge task [02:10:47] ah, cool [02:13:58] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.664 seconds [02:17:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:18:04] !log LocalisationUpdate completed (1.18) at Wed Feb 15 02:18:04 UTC 2012 [02:18:04] !log LocalisationUpdate failed (1.19) at Wed Feb 15 02:18:04 UTC 2012 [02:18:06] Logged the message, Master [02:18:08] Logged the message, Master [02:24:28] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 243 seconds [02:25:49] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.344 seconds [02:29:53] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:55] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [02:35:19] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 217 seconds [02:37:47] !log on kaulen: re-enabled jsonrpc.cgi and reduced MaxClients from 500 to 100 [02:37:49] Logged the message, Master [02:41:55] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:42:22] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:47:10] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [02:54:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.201 seconds [02:54:58] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 239 seconds [02:58:07] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [03:12:45] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [03:12:45] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [03:12:45] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [03:12:45] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [03:12:45] PROBLEM - Host amssq60 is DOWN: PING CRITICAL - Packet loss = 100% [03:12:46] PROBLEM - Host amssq45 is DOWN: PING CRITICAL - Packet loss = 100% [03:12:46] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100% [03:12:47] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [03:13:03] PROBLEM - Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100% [03:13:03] PROBLEM - Host amssq58 is DOWN: PING CRITICAL - Packet loss = 100% [03:13:03] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [03:13:03] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [03:13:03] PROBLEM - Host bits.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [03:13:12] PROBLEM - Host cp3002 is DOWN: PING CRITICAL - Packet loss = 100% [03:13:39] PROBLEM - Host cp3001 is DOWN: PING CRITICAL - Packet loss = 100% [03:13:39] PROBLEM - Host br1-knams is DOWN: PING CRITICAL - Packet loss = 100% [03:13:48] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [03:13:57] PROBLEM - Host bits.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [03:14:15] PROBLEM - Host knsq17 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:15] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:15] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:15] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:15] PROBLEM - Host knsq21 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:16] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:16] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% 
[03:14:17] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:17] PROBLEM - Host csw2-esams is DOWN: PING CRITICAL - Packet loss = 100% [03:14:18] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 100% [03:14:24] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100% [03:14:25] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [03:14:34] monitoring issue, or actually down? [03:14:42] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:51] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:51] PROBLEM - Host knsq22 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:51] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100% [03:14:51] PROBLEM - Host knsq25 is DOWN: PING CRITICAL - Packet loss = 100% [03:15:09] dunno, looking [03:15:09] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [03:15:09] PROBLEM - Host foundation-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [03:15:18] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100% [03:15:18] PROBLEM - Host mediawiki-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [03:15:18] PROBLEM - Host lily is DOWN: PING CRITICAL - Packet loss = 100% [03:15:23] definitely can't talk to esams from pmtpa [03:15:26] ah. 
can now [03:15:27] RECOVERY - Host bits.esams.wikimedia.org is UP: PING WARNING - Packet loss = 73%, RTA = 115.55 ms [03:15:27] RECOVERY - Host cp3001 is UP: PING WARNING - Packet loss = 73%, RTA = 121.22 ms [03:15:27] RECOVERY - Host knsq24 is UP: PING WARNING - Packet loss = 28%, RTA = 120.16 ms [03:15:27] RECOVERY - Host knsq27 is UP: PING WARNING - Packet loss = 28%, RTA = 117.23 ms [03:15:27] RECOVERY - Host amssq52 is UP: PING WARNING - Packet loss = 28%, RTA = 119.56 ms [03:15:27] RECOVERY - Host amssq58 is UP: PING WARNING - Packet loss = 28%, RTA = 119.56 ms [03:15:27] RECOVERY - Host amssq49 is UP: PING WARNING - Packet loss = 28%, RTA = 125.82 ms [03:15:28] RECOVERY - Host maerlant is UP: PING WARNING - Packet loss = 66%, RTA = 115.94 ms [03:15:28] RECOVERY - Host ms6 is UP: PING OK - Packet loss = 16%, RTA = 116.82 ms [03:15:35] packet loss there [03:15:36] RECOVERY - Host amssq60 is UP: PING OK - Packet loss = 0%, RTA = 123.57 ms [03:15:36] RECOVERY - Host amssq59 is UP: PING OK - Packet loss = 0%, RTA = 117.85 ms [03:15:36] RECOVERY - Host cp3002 is UP: PING OK - Packet loss = 0%, RTA = 117.40 ms [03:15:36] RECOVERY - Host hooft is UP: PING OK - Packet loss = 0%, RTA = 123.60 ms [03:15:36] RECOVERY - Host knsq21 is UP: PING OK - Packet loss = 0%, RTA = 117.64 ms [03:15:37] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 0%, RTA = 117.62 ms [03:15:37] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 117.93 ms [03:15:38] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 123.56 ms [03:15:38] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 123.70 ms [03:15:39] RECOVERY - Host knsq20 is UP: PING OK - Packet loss = 0%, RTA = 117.63 ms [03:15:39] RECOVERY - Host knsq26 is UP: PING OK - Packet loss = 0%, RTA = 117.54 ms [03:15:40] RECOVERY - Host knsq23 is UP: PING OK - Packet loss = 0%, RTA = 117.87 ms [03:15:40] RECOVERY - Host knsq29 is UP: PING OK - Packet loss = 0%, RTA = 117.70 ms [03:15:41] 
RECOVERY - Host knsq18 is UP: PING OK - Packet loss = 0%, RTA = 123.21 ms [03:15:41] RECOVERY - Host knsq17 is UP: PING OK - Packet loss = 0%, RTA = 123.59 ms [03:15:42] RECOVERY - Host foundation-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 123.50 ms [03:15:42] RECOVERY - Host csw2-esams is UP: PING OK - Packet loss = 0%, RTA = 125.50 ms [03:15:53] well. that answers that [03:15:54] RECOVERY - Host knsq28 is UP: PING OK - Packet loss = 0%, RTA = 126.84 ms [03:15:54] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 117.08 ms [03:15:54] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 120.42 ms [03:15:54] RECOVERY - Host amssq45 is UP: PING OK - Packet loss = 0%, RTA = 120.76 ms [03:16:03] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 114.12 ms [03:16:03] RECOVERY - Host knsq22 is UP: PING OK - Packet loss = 0%, RTA = 113.45 ms [03:16:03] RECOVERY - Host knsq16 is UP: PING OK - Packet loss = 0%, RTA = 120.75 ms [03:16:03] RECOVERY - Host knsq25 is UP: PING OK - Packet loss = 0%, RTA = 113.59 ms [03:16:08] screwed up route? [03:16:12] RECOVERY - Host br1-knams is UP: PING OK - Packet loss = 0%, RTA = 114.67 ms [03:16:25] i'm checking out the routers... [03:16:30] RECOVERY - Host mediawiki-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 114.38 ms [03:16:39] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 114.41 ms [03:16:57] * maplebed ignores nagios-wm [03:17:08] I've got free hands if someone wants to direct them. [03:17:18] there's nothing actually broken [03:17:23] check ganglia [03:17:40] pmtpa couldn't reach esams, so monitoring went crazy [03:17:49] oh, this is all amsterdam? [03:17:49] i'm trying to figure out why..... [03:17:50] of course, that's a problem, but it seems to be gone now [03:17:53] hm. [03:18:01] maybe the pond snapped a fiber. [03:18:06] transit? 
[03:18:09] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.458 seconds [03:18:16] or someone temporarily fucked up a route [03:18:20] yeah [03:18:27] it happens occasionally [03:18:37] so none of our links went down [03:18:40] packet loss the opposite way often causes us problems [03:18:42] i would guess fucked up route somewhere [03:18:48] indeed [03:18:53] links or bgp sessions [03:19:21] RECOVERY - Host bits.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 113.46 ms [03:20:33] RECOVERY - Host foundation-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 119.19 ms [03:20:42] RECOVERY - Host lily is UP: PING OK - Packet loss = 0%, RTA = 113.58 ms [03:20:45] well it looks like everything is good again [03:20:51] other than our phones blowing up [03:21:33] heh [03:21:35] yep [03:21:38] jesus [03:21:51] looks like we didn't actually have a problem, according to ganglia [03:21:55] we're going over telia the whole way [03:21:56] someday getting that "no pages when you're asleep" thing working would be awesome [03:21:59] cool [03:22:04] stupid internets! [03:22:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:22:50] and ekrem is a known issue - it's still slow... 
[03:23:04] signing off [03:23:09] same [03:23:12] * maplebed goes byebye [03:23:42] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.909 seconds [03:24:03] gone [03:24:11] and turning off phone so I can get the rest of my sleep [03:24:18] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [03:24:26] +1 apergos [03:27:45] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:21] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 231 seconds [03:36:09] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 194 seconds [03:38:24] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.004 seconds [03:38:24] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.259 seconds [03:42:27] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:42:36] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:57:00] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 29 seconds [04:00:09] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.278 seconds [04:01:03] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 272 seconds [04:05:21] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 3.772 seconds [04:08:30] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [04:08:48] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:39] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:14:57] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 253 seconds [04:17:03] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.660 seconds [04:22:27] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [04:26:12] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.452 seconds [04:30:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:38:30] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.982 seconds [04:40:00] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.239 seconds [04:42:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:48:06] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:49:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.081 seconds [04:54:51] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.464 seconds [04:55:54] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:58:54] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:00:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.695 seconds [05:04:58] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.957 seconds [05:11:34] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.087 seconds [05:20:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:21:01] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:21:55] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.551 seconds [05:22:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.029 seconds [05:27:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:49] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [05:29:07] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.780 seconds [05:31:22] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.928 seconds [05:52:49] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 1 seconds [05:58:16] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 234 seconds [06:37:07] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:07] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:38:19] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 0.960 seconds [06:38:19] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.933 seconds [06:56:19] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 5 seconds [07:00:13] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 239 seconds [07:17:01] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [07:23:38] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 260 seconds [07:47:28] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:50:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [07:51:04] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [07:53:46] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection refused [07:54:58] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [07:56:55] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [07:57:13] PROBLEM - Lucene on search6 is CRITICAL: Connection refused [08:02:55] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [08:03:49] RECOVERY - Lucene on search6 is OK: TCP OK - 0.014 second response time on port 8123 [08:21:04] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [08:22:16] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 6 seconds [08:28:43] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 235 seconds [08:29:01] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [08:39:04] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours [08:39:04] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [08:48:58] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [08:59:12] nagios-wm bot [08:59:49]