[00:01:06] Saibo: https://test2.wikipedia.org/wiki/Special:UnreviewedPages
[00:01:50] hexmode: can I get editor/reviewer privileges on test2 also?
[00:02:16] chrismcmahon: I can't give reviewer, but editor, sure
[00:02:28] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:30] hexmode: I'll take what I can get, thanks :)
[00:02:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:57] chrismcmahon: wiki user?
[00:03:33] Reedy: gadgets should show up in preferences after -definitions is updated right?
[00:03:37] hexmode: I'm using "ChrisMcMähon" as a disposable account
[00:03:48] Sometimes needs piurg
[00:03:50] purge
[00:03:56] note the umlaut
[00:03:57] lol @ ä
[00:04:27] from the little-known German branch of the McMahon family.
[00:04:41] you got it now
[00:04:45] thanks
[00:05:38] https://test2.wikipedia.org/wiki/Special:RecentChanges deletion log MediaWiki:Edittools shows up in blue although it is deleted and has no default content?!
[00:05:55] scrap that..
[00:06:02] "" is default content ;)
[00:07:35] Reedy: purged and still nothing.... going to trim -definition
[00:07:52] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.678 seconds
[00:11:06] Reedy: trimmed, purged, and... nothing :(
[00:11:09] !log reedy synchronized php-1.19/extensions/Contest/Contest.php 'revert live hack'
[00:11:12] Logged the message, Master
[00:11:18] * hexmode is getting ready to file a bug
[00:11:49] What are you trying to do?
[00:12:19] edit MediaWiki:Gadgets-definition to get gadgets to show up in prefs
[00:12:43] I saw this on beta but put it down to squid
[00:13:33] * AaronSchulz reads swift docs
[00:14:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:15:44] https://answers.launchpad.net/swift/+question/160673
[00:16:10] it's probably bs that rewrite.py is using chunked-transfer btw ;)
[00:16:10] hexmode: could you run the revision report again? replag is only an hour
[00:16:23] (which is less than our rev report lag)
[00:17:02] just ran it... checking
[00:17:14] maybe I screwed up the script
[00:17:41] nope, much better
[00:17:54] * Reedy tests locally
[00:20:04] maplebed: seems that etag should work, worst case if it doesn't is that we check the response ETag of the put and delete if they don't match
[00:20:32] (PUTs give etag in the response)
[00:21:15] maplebed: btw, thumb-handler is rooted up :)
[00:21:25] makes sense.
[00:21:25] hexmode: works fine locally
[00:21:34] so it's seemingly some cache issue
[00:21:57] still, needs a bug report
[00:22:00] I'll do it
[00:22:02] yeah
[00:22:16] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.671 seconds
[00:22:17] but if it didn't work at all, it'd be specifically a gadget bug
[00:22:46] meh, I should prolly change the svn one anyway
[00:22:54] kind of the point, right?
[00:23:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.185 seconds
[00:24:49] first error: includePage is not defined - probably a long deprecated function ;)
[00:24:50] hexmode: this feels like something Max fixed recently...
[00:25:14] But maybe not for this
[00:26:05] wtf? includeScript is not defined
[00:26:06] Also, Special:Gadgets needs an "export all these gadgets" that includes MW:Gadget-definitions
[00:26:10] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:16] New patchset: Bhartshorne; "adding etag awareness to abort failed puts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2598
[00:26:18] Saibo: again
[00:26:20] ?
[00:26:26] def WTF
[00:26:26] AaronSchulz: https://gerrit.wikimedia.org/r/2598
[00:26:31] wait.. that is strange.. :D
[00:26:54] I have a includescript in a script and I get an error on the error console.
[00:27:00] AaronSchulz: is there any way we can test thumb_handler before throwing it on ms5?
[00:27:09] hexmode: but: I have other includescripts in my monobook.js - they work
[00:27:14] ..let me check..
[00:27:31] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[00:27:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:56] hexmode: oh.. it is not includeScript but importScript .. doh! ;) That happend when I tried to fix the "includePage"
[00:27:56] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2598
[00:28:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:28:54] Saibo: sounds too much like https://bugzilla.wikimedia.org/show_bug.cgi?id=34147
[00:29:47] well.. yes, it seems "includePage" is not supported anymore. However, I still had it in a script - which failed..
[00:30:07] oh... ok
[00:30:33] but that is "okay" - afaik that is deprecated for very long an easy to replace by importScript
[00:30:54] maplebed: I want to read some swift code first
[00:32:34] hexmode: found the bug
[00:32:44] AaronSchulz: if we push your change to ms5, we can test this code on the eqiad cluster.
[00:32:45] hexmode: it's not getting updated in memc when saving
[00:32:48] seems to work ok locally
[00:32:50] Reedy: :)
[00:32:57] maplebed: ok
[00:32:58] ... oh wait.. we can test it in tampa since swift is out of service.
[00:32:59] so nevermind.
[00:33:19] it would be funny if the etag was for each chunk, in which case everything would always fail
[00:33:23] Reedy: should I still file the bug so you can have something to close?
[00:33:56] yeah, sure
[00:34:06] I might have to leave it to max
[00:34:07] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:34:15] good, going to do that right now
[00:34:22] was busy with gadget import
[00:35:10] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[00:36:40] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.238 seconds
[00:37:47] reedy, filed, adding maxsem as CC
[00:38:05] AaronSchulz: deploying the new thumb_handler to ms5 now.
[00:38:30] !log deployed new thumb_handler.php with ETag header added in to ms5
[00:38:33] Logged the message, Master
[00:38:48] AaronSchulz: tcpdump confirms the headers are present.
[00:39:03] yep, seem them in FF
[00:40:24] * hexmode decides to write a gadget copier using the API
[00:41:48] AaronSchulz: actually... I don't think it's working right.
[00:41:57] orly?
[00:42:02] * maplebed runs curl
[00:42:42] hey whadya know, so they are.
[00:42:44] \o/
[00:43:01] I see no appreciable change to ms5's load.
[00:43:25] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:43:59] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2598
[00:44:00] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2598
[00:44:46] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.537 seconds
[00:47:08] hexmode: if you are finished with your gadget copier ... WikEd would also be interesting ;)
[00:47:38] Saibo: I'm a slow typist... I'm not done yet
[00:47:43] :D
[00:47:49] plus, things must be done the right way :P
[00:48:02] * hexmode tries for quick-n-dirty in any case
[00:49:13] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:50:43] New patchset: Bhartshorne; "typoed semicolon should be comma" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2599
[00:51:16] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2599
[00:51:16] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2599
[00:51:37] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.341 seconds
[00:53:05] New patchset: Bhartshorne; "yay more typos boo no lint checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2600
[00:53:32] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2600
[00:53:33] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2600
[00:53:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.549 seconds
[00:55:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:56:39] AaronSchulz: I've deployed the etag-aware stuff to swift (prod cluster); let's see what happens.
[00:58:58] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:59:31] hexmode: okay, found already two of our old scripts which do not work (probably due to the screenscraping... ;-) ) - maybe I can fix them until the deployment
[00:59:50] I would have been nice to have this test opportunity a bit earlier
[01:00:11] but the beta installation had enough problems with its own ;)
[01:00:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.275 seconds
[01:02:16] oh.. only 1 script.. the other was failing due to the different name of the file namespace
[01:02:22] :)
[01:05:25] Saibo: earlier testing opportunities is something I will be looking into fairly soon.
[01:05:46] maplebed: I'm trying to see if etags just give 422s or actually delete stuff
[01:05:58] AaronSchulz: robla I can't recreate the broken thumbnail anymore!
[01:06:07] chrismcmahon: nice
[01:07:04] maplebed: excellent!
[01:07:31] when I make a connection to swift and abort early, the thumb never appears in swift (but does appear in ms5)
[01:07:38] chrismcmahon: fyi...we're messing with thumbnail generation here
[01:08:07] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.740 seconds
[01:08:15] maplebed: that's pretty much what we want. great!
[01:08:27] so... make swift go live again?
[01:08:31] :)
[01:08:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:08:52] well, I'm trying to think of why not
[01:09:06] exisiting bad images.
[01:09:38] !log reedy synchronized php-1.19/extensions/Gadgets/Gadgets_body.php
[01:09:40] Logged the message, Master
[01:13:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:14:43] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.769 seconds
[01:16:36] New patchset: Asher; "my fork of gdash from git://github.com/asher/gdash.git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601
[01:18:46] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:18:51] !log reedy synchronized php-1.19/extensions/Gadgets/Gadgets_body.php
[01:18:53] Logged the message, Master
[01:19:43] New patchset: Asher; "my fork of gdash from git://github.com/asher/gdash.git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601
[01:20:11] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2601
[01:20:12] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601
[01:20:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.528 seconds
[01:21:28] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 606s
[01:21:55] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 635s
[01:22:31] PROBLEM - MySQL replication status on db1025 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 673s
[01:24:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:25:22] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.128 seconds
[01:28:31] PROBLEM - Puppet freshness on carbon is CRITICAL: Puppet has not run in the last 10 hours
[01:28:40] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.021 seconds
[01:29:16] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:39:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:42:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.487 seconds
[01:42:37] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[01:42:55] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[01:46:13] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.094 seconds
[01:46:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:47:35] RECOVERY - MySQL replication status on db1025 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[01:49:04] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.561 seconds
[01:54:10] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:54:37] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 609s
[01:54:46] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 617s
[01:56:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.534 seconds
[01:57:10] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:00:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:02:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.873 seconds
[02:04:40] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.704 seconds
[02:08:43] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:09:14] TimStarling: I just sent out an update to wikitech-l about the thumbnail situation.
[02:09:31] ok
[02:09:37] I'm about to take off for tonight here, but I figured I'd check in now
[02:10:12] I think 1.19 is ok to go. There's some code review and backporting that we should get to, but I guess there's no drop-dead emergenies
[02:10:20] emergencies I mean
[02:10:22] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:10:31] that's good
[02:10:41] I'm about to reply about this HTCP purge task
[02:10:47] ah, cool
[02:13:58] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.664 seconds
[02:17:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:18:04] !log LocalisationUpdate completed (1.18) at Wed Feb 15 02:18:04 UTC 2012
[02:18:04] !log LocalisationUpdate failed (1.19) at Wed Feb 15 02:18:04 UTC 2012
[02:18:06] Logged the message, Master
[02:18:08] Logged the message, Master
[02:24:28] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 243 seconds
[02:25:49] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.344 seconds
[02:29:53] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:30:55] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[02:35:19] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 217 seconds
[02:37:47] !log on kaulen: re-enabled jsonrpc.cgi and reduced MaxClients from 500 to 100
[02:37:49] Logged the message, Master
[02:41:55] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:42:22] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:47:10] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[02:54:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.201 seconds
[02:54:58] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 239 seconds
[02:58:07] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:12:45] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:45] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:45] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:45] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:45] PROBLEM - Host amssq60 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:46] PROBLEM - Host amssq45 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:46] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100%
[03:12:47] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:03] PROBLEM - Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:03] PROBLEM - Host amssq58 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:03] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:03] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:03] PROBLEM - Host bits.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:12] PROBLEM - Host cp3002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:39] PROBLEM - Host cp3001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:39] PROBLEM - Host br1-knams is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:48] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:57] PROBLEM - Host bits.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:15] PROBLEM - Host knsq17 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:15] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:15] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:15] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:15] PROBLEM - Host knsq21 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:16] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:16] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:17] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:17] PROBLEM - Host csw2-esams is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:18] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:24] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:25] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:34] monitoring issue, or actually down?
[03:14:42] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:51] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:51] PROBLEM - Host knsq22 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:51] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:51] PROBLEM - Host knsq25 is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:09] dunno, looking
[03:15:09] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:09] PROBLEM - Host foundation-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:18] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:18] PROBLEM - Host mediawiki-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:18] PROBLEM - Host lily is DOWN: PING CRITICAL - Packet loss = 100%
[03:15:23] definitely can't talk to esams from pmtpa
[03:15:26] ah. can now
[03:15:27] RECOVERY - Host bits.esams.wikimedia.org is UP: PING WARNING - Packet loss = 73%, RTA = 115.55 ms
[03:15:27] RECOVERY - Host cp3001 is UP: PING WARNING - Packet loss = 73%, RTA = 121.22 ms
[03:15:27] RECOVERY - Host knsq24 is UP: PING WARNING - Packet loss = 28%, RTA = 120.16 ms
[03:15:27] RECOVERY - Host knsq27 is UP: PING WARNING - Packet loss = 28%, RTA = 117.23 ms
[03:15:27] RECOVERY - Host amssq52 is UP: PING WARNING - Packet loss = 28%, RTA = 119.56 ms
[03:15:27] RECOVERY - Host amssq58 is UP: PING WARNING - Packet loss = 28%, RTA = 119.56 ms
[03:15:27] RECOVERY - Host amssq49 is UP: PING WARNING - Packet loss = 28%, RTA = 125.82 ms
[03:15:28] RECOVERY - Host maerlant is UP: PING WARNING - Packet loss = 66%, RTA = 115.94 ms
[03:15:28] RECOVERY - Host ms6 is UP: PING OK - Packet loss = 16%, RTA = 116.82 ms
[03:15:35] packet loss there
[03:15:36] RECOVERY - Host amssq60 is UP: PING OK - Packet loss = 0%, RTA = 123.57 ms
[03:15:36] RECOVERY - Host amssq59 is UP: PING OK - Packet loss = 0%, RTA = 117.85 ms
[03:15:36] RECOVERY - Host cp3002 is UP: PING OK - Packet loss = 0%, RTA = 117.40 ms
[03:15:36] RECOVERY - Host hooft is UP: PING OK - Packet loss = 0%, RTA = 123.60 ms
[03:15:36] RECOVERY - Host knsq21 is UP: PING OK - Packet loss = 0%, RTA = 117.64 ms
[03:15:37] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 0%, RTA = 117.62 ms
[03:15:37] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 117.93 ms
[03:15:38] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 123.56 ms
[03:15:38] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 123.70 ms
[03:15:39] RECOVERY - Host knsq20 is UP: PING OK - Packet loss = 0%, RTA = 117.63 ms
[03:15:39] RECOVERY - Host knsq26 is UP: PING OK - Packet loss = 0%, RTA = 117.54 ms
[03:15:40] RECOVERY - Host knsq23 is UP: PING OK - Packet loss = 0%, RTA = 117.87 ms
[03:15:40] RECOVERY - Host knsq29 is UP: PING OK - Packet loss = 0%, RTA = 117.70 ms
[03:15:41] RECOVERY - Host knsq18 is UP: PING OK - Packet loss = 0%, RTA = 123.21 ms
[03:15:41] RECOVERY - Host knsq17 is UP: PING OK - Packet loss = 0%, RTA = 123.59 ms
[03:15:42] RECOVERY - Host foundation-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 123.50 ms
[03:15:42] RECOVERY - Host csw2-esams is UP: PING OK - Packet loss = 0%, RTA = 125.50 ms
[03:15:53] well. that answers that
[03:15:54] RECOVERY - Host knsq28 is UP: PING OK - Packet loss = 0%, RTA = 126.84 ms
[03:15:54] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 117.08 ms
[03:15:54] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 120.42 ms
[03:15:54] RECOVERY - Host amssq45 is UP: PING OK - Packet loss = 0%, RTA = 120.76 ms
[03:16:03] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 114.12 ms
[03:16:03] RECOVERY - Host knsq22 is UP: PING OK - Packet loss = 0%, RTA = 113.45 ms
[03:16:03] RECOVERY - Host knsq16 is UP: PING OK - Packet loss = 0%, RTA = 120.75 ms
[03:16:03] RECOVERY - Host knsq25 is UP: PING OK - Packet loss = 0%, RTA = 113.59 ms
[03:16:08] screwed up route?
[03:16:12] RECOVERY - Host br1-knams is UP: PING OK - Packet loss = 0%, RTA = 114.67 ms
[03:16:25] i'm checking out the routers...
[03:16:30] RECOVERY - Host mediawiki-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 114.38 ms
[03:16:39] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 114.41 ms
[03:16:57] * maplebed ignores nagios-wm
[03:17:08] I've got free hands if someone wants to direct them.
[03:17:18] there's nothing actually broken
[03:17:23] check ganglia
[03:17:40] pmtpa couldn't reach esams, so monitoring went crazy
[03:17:49] oh, this is all amsterdam?
[03:17:49] i'm trying to figure out why.....
[03:17:50] of course, that's a problem, but it seems to be gone now
[03:17:53] hm.
[03:18:01] maybe the pond snapped a fiber.
[03:18:06] transit?
[03:18:09] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.458 seconds
[03:18:16] or someone temporarily fucked up a route
[03:18:20] yeah
[03:18:27] it happens occasionally
[03:18:37] so none of our links went down
[03:18:40] packet loss the opposite way often causes us problems
[03:18:42] i would guess fucked up route somewhere
[03:18:48] indeed
[03:18:53] links or bgp sessions
[03:19:21] RECOVERY - Host bits.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 113.46 ms
[03:20:33] RECOVERY - Host foundation-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 119.19 ms
[03:20:42] RECOVERY - Host lily is UP: PING OK - Packet loss = 0%, RTA = 113.58 ms
[03:20:45] well it looks like everything is good again
[03:20:51] other than our phones blowing up
[03:21:33] heh
[03:21:35] yep
[03:21:38] jesus
[03:21:51] looks like we didn't actually have a problem, according to ganglia
[03:21:55] we're going over telia the whole way
[03:21:56] someday getting that "no pages when you're asleep" thing working would be awesome
[03:21:59] cool
[03:22:04] stupid internets!
[03:22:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:22:50] and ekrem is a know issue - it's still slow...
[03:23:04] signing off
[03:23:09] same
[03:23:12] * maplebed goes byebye
[03:23:42] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.909 seconds
[03:24:03] gone
[03:24:11] and turning off phone so I can get the rest of my sleep
[03:24:18] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[03:24:26] +1 apergos
[03:27:45] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:28:21] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 231 seconds
[03:36:09] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 194 seconds
[03:38:24] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.004 seconds
[03:38:24] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.259 seconds
[03:42:27] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:42:36] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:57:00] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 29 seconds
[04:00:09] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.278 seconds
[04:01:03] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 272 seconds
[04:05:21] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 3.772 seconds
[04:08:30] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[04:08:48] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:11:39] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:14:57] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 253 seconds
[04:17:03] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.660 seconds
[04:22:27] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:12] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.452 seconds
[04:30:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:38:30] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.982 seconds
[04:40:00] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.239 seconds
[04:42:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:48:06] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:49:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.081 seconds
[04:54:51] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.464 seconds
[04:55:54] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:58:54] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:00:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.695 seconds
[05:04:58] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:11:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.957 seconds
[05:11:34] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.087 seconds
[05:20:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:21:01] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:21:55] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.551 seconds
[05:22:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.029 seconds
[05:27:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:28:49] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:29:07] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.780 seconds
[05:31:22] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.928 seconds
[05:52:49] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 1 seconds
[05:58:16] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 234 seconds
[06:37:07] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:37:07] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:38:19] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 0.960 seconds
[06:38:19] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.933 seconds
[06:56:19] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 5 seconds
[07:00:13] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 239 seconds
[07:17:01] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[07:23:38] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 260 seconds
[07:47:28] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:50:01] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [07:51:04] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [07:53:46] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection refused [07:54:58] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [07:56:55] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [07:57:13] PROBLEM - Lucene on search6 is CRITICAL: Connection refused [08:02:55] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [08:03:49] RECOVERY - Lucene on search6 is OK: TCP OK - 0.014 second response time on port 8123 [08:21:04] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [08:22:16] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 6 seconds [08:28:43] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 235 seconds [08:29:01] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [08:39:04] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours [08:39:04] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [08:48:58] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [08:59:12] nagios-wm bot [08:59:49] !log test [08:59:51] Logged the message, Master [09:00:00] cool :) [09:27:16] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 1 seconds [09:32:22] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 264 seconds [09:46:28] PROBLEM - check_gcsip on payments4 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [09:50:22] RECOVERY - check_gcsip on payments4 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.164 second response time [10:01:17] 
RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 26 seconds [10:04:26] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 230 seconds [10:09:41] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [10:14:47] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 231 seconds [10:31:22] MediaWiki::restInPeace() commits the deferred updates and closes the task gracefully [10:31:40] my god, what a nice function name [10:32:02] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [10:32:27] who named it? [10:32:41] Nirvanchik: svn blame it [10:34:51] saper:why? [10:35:56] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 190 seconds [10:36:16] svn = subversion? [10:36:30] who can it [10:36:33] how can it [10:39:10] "svn blame" or "svn annotate" is an SVN command which says which line was changed by whom; you need a "checkout" - a local copy of MediaWiki code checked out using "svn checkout" command or check the web interface at http://svn.wikimedia.org/viewvc/ [10:41:28] the problem is, "svn blame" shows you the *last* change, not the *first* - so you need to go back many releases to find out [10:43:31] Nirvanchik: http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Wiki.php?r1=12610&r2=12611& says it was added by Magnus Manske on Wed Jan 11 15:46:01 2006 UTC [10:46:11] saper: thanks. this explains everything. sorry for my lack of svn knowledge [10:58:11] Nirvanchik: no problem, I didn't have this knowledge too few years ago:) [11:09:08] New patchset: Dzahn; "fix pubkey for aengels" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2602 [11:09:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2602 [11:11:22] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds [11:14:17] New review: Dzahn; "wrong one, user did not have matching private key. yes, not leaving the old one in as "absent", it w..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2602 [11:14:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2602 [11:14:58] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 205 seconds [11:26:49] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [11:29:40] PROBLEM - Puppet freshness on carbon is CRITICAL: Puppet has not run in the last 10 hours [11:30:43] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 198 seconds [11:48:34] Hi all, I'd like to report a security incident about en.wikipedia.org happened yesterday (Feb 14), from 21h45 (paris time) for at least 10 minutes. XSS I guess. redirection to a phishing site. [11:50:13] ggherdov: either email security@wikimedia.org or go to https://bugzilla.wikimedia.org and select "Security" when filing a new bug [11:50:13] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [11:50:32] p858snake|l_: ok thx [11:50:40] Yeah, let's not post details about XSS exploits in a public channel :) [11:51:02] RoanKattouw: I didn't do that! [11:51:22] I know :) [11:51:37] mailing at securty@wikimedia.org . thx [11:58:01] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 212 seconds [12:20:49] RoanKattouw, hello there [12:20:58] Hey [12:21:27] RoanKattouw, I have some problems with my account on arwiki... a silly crat renamed me and I never asked him to do so [12:21:36] I asked him to usurp a user and he renamed me to this user [12:21:53] is it possible to merge the accounts ? now I have some contributions on one and some on the other... 
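saper points out above that `svn blame` only shows the *last* change to each line. To find where something like `restInPeace()` first appeared, you would otherwise have to step back through many releases. One way around that is to binary-search the revision numbers and check whether the file contains the identifier at each step. The sketch below assumes an `svn` client on PATH and uses `svn cat -r REV URL` to fetch a file at a given revision. The helper names are hypothetical, and the search is only correct if the text, once added, was never removed and re-added later.

```python
import subprocess
from typing import Callable, Optional


def file_at_revision(url: str, rev: int) -> str:
    """Return the file's contents at revision `rev` using `svn cat -r REV URL`.

    Requires an svn client on PATH. Returns "" if svn fails, for example
    because the path did not exist yet at that revision.
    """
    result = subprocess.run(
        ["svn", "cat", "-r", str(rev), url],
        capture_output=True,
        text=True,
    )
    return result.stdout if result.returncode == 0 else ""


def first_revision_containing(
    lo: int, hi: int, contains: Callable[[int], bool]
) -> Optional[int]:
    """Binary-search [lo, hi] for the first revision where contains(rev) is True.

    Assumes the predicate is monotonic: once the text appears, it stays.
    Returns None if the text is absent even at `hi`.
    """
    if not contains(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if contains(mid):
            hi = mid  # present at mid: first occurrence is at or before mid
        else:
            lo = mid + 1  # absent at mid: first occurrence is after mid
    return lo
```

Usage would look like `first_revision_containing(1, head_rev, lambda r: "restInPeace" in file_at_revision(wiki_php_url, r))`, where `head_rev` and `wiki_php_url` (the repository URL of `includes/Wiki.php`) are placeholders you would fill in. This needs about log2(N) `svn cat` calls instead of one per revision. The viewvc diff linked above (r12610 to r12611) is the answer such a search should arrive at.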
[12:23:36] I have no idea [12:25:28] RoanKattouw, you cannot give the contributions from [[User:Lou]] to [[User:Quentinv57]] on arwiki ? [12:26:04] I'm not at all familiar with user renames or usurpation [12:26:21] I've fixed incomplete renames a few times before but that's easy [12:35:33] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:36] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:24] Quentinv57, that's not possible or at least nobody has ever done it [12:45:31] although some people asked [12:51:50] Nemo_bis, what kind of weapon should I use for this silly crat ? [12:52:11] Quentinv57, make yourself crat there [12:52:21] usurp your account back after choosing one [12:53:08] Nemo_bis, the issue is that I did not notice the problem and I recreated account Quentinv57 and contributed with it [12:53:19] losing a couple edits shouldn't be a tragedy [12:53:32] I still have many edits in pre-SUL accounts [12:53:38] Nemo_bis, but renaming my account without any reason [12:53:39] is [12:53:52] seriously this crat is really silly [12:54:16] come on, renaming with CentralAuth is an inherently broken feature [12:54:23] you can't really blame him [12:54:41] Although yes, maybe when he sees a big red warning he should think more. [12:55:54] Nemo_bis, the thing is that I never asked [12:56:15] What can I say, deflag him. [12:56:20] I see the guy waking up and saying himself "oh, who am I going to rename today ?" [12:56:29] "Blatantly incompetent, a threat for the project." [12:56:33] Nemo_bis, that's a pretty good idea [12:56:38] not really [12:57:31] By the way, how's possible that you didn't receive an enotif for the rename? [12:58:29] Nemo_bis, don't think that MediaWiki sends an email or so for that [12:58:33] yes it does [12:59:40] Nemo_bis, at least I did not receive one or read it... 
it was in 2010 [12:59:57] Come on, there's nothing to be done here, 13 edits on the new account and only 5 in the old. It's almost as useless as the global account deletion toy denied yesterday. :p [12:59:58] I just noticed now that one of my user page is a redirect [13:00:25] Perhaps you missed https://bugzilla.wikimedia.org/show_bug.cgi?id=14901 ? [13:00:27] Nemo_bis, almost [13:01:30] Nemo_bis, I just wonder why I got renamed there... moreover the SUL account [[User:Lou]] already belongs to someone else [13:12:10] !log hashar synchronized wmf-config/codereview.php [13:12:12] Logged the message, Master [13:22:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.711 seconds [13:26:03] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:29:03] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.883 seconds [13:30:59] Nemo_bis: aren't u in SF,CA? [13:35:48] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:30] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.965 seconds [13:42:33] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:10] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [13:52:18] Nirvanchik, surely not [13:55:40] New patchset: Catrope; "Add .gitreview file" [operations/dumps] (master) - https://gerrit.wikimedia.org/r/2603 [13:56:03] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 204 seconds [13:59:13] New patchset: Hashar; "adding in .gitreview" [operations/dumps] (master) - https://gerrit.wikimedia.org/r/2604 [14:00:41] Change abandoned: Catrope; "Already done correctly (repo path is wrong, missing .git) in https://gerrit.wikimedia.org/r/#change,..." 
[operations/dumps] (master) - https://gerrit.wikimedia.org/r/2604 [14:02:14] New patchset: Catrope; "Add .gitreview file for the ariel branch as well" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/2605 [14:02:27] New review: gerrit2; "Lint check passed." [operations/dumps] (ariel); V: 1 - https://gerrit.wikimedia.org/r/2605 [14:06:51] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 134 MB (1% inode=62%): /var/lib/ureadahead/debugfs 134 MB (1% inode=62%): [14:07:54] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:08:12] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 170 MB (2% inode=62%): /var/lib/ureadahead/debugfs 170 MB (2% inode=62%): [14:10:27] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [14:10:45] RECOVERY - Disk space on srv223 is OK: DISK OK [14:12:57] New review: ArielGlenn; "(no comment)" [operations/dumps] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2603 [14:12:57] Change merged: ArielGlenn; [operations/dumps] (master) - https://gerrit.wikimedia.org/r/2603 [14:14:12] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [14:14:42] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2605 [14:14:43] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/2605 [14:16:10] RECOVERY - Disk space on srv224 is OK: DISK OK [14:19:27] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 227 seconds [14:41:51] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 19 seconds [14:43:02] +1 for my isp [14:46:26] New review: Hashar; "Looks like the .git is optional since I have pushed that change without it." 
[operations/dumps] (master) - https://gerrit.wikimedia.org/r/2604 [14:47:06] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 249 seconds [14:54:54] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.546 seconds [14:59:06] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:01:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.971 seconds [15:05:42] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:54] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 9 seconds [15:06:54] !log reedy synchronized php-1.19/extensions/SpamBlacklist/SpamBlacklistHooks.php 'r111543' [15:06:57] Logged the message, Master [15:09:09] PROBLEM - Lucene on search1002 is CRITICAL: Connection refused [15:10:48] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 243 seconds [15:12:34] New patchset: Hashar; "misc::contint::jenkins now install jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2606 [15:18:57] New patchset: Catrope; "misc::contint::jenkins now install jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2606 [15:20:13] New review: Catrope; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2606 [15:20:23] New review: Hashar; "Thanks Roan!" 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2606 [15:21:36] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.747 seconds [15:23:24] PROBLEM - Disk space on db40 is CRITICAL: DISK CRITICAL - free space: /a 91029 MB (3% inode=99%): [15:24:09] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.848 seconds [15:24:09] PROBLEM - MySQL disk space on db40 is CRITICAL: DISK CRITICAL - free space: /a 90973 MB (3% inode=99%): [15:24:45] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.027 second response time on port 8123 [15:27:54] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:13] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:06] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [15:30:27] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 15 seconds [15:35:34] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 262 seconds [15:36:36] PROBLEM - Lucene on search1002 is CRITICAL: Connection refused [15:47:51] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.482 seconds [15:53:04] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:04] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [16:01:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 84, down: 3, dormant: 0, excluded: 0, unused: 0BRae3.1019: down - Subnet private1-c-eqiadBRae3.32767: down - BRae3.1003: down - Subnet public1-c-eqiadBR [16:02:31] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.098 seconds [16:05:00] Reedy: I also need import on 
test2 to import gadgets to test [16:05:03] halp? [16:06:57] ? [16:07:14] import priv? [16:07:19] lol [16:07:47] Special:UserRights? [16:08:14] done [16:08:46] :) [16:08:48] tyvm [16:10:03] hexmode: I looked at the php sessions error you had on debian : https://bugzilla.wikimedia.org/show_bug.cgi?id=34376 [16:10:12] not sure we can do anything about it :/ [16:10:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:28] well, it isn't a huge deal [16:10:38] just thought it should be recorded [16:10:46] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 87, down: 3, dormant: 0, excluded: 0, unused: 0BRae3.1003: down - Subnet public1-c-eqiadBRae3.1019: down - Subnet private1-c-eqiadBRae3.32767: down - BR [16:18:11] hi [16:18:13] New review: Mark Bergsma; "Can you please move this out of the puppet repository, and put it in operations/software instead? Or..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2601 [16:18:39] facing small issue with the proofreading module. [16:18:48] have uploaded the djvu file. [16:19:13] when i edit it, in some cases the image is displayed. but for some its not. 
[16:20:03] djvu file: http://mr.wikisource.org/wiki/%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%95%E0%A5%8D%E0%A4%B0%E0%A4%AE%E0%A4%A3%E0%A4%BF%E0%A4%95%E0%A4%BE:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu [16:20:22] Image is displayed: http://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/3&action=edit&redlink=1 [16:20:32] Image is not displayed: http://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/1&action=edit&redlink=1 [16:23:28] apergos, I don't remember, did you update the docs on export in mw.o? [16:23:40] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.1409206087 (gt 8.0) [16:23:44] I don't think so [16:23:48] ok [16:23:49] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.227 seconds [16:25:55] RECOVERY - Puppet freshness on carbon is OK: puppet ran at Wed Feb 15 16:25:43 UTC 2012 [16:26:04] RECOVERY - SSH on carbon is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:26:13] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.85601438596 [16:27:40] New patchset: Mark Bergsma; "Prepare oxygen for multicast relaying" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2607 [16:28:01] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:29] New patchset: Mark Bergsma; "Comment again, until I have time to look at it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2608 [16:36:59] Reedy: just imported MediaWiki:Gadgets-definition but no change is shown? [16:37:08] WTF? 
[16:37:21] hexmode: It might not hook into import, try making a whitespace edit to that page [16:37:28] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.005 seconds [16:37:28] (This is fixed in the RL2 branch) [16:37:58] Yeah [16:38:08] It doesn't [16:38:16] Reedy: so, I edit the page and it has the old contents [16:38:39] AfterImportPage? [16:38:48] So it didn't import by the sounds of it [16:39:04] It shows the import in page history [16:39:06] weir [16:39:07] d [16:40:00] and the import: https://test2.wikipedia.org/w/index.php?title=Special:RecentChanges&limit=100 [16:41:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:41:45] it looks like you've reverted it from the history... [16:42:48] Revert it back to anomies rev? [16:43:08] I don't understand how I've reverted it, but ok [16:43:38] lol [16:43:41] I'm not sure either [16:44:02] unless it didn't import it as the newest rev [16:44:02] I see the comment [16:44:06] and yours is newer.. [16:44:11] "browsing" and ??? [16:44:16] yeah [16:44:21] maybe that is it [16:44:57] I'm not sure if htat's how it's supposed to work [16:45:16] I've only ever imported pages that didn't exist at all locally [16:47:20] New review: Dzahn; "enhanced page_all SMS script (the one for manual use, does not affect nagios)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2264 [16:47:23] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2264 [16:49:42] Reedy: it makes sense if you want to splice in a history [16:49:50] aye [16:52:16] Reedy: quick question. The djvu issue which we discussed yestarday is partially resolved. on one page i am able to see the image using proofread module while editing. but for some am, blank image is displayed. should i raise ticket bugzilla? 
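The import puzzle above ("unless it didn't import it as the newest rev", "it makes sense if you want to splice in a history") comes down to how imported revisions are placed. As far as I understand, Special:Import inserts each revision into the page history by its original timestamp. It only moves the page's "latest revision" pointer if the imported revision is newer than the current one. An older imported revision therefore appears in the history but does not change what the page shows. The sketch below is a simplified model of that rule, not MediaWiki's code. `Page`, `Revision`, and `import_revision` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    rev_id: int
    timestamp: str  # fixed-width ISO 8601 UTC, so string order == time order
    text: str


@dataclass
class Page:
    revisions: List[Revision] = field(default_factory=list)
    latest: Optional[Revision] = None

    def save(self, rev: Revision) -> None:
        """A normal edit: appended to history and always becomes current."""
        self.revisions.append(rev)
        self.latest = rev

    def import_revision(self, rev: Revision) -> bool:
        """Splice an imported revision into history by timestamp.

        The 'latest' pointer only moves if the imported revision is newer
        than the current one. Returns True in that case.
        """
        self.revisions.append(rev)
        self.revisions.sort(key=lambda r: r.timestamp)
        if self.latest is None or rev.timestamp > self.latest.timestamp:
            self.latest = rev
            return True
        return False
```

Under this model, importing an older copy of a page such as MediaWiki:Gadgets-definition over a more recent local edit leaves the local text current. That matches what hexmode saw. Reverting to the imported revision, or making a fresh edit, is then what makes it take effect.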
[16:54:43] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.558 seconds [16:55:51] shantanoo: have you tried purging them? [16:56:16] Reedy: nope. how do i do that? [16:57:07] append ?action=purge [16:58:00] Quedel: anything to report yet? [16:58:28] yeah, i try it to find some words in english for this happening and how to reconstruct it [16:58:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:00] Reedy: didn't work. did purge on djvu file. http://mr.wikisource.org/wiki/%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%95%E0%A5%8D%E0%A4%B0%E0%A4%AE%E0%A4%A3%E0%A4%BF%E0%A4%95%E0%A4%BE:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu?action=purge [16:59:30] chrismcmahon: was german a language you could use? [17:00:07] hexmode: not sure I understand the question. use for what? [17:00:34] https://upload.wikimedia.org/wikisource/mr/thumb/e/ef/%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/page2-180px-%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu.jpg [17:00:45] CdbReader_PHP::read: read from CDB file failed, file may be corrupted [17:00:47] That's interesting [17:00:59] chrismcmahon: Quedel isn't comfy with english. I was trying to figure out if you could understand german if he reported in german. [17:01:39] hexmode: I doubt I'm that good. [17:02:18] I test a script (for deleting) from my de.wp-scripts. By testing it a editing-collision wasn't detected (or whatever "Bearbeitungskonflikt" in english is). See this link: https://test2.wikipedia.org/w/index.php?title=Wikipedia%3AL%C3%B6schkandidaten%2F15._Februar_2012&diff=35596&oldid=35595 - the script opened a new tab from this site, adding via script a new section and inserting some text... [17:02:20] ...via script. 
In the first tab the article was edited and get some code in front of the text. Both editing tabs I have manually saved, but the first saved window-edit was deleted by the second. At de.wp there would be a warning about conflict by editing the same site. [17:02:48] Reedy: i have the pdf file. should i try with it? i used pdf2djvu utility for conversion. [17:03:07] Nah [17:03:15] It's coming up as a site error [17:03:19] Quedel: url for script? [17:03:21] oh. ok. [17:03:41] although the thumbnailer is surfacing it, mediawiki itself would seem to be at fault [17:03:48] or rather, the language cache files are [17:03:53] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.982 seconds [17:04:46] Give it an hour or so, and I'll initially ping the relevant people, but might need to look into the cache file issue a bit more closely [17:05:12] Reedy: http://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/3&action=edit&redlink=1 is working. but [17:05:19] Reedy: http://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/3&action=edit&redlink=1 is working. but [17:05:20] Reedy: http://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/3&action=edit&redlink=1 is working. but [17:05:22] Reedy: http://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/3&action=edit&redlink=1 is working. 
but [17:05:25] Reedy: http://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/3&action=edit&redlink=1 is working. but [17:05:26] Reedy: http://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/3&action=edit&redlink=1 is working. but [17:05:34] btw, RoanKattouw_away noticed that LU runs with a fuckload of errors again? [17:05:48] Reedy: http://mr.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A4%BE%E0%A4%A8:%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/3&action=edit&redlink=1 is working. [17:05:49] shantanoo: you just posted the same message 5 times [17:05:52] ... [17:05:52] @hexmode: https://test2.wikipedia.org/wiki/Male_user:Quedel/monobook.js - but I don't know the right sub-script where it is loaded. The behaviour of the script is as it should be, but no edit-conflict was displayed by mediawiki [17:05:54] or running on demand is does [17:06:07] (my bad, 7) [17:06:15] your bad? :p [17:06:22] Snowolf: oops. sorry. it was not displayed on my screen... [17:06:50] thought there is limitation for sending text to this channel... [17:06:51] Reedy: a silly way to say "my mistake" in broken english used on certain DoD servers which I can't seem to get rid of [17:07:15] (dod being Day of Defeat, not the department of defense) [17:07:30] Reedy: for the 3rd page, image is displayed, but the thumbnail url is showing error [17:07:55] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:09:18] Yeah [17:09:22] shantanoo, you're not seriously expecting DjVu thumbs to work? 
:p [17:09:24] It's most likely all related [17:09:28] ish [17:10:06] Quedel: switching to monobook and trying to use your script [17:10:41] Nemo_bis: hey, so is there any other format which i can use? tiff with multiple pages? [17:10:47] uh? https://upload.wikimedia.org/wikisource/mr/thumb/e/ef/%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/page3-1200px-%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu.jpg [17:10:59] Nemo_bis: the error is completely unrelated to djvu [17:11:08] exactly [17:11:16] http://p.defau.lt/?bohQFl76k4yVuvXVMAM2Ng just in case [17:11:30] it's very unlikely to fix itself ;) [17:11:41] :) [17:11:51] Reedy: I had very vaguely noticed that in the SAL, have not investigated yet [17:12:16] RoanKattouw_away: Tim fixed some permission things again on monday night [17:12:16] I was just saying that thumbs are something Wikisource users will constantly have to battle with. [17:12:36] Reedy: can't find any issue with https://mr.wikisource.org/wiki/%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%95%E0%A5%8D%E0%A4%B0%E0%A4%AE%E0%A4%A3%E0%A4%BF%E0%A4%95%E0%A4%BE:Wind_in_the_Willows_%281913%29.djvu [17:12:53] Does that one come from commons? [17:13:09] Quedel: I'm on that url... how can I see the bug? What do I do? [17:13:42] Reedy: could be. phe uploaded it. let me try if he is around. [17:13:59] if it is from commons, there'll be pre-existing thumbs [17:14:12] yeah [17:14:12] I'm at monobook already. Tested second, same effect: This site was loaded to edit in the same window and a new tab in this version ( https://test2.wikipedia.org/w/index.php?title=Wikipedia:L%C3%B6schkandidaten/15._Februar_2012&oldid=35601 ), after it the script adds the code in front at 1st tab, the new section as 2nd tab. Now I press Save at 2nd tab (with the new section) and only a second... 
[17:14:12] https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Wind_in_the_Willows_%281913%29.djvu/page1-220px-Wind_in_the_Willows_%281913%29.djvu.jpg [17:14:14] ...after I press save on the first tab (with the insert text at front of the site). But then, the new section were deleted [17:15:11] @hexmode: https://test2.wikipedia.org/w/index.php?title=Wikipedia%3AL%C3%B6schkandidaten%2F15._Februar_2012&diff=35604&oldid=35603 at this dif, the last section ( " ==[[Wikipedia:Löschkandidaten ...) were deleted, but it should still remain there [17:15:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.382 seconds [17:15:52] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.209 seconds [17:17:22] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.031 second response time on port 8123 [17:18:19] ok, so you're expecting both sections, right? But it looks like the second save doesn't merge the first, right? [17:18:57] yes, hexmode [17:19:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:01] robla: https://upload.wikimedia.org/wikisource/mr/thumb/e/ef/%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/page3-1200px-%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu.jpg [17:20:04] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:28] ok, Can you file a new bug here: https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki [17:20:43] robla: it's not a swift/thumbnail bug itself.. It's MW [17:20:59] If this is a regression -- it sounds like it -- then it def needs to be fixed [17:21:28] hmm...I wonder if we have some sort of cache size issue that's biting us [17:21:36] Reedy: does this ^^ last 3 comments ^^ sound like a regression? [17:21:50] i.e. 
what we saw yesterday+Monday on 1.19, but now bled over to 1.18 [17:22:01] robla: tim logged a bug for these errors somewhere, but can't find it [17:22:12] robla: LU isn't actually running properly either (permission errors) [17:22:16] so it could be numerous things [17:22:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.923 seconds [17:24:20] hexmode: https://test2.wikipedia.org/w/index.php?title=File:AaatestSonnepalmenstrand-portrait_new.jpg&diff=35610&oldid=35609 same here [17:24:35] okay, i need to register there first. I tested it versa: if i first save the edit with text in front and then the edit with the new section, both edits are merged correctly [17:24:52] oh.. no [17:24:54] not here [17:25:00] quedel's text is still in - sorry [17:25:23] Saibo: could you file a bz report so Quedel can register later and comment? [17:25:28] It could be if one edit is adding a new section [17:25:59] hexmode: yes, okay [17:26:05] :) [17:26:32] hexmode: against 1.19 or against 1.19 deployment? [17:26:37] robla: Reedy: looks like quedel found a significant regression with edit conflicts [17:26:43] 1.19 [17:26:47] k [17:26:47] bug #? 
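Quedel's expectation here is that two edits started from the same base revision should merge cleanly when they touch different parts of the page. In his case, one tab prepended text and the other appended a new section. MediaWiki itself, when an edit is saved against an out-of-date base revision, attempts a three-way merge by shelling out to an external `diff3` (configured via `$wgDiff3`). It only shows the edit-conflict screen if that merge fails. The sketch below is a simplified, line-based stand-in built on Python's `difflib`, not MediaWiki's implementation. It shows which outcome to expect. The function names are hypothetical, and unlike real `diff3` it treats two identical changes to the same lines as a conflict.

```python
import difflib
from typing import List, Optional, Tuple

Hunk = Tuple[int, int, List[str]]  # (base_start, base_end, replacement_lines)


def changed_hunks(base: List[str], edited: List[str]) -> List[Hunk]:
    """Regions of `base` that `edited` replaces, together with the new lines."""
    matcher = difflib.SequenceMatcher(a=base, b=edited, autojunk=False)
    return [
        (i1, i2, edited[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def overlaps(a: Hunk, b: Hunk) -> bool:
    """Hunks conflict if their base ranges intersect or both insert at one point."""
    a1, a2, _ = a
    b1, b2, _ = b
    if a1 == a2 == b1 == b2:
        return True
    return a1 < b2 and b1 < a2


def three_way_merge(
    base: List[str], first: List[str], second: List[str]
) -> Optional[List[str]]:
    """Merge two edits of the same base revision; None means 'edit conflict'."""
    ours = changed_hunks(base, first)
    theirs = changed_hunks(base, second)
    if any(overlaps(a, b) for a in ours for b in theirs):
        return None
    merged: List[str] = []
    pos = 0
    for start, end, lines in sorted(ours + theirs, key=lambda h: (h[0], h[1])):
        merged.extend(base[pos:start])  # unchanged lines before this hunk
        merged.extend(lines)  # the edited replacement for this hunk
        pos = end
    merged.extend(base[pos:])  # unchanged tail
    return merged
```

With a page `["Intro", "== Old ==", "text"]`, one edit that prepends a line and another that appends a section both survive the merge. Only edits touching the same lines return `None`. If the second save silently drops the first tab's change, it is doing neither of these things, which is why it is worth a bug report.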
[17:26:58] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:00] robla: none yet - I will report it soon [17:34:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.000 seconds [17:36:58] New patchset: Demon; "Adding .gitreview" [mediawiki/tools/mwdumper] (master) - https://gerrit.wikimedia.org/r/2609 [17:37:16] New review: Demon; "(no comment)" [mediawiki/tools/mwdumper] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2609 [17:37:16] Change merged: Demon; [mediawiki/tools/mwdumper] (master) - https://gerrit.wikimedia.org/r/2609 [17:38:49] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:56] New patchset: RobH; "added candium and roan to access it to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2610 [17:40:10] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.813 seconds [17:40:11] RobH: You mean cadmium? [17:40:25] yea i think i typo'd the damned file as well [17:40:26] sigh [17:40:34] (Although I'm sure some people would argue that candy-um should totally be an element :D ) [17:41:52] slurp [17:42:01] better not to confuse those though [17:44:04] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:31] New patchset: RobH; "added candium and roan to access it to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2610 [17:45:53] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2610 [17:47:15] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2610 [17:47:15] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2610 [17:47:40] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.667 seconds [17:51:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:52] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [17:55:55] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [17:57:52] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [17:59:50] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 241 seconds [18:01:24] robla, any news on this? https://bugzilla.wikimedia.org/show_bug.cgi?id=16112#c26 [18:01:52] woosters: ^ [18:03:23] hexmode: Reedy: scrap that previous Edit Conflict problem report. There seem to be problems (we are currently trying to nail them down) but they are existing also in 1.18 on production wikis ;) [18:04:19] Saibo: k, but I'd still like a bug report for this. It seems icky [18:04:25] yes - sure [18:04:27] even if it isn't a blocker [18:06:05] Nemo_bis: here's the scoop. Reedy is will support someone in ops to set that up. woosters assigned it to mutante, but mutante is going to be out for the next two weeks. 
I don't know if woosters has a plan B [18:06:38] nice :) [18:06:55] Nemo_bis: 1.19 has us a little busy atm :) [18:06:58] We can surely wait 2-3 weeks [18:06:59] let me see if i could find someone to do it [18:07:00] of course [18:07:10] that would be even better :) [18:07:20] has to be this today or tomorrow [18:07:29] otherwise would have to be after 1.19 [18:07:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.464 seconds [18:08:58] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:10:10] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [18:12:30] Reedy: now that I'm not trying to multitask, I'm finally figuring out the CDB problem you referred me to earlier [18:12:37] we need ops support to fix? [18:14:27] AaronSchulz: here's the problem I'm talking about: (09:20:01 AM) Reedy: robla: https://upload.wikimedia.org/wikisource/mr/thumb/e/ef/%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu/page3-1200px-%E0%A4%AB%E0%A5%81%E0%A4%B2%E0%A4%BE%E0%A4%9A%E0%A4%BE_%E0%A4%AA%E0%A5%8D%E0%A4%B0%E0%A4%AF%E0%A5%8B%E0%A4%97.djvu.jpg [18:14:31] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.701 seconds [18:15:09] (09:22:02 AM) Reedy: robla: tim logged a bug for these errors somewhere, but can't find it [18:15:09] (09:22:12 AM) Reedy: robla: LU isn't actually running properly either (permission errors) [18:15:29] New patchset: Demon; "Adding redirect for easier finding of gitweb urls" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2611 [18:15:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:13] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.479 seconds [18:18:16] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 2 
seconds [18:21:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:18] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:52] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [18:22:10] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 237 seconds [18:24:52] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [18:28:46] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 224 seconds [18:29:49] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [18:35:48] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 20.9923882609 (gt 8.0) [18:38:02] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.743 seconds [18:38:42] New review: Demon; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2575 [18:38:42] Change merged: Demon; [operations/software] (master) - https://gerrit.wikimedia.org/r/2575 [18:39:57] New patchset: Pyoungmeister; "perms: they matter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2612 [18:40:08] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours [18:40:08] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [18:42:05] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:41] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 2 seconds [18:44:41] Reedy: Saibo: enabling all gadgets got no js problems but the green-on-black widget is funky and sidebar links can't be clicked :( [18:45:00] got to step out to get something for my wife ... bbiaf [18:45:07] :D [18:45:15] green-on-black.. emm.. 
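The Packetloss_Average alerts on locke state their rule inline: "(gt 8.0)". The sketch below shows that comparison and assumes a single critical threshold, because no WARNING level appears in the log. The function name is hypothetical; this is not the actual Nagios plugin.

```python
# Hypothetical re-creation of the rule in the "Packetloss_Average on locke"
# alerts: CRITICAL when the average exceeds 8.0 ("gt 8.0"), OK otherwise.
# Assumption: there is no separate WARNING level (none appears in the log).

CRITICAL_THRESHOLD = 8.0  # taken from "(gt 8.0)" in the alert text


def packetloss_status(average: float, critical: float = CRITICAL_THRESHOLD) -> str:
    """Return a Nagios-style status line for a packet-loss average."""
    if average > critical:
        return f"CRITICAL: packet_loss_average is {average} (gt {critical})"
    return f"OK: packet_loss_average is {average}"


if __name__ == "__main__":
    # Values copied from the Packetloss_Average alerts in this log.
    for value in (20.9923882609, 9.01438956522, 1.82914284483):
        print(packetloss_status(value))
```

Because the comparison is strictly "greater than", an average of exactly 8.0 would still report OK.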
[18:45:20] don't know that [18:45:51] Enabling all gadgets probably isn't the best way to go about it [18:45:52] "Use a black background with green text on the Monobook skin" [18:46:35] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 228 seconds [18:46:37] Reedy: it is how I got the js problems before, but yes, I'm expecting there will be weird behavior :) [18:46:45] :p [18:47:10] luckily there is no black-on-darkgrey gadget ;) [18:47:28] that together with green on black... [18:47:37] does anybody know https://en.wikipedia.org/wiki/Liferay ? [18:47:44] apparently it can integrate wikis (?) [18:47:53] New patchset: Pyoungmeister; "and let's append to a log file, shall we?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2613 [18:48:19] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2612 [18:48:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2612 [18:50:11] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [18:51:24] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2613 [18:51:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2613 [18:56:36] !log Running sync-l10nupdate by hand to see what kind of perms errors I get, first for 1.18 then for 1.19 [18:56:38] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.01438956522 (gt 8.0) [18:56:38] Logged the message, Mr. Obvious [19:02:19] !log Fixing permissions for php-1.18/cache/l10n on all apaches as root [19:02:21] Logged the message, Mr. 
Obvious [19:04:26] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.421 seconds [19:04:35] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.525 seconds [19:05:01] !log Rerunning l10nupdate by hand to hopefully fix CDB problems [19:05:04] Logged the message, Mr. Obvious [19:08:20] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:08:29] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:11] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.82914284483 [19:12:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.716 seconds [19:15:23] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 2 seconds [19:18:38] /usr/local/apache/common-local/php-1.17/extensions/WikiEditor/modules/./images/toolbar/loading.gif in /usr/local/apache/common-local/php-1.18/includes/resourceloader/ResourceLoaderFileModule.php on line 380 [19:18:53] PHP Warning: filemtime() [function.filemtime]: stat failed [19:19:17] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 204 seconds [19:19:26] Reedy: so 1.18 code is stating a 1.17 file [19:19:34] of course you nuked 1.17 ;( [19:19:39] heh [19:19:54] Oh [19:19:56] I nuked 1.18 [19:19:57] *1.17 [19:20:05] And that's probably just the deps cache in RL [19:20:11] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:24] AaronSchulz: Does it tell you which wiki generated that exception? [19:20:48] nope [19:20:54] hrmph [19:21:12] Hopefully it will go away as we move to 1.19 [19:21:24] But maybe it's just a minor bug in RL's dependency tracking code [19:21:26] hope & change [19:21:34] 1.19 =~ Obama? 
:D [19:21:53] !log LocalisationUpdate completed (1.18) at Wed Feb 15 19:21:53 UTC 2012 [19:21:54] !log LocalisationUpdate failed (1.19) at Wed Feb 15 19:21:54 UTC 2012 [19:21:55] Logged the message, Master [19:21:57] Logged the message, Master [19:23:20] RoanKattouw: still seeing exceptions [19:23:40] Boo! [19:23:48] Failed to write to file '/home/wikipedia/common/php-1.19/cache/l10n/l10nupdate-ab.cache' [19:23:50] wtf [19:24:15] srv193 testwiki: Error: invalid magic word 'contributiontotal' hehe [19:25:13] Local perms snafu on fenari, fixed [19:25:24] !log Fixed perms for /h/w/c/php-1.19/cache/l10n on fenari [19:25:26] Logged the message, Mr. Obvious [19:27:11] !log Manually running the LU script for 1.19 [19:27:14] Logged the message, Mr. Obvious [19:30:50] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.602 seconds [19:33:41] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 0 seconds [19:34:44] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:53] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.716 seconds [19:37:44] PROBLEM - MySQL Slave Delay on db31 is CRITICAL: CRIT replication delay 253 seconds [19:38:47] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:08] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.770 seconds [19:44:02] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:41] !log Syncing l10nupdate files for 1.19 [19:44:43] Logged the message, Mr. 
Obvious [19:46:47] Hmph, still getting CDB errors [19:46:52] But only from the scalers it seems [19:48:47] as before [19:49:46] !log catrope synchronized php-1.18/includes/Cdb_PHP.php 'Live hack for logging filenames in CDB errors' [19:49:48] Logged the message, Master [19:49:56] Yeah, you'd noticed that before but I hadn't [19:50:06] RoanKattouw: lol, I'm doing that in trunk now [19:50:07] The perms errors in sync-l10nupdate were also scaler-specific, thought you were talking about that [19:50:11] Good [19:50:17] Then you can port it to 1.19 and we won't need my live hack [19:52:34] Oh, I think I know what's going on [19:52:43] Stupid scalers with their stupid small /tmp [19:54:08] !log Deleting /tmp/mw-cache-*/l10nupdate-* on the image scalers [19:54:10] Logged the message, Mr. Obvious [19:54:59] RECOVERY - MySQL Slave Delay on db31 is OK: OK replication delay 0 seconds [19:57:17] !log I meant l10n_cache* of course [19:57:19] Logged the message, Mr. Obvious [19:59:34] AaronSchulz, Reedy, robla: OK looks like that fixed the CDB exceptions. All I'm seeing now is 2012-02-15 19:58:59 srv193 testwiki: Error: invalid magic word 'contributiontotal' [19:59:45] cool [19:59:51] Fundraising? [19:59:54] yup [20:00:22] RoanKattouw: you can probably close the bug tim logged then.. [20:00:40] link? [20:00:53] You gave me a link a while ago but I can't find it now [20:01:07] nm got it [20:01:13] cool [20:01:21] No, that's still a problem [20:01:23] The root cause hasn't really been fixe [20:01:25] d [20:01:34] ContributionReporting.i18n.magic.php [20:01:49] 'contributiontotal' => ( 0, 'contributiontotal' ) [20:03:19] almost sounds like it's not being registered.. 
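The cleanup logged at 19:54 and 19:57 removes stale l10n_cache* files from /tmp/mw-cache-* on the image scalers, which clears the CDB exceptions. The following is a minimal sketch of a more targeted version of that cleanup. It assumes that a corrupt cache appears as a zero-byte .cdb file located directly inside a mw-cache-<version> directory. The function names are hypothetical, and the default is a dry run that only lists the matching files.

```python
import glob
import os


def find_empty_cdb_files(tmp_root="/tmp"):
    """Return sorted paths of zero-byte *.cdb files in <tmp_root>/mw-cache*/.

    Assumption: the cache files sit directly in the mw-cache-<version>
    directories (one level deep); subdirectories are not searched.
    """
    pattern = os.path.join(tmp_root, "mw-cache*", "*.cdb")
    return sorted(
        path
        for path in glob.glob(pattern)
        if os.path.isfile(path) and os.path.getsize(path) == 0
    )


def remove_empty_cdb_files(tmp_root="/tmp", dry_run=True):
    """List the zero-byte CDB files and, unless dry_run is set, delete them."""
    empty = find_empty_cdb_files(tmp_root)
    if not dry_run:
        for path in empty:
            os.remove(path)
    return empty
```

Because non-empty caches are left alone, only the broken files have to be regenerated, not the whole l10n cache.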
[20:04:54] I'll bugzilla it [20:04:58] FR can deal with it ;) [20:05:26] RoanKattouw: might be worth commenting on the bug for ease of reference [20:05:38] I don't think it's related [20:05:49] We had 0-byte CDB files in the scalers' /tmp for some reaon [20:05:59] I removed them, and they were recreated correctly [20:06:05] Probably transient full disk or something [20:06:12] This problem is known already, I have no real information to add [20:06:23] It's all a bit cdb bluur [20:06:24] what should we do about bug 34397 [20:06:26] !b 34397 [20:06:26] https://bugzilla.wikimedia.org/show_bug.cgi?id=34397 [20:08:10] you guys are going to accuse me of rickrolling you, aren't you? [20:09:07] Do we have anyone on staff who cares about simple? ;) [20:09:51] I suggest trying the revert hammer here [20:09:57] RoanKattouw: were there that many 0 byte files? [20:10:01] Keep reverting stuff in simple until it stops breaking [20:10:11] Not all of them, but a bunch, yeah [20:10:19] * robla looks at what changed [20:10:23] Only on the scalers and only for l10n_cache*.cdb [20:11:10] PROBLEM - MySQL Slave Delay on db31 is CRITICAL: CRIT replication delay 203 seconds [20:11:19] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.796 seconds [20:11:19] well, cdbwriter doesn't kill the tmp file on exceptions [20:11:24] so they could hang around that way [20:11:40] Aha [20:11:44] That's probably it then [20:12:06] well, CdbWriter_PHP doesn't [20:12:16] chrismcmahon: I'm not seeing anything nearly as bad as your screenshot with test2+simple [20:12:41] robla: I saw the same thing in 3 browsers, checking [20:12:56] I wonder if this one is a cache problem [20:14:03] robla: My Preferences/Appearance/Simple/Preview still shows the issue. [20:14:20] oh...wait a sec, I'm seeing it now. 
I wasn't looking in the right spot [20:15:26] looks like it should be a simple css fix [20:15:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:40] * robla may just take a crack at it [20:20:41] New patchset: Pyoungmeister; "should make sure we don't get openjdk in there too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2614 [20:20:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.095 seconds [20:21:55] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2614 [20:21:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2614 [20:22:25] dvwiktionary: /w/index.php?title=Special:Export&useskin=monobook Exception from line 170 of /usr/local/apache/common-local/php-1.18/includes/Cdb_PHP.php: CdbReader_PHP::read: read from CDB file failed, file may be corrupted (/tmp/mw-cache-1.18/l10n_cache-dv.cdb) [20:22:30] RoanKattouw: :) [20:22:52] * RoanKattouw shakes fist [20:24:18] * RoanKattouw runs dsh -cM -g mediawiki-installation -- 'find /tmp/mw-cache* -size 0b -name "*.cdb"' [20:24:28] Hah, just that one file [20:24:43] removed it [20:25:07] http://www.mediawiki.org/w/index.php?title=Special:Code/MediaWiki&path=%2Ftrunk%2Fphase3%2Fincludes%2FCdb_PHP.php&status=new [20:26:28] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.786 seconds [20:27:06] hehee [20:27:10] We wrote almost the exact same code [20:28:04] soo, how long till those RL errors stop spamming the apache logs? [20:28:52] the apache has a real-life too now? 
;) [20:29:23] * AaronSchulz is royally confused [20:30:03] Hmm [20:30:10] I need to remove all the old file refs from the DB once [20:30:15] I'll do that after 1.19 is fully deployed [20:30:26] They're harmless notices, just log spam [20:30:41] notices usually indicate incorrect code/assumptions :) [20:30:54] Yup [20:30:59] I think I fixed them all in 1.19 though [20:31:34] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.697 seconds [20:37:07] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:50:54] !log reedy synchronizing Wikimedia installation... : Pushing r111581 and making sure everything on the cluster is upto date [20:50:57] Logged the message, Master [20:51:26] Why is 1.19 trying to include VariablePage... [20:51:41] meh, fix that when scap is done [20:52:24] Reedy: you may also want to mft r111580. be scurred, I'm committing stuff [20:52:59] * Reedy wonders when robla become a MW frontend developer [20:53:32] sync done. [20:54:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.560 seconds [20:54:13] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.572 seconds [20:56:53] !log reedy synchronizing Wikimedia installation... : Rebuilt message lists and ExtensionMessages [20:56:55] Logged the message, Master [20:58:25] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:58:25] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:10] sync done. 
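At 20:11 the 0-byte files are traced to CdbWriter_PHP, which does not delete its temporary file when an exception is thrown. The generic pattern that prevents this is to write to a temporary file in the same directory, rename it over the target on success, and remove it on failure. The sketch below applies that pattern in Python. It is not the actual CdbWriter_PHP fix, and the `chunks` interface is a hypothetical simplification.

```python
import os
import tempfile


def write_atomically(target_path, chunks):
    """Write an iterable of bytes chunks to target_path via a temp file.

    On success the temp file is renamed over target_path (os.replace is
    atomic on POSIX when both paths are on the same filesystem, which is
    why the temp file is created in the target's directory). On any
    exception the temp file is removed, so no truncated or 0-byte file
    is left behind and an existing target is untouched; the exception
    is then re-raised.
    """
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            for chunk in chunks:
                handle.write(chunk)
        os.replace(tmp_path, target_path)
    except BaseException:
        # Clean up the partial temp file instead of leaving it around.
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this approach, readers see either the old complete file or the new complete file, never a partial one.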
[21:02:19] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.242 seconds [21:02:19] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.240 seconds [21:03:30] New patchset: Jgreen; "mystery solved, stupid typo on new apache vhost config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2615 [21:04:22] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2615 [21:04:23] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2615 [21:06:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:58] RECOVERY - Puppet freshness on grosley is OK: puppet ran at Wed Feb 15 21:06:36 UTC 2012 [21:07:24] New patchset: Pyoungmeister; "search hosts now can use xfs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2616 [21:08:28] RECOVERY - Puppet freshness on aluminium is OK: puppet ran at Wed Feb 15 21:08:06 UTC 2012 [21:17:11] anyone that's not too busy, in API:Usercontribs, is ucstart broken for ucuserprefix, or am I just doing it wrong? See: https://en.wikipedia.org/w/api.php?action=query&list=usercontribs&uclimit=500&format=json&ucstart=2012-01-15T22:11:55Z&ucuserprefix=92.41.31. 
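The usercontribs query in the 21:17 question can be built from its parameters instead of being hand-edited. The sketch below uses the parameter names and values from that URL. The helper name is hypothetical. The reply at 21:20 suggests that, with `ucuserprefix`, results are ordered by user and then by timestamp, so `ucstart` may not behave like a single global cut-off. The sketch only builds the URL and does not change that behavior.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def usercontribs_url(userprefix, start, limit=500):
    """Build a list=usercontribs query URL (hypothetical helper).

    Parameter names are the ones used in the URL from the log;
    urlencode takes care of escaping (e.g. ':' in the timestamp).
    """
    params = {
        "action": "query",
        "list": "usercontribs",
        "uclimit": limit,
        "format": "json",
        "ucstart": start,
        "ucuserprefix": userprefix,
    }
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(usercontribs_url("92.41.31.", "2012-01-15T22:11:55Z"))
```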
[21:18:41] You probably can't use those together [21:19:02] hmm, the docs don't say [21:19:23] I know, though they sort of hint at it [21:20:09] From reading the code it looks like it should work but sort by user then by timestamp [21:21:58] RECOVERY - Puppet freshness on gilman is OK: puppet ran at Wed Feb 15 21:21:36 UTC 2012 [21:24:01] New patchset: Hashar; "dumb test of gerrit / jenkins integration" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2617 [21:24:49] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.230 seconds [21:24:49] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.261 seconds [21:25:30] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/24/ (1/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2617 [21:25:31] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/25/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2617 [21:28:29] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/26/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2617 [21:28:34] PROBLEM - Disk space on mw44 is CRITICAL: DISK CRITICAL - free space: /tmp 9 MB (0% inode=87%): [21:28:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:31:16] RECOVERY - Disk space on mw44 is OK: DISK OK [21:32:46] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.652 seconds [21:32:55] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.961 seconds [21:40:22] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/27/ (2/2)" 
[test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2617 [21:40:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:40:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:43:34] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.363 seconds [21:47:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:22] New review: jenkins-bot; "Build Started https://integration.mediawiki.org/ci/job/MediaWiki-GIT-Fetching/28/ (2/2)" [test/mediawiki/core2] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2617 [21:51:31] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.333 seconds [21:59:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:55] RECOVERY - MySQL Slave Delay on db31 is OK: OK replication delay 0 seconds [22:05:01] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 188 seconds [22:05:19] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 203 seconds [22:06:20] quick question, how do you edit a Wikipedia page that is +400,00 bytes, without your browser crashing? [22:07:44] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (18966) [22:08:29] there are five non-free files listed at http://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Register_of_Historic_Places/Images_without_refnum It is outside of articlespace, so it fails the non-free use criteria. However, I can't edit the page to remove them because it causes my browser to crash [22:10:57] there's some weird with a sul account [22:11:01] anyone interested? [22:17:18] https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AWiki-Sticker.JPG vs. 
https://commons.wikimedia.org/wiki/File:Wiki-Sticker.JPG#filehistory But: there was no file version on 20:41. [22:17:18] At that time I told the user to use the "upload" link on top of the desc page (there was written that the file page exists but not file version has been uploaded). I guess he did (according to his answer to me) but the log / file version times are totally off. [22:29:46] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2617 [22:29:47] Change merged: Hashar; [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2617 [22:44:31] my comments from above are now at https://bugzilla.wikimedia.org/show_bug.cgi?id=34427 [22:46:16] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.885 seconds [22:48:09] New patchset: Andre Engels; "My files; current status" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2618 [22:50:10] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:52:52] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.113 seconds [22:57:09] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.994 seconds [22:57:09] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:57:27] * AaronSchulz waits for Godot [22:57:36] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.910 seconds [23:01:30] !log disabling fundraising queue consumption on Aluminium due to jenkins build failure [23:01:30] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:01:32] Logged the message, Master [23:01:39] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:01:48] * AaronSchulz read 3:02pm on his watch [23:02:38] robla: are test2 uploads ok? [23:03:06] chrismcmahon: did you get around to testing uploads on test2? 
[23:03:30] robla: yes, I uploaded a big photo and messed with it. [23:03:58] AaronSchulz: there you have it. :) [23:04:06] one photo uploaded....ship it! [23:04:17] some final checks first ... [23:04:39] too late...it's already syncing enwiki [23:05:29] Noooo! [23:05:31] hehe [23:05:32] ;) [23:05:41] robla: I'm working on a fix for the bug that Ian and Aaron found [23:05:50] robla: I want an env I can control to test out $wgMaxImageArea in 1.19 [23:05:57] (new feature) [23:05:58] I can cut a major corner IF you guys don't deploy 1.19 anywhere until I'm done [23:06:31] robla: and a bunch of other stuff :-) [23:06:41] hold your silos...I mean horses! [23:09:27] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.232 seconds [23:10:24] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [23:10:28] AaronSchulz: what do you think of maplebed's idea of having ms5 send the exact right sort of ETag header that Swift needs, when it's sending a cached file? [23:10:51] how would he accomplish this? [23:10:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.134 seconds [23:11:14] I don't know [23:11:22] * maplebed has no idea either. [23:11:24] :D [23:12:41] I'd tell you how to make it work... 
[23:12:59] but I'm afraid that if I just say "reread my first email", it might sound like gloating ;) [23:14:51] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:17:21] * AaronSchulz chuckles at https://mikewest.org/2008/11/generating-etags-for-static-content-using-nginx [23:17:47] maplebed: I can't wait till we can hit the scalars directly, and all these problems then go away [23:20:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:23:42] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 602s [23:24:08] Joan: Just to confirm your not crazy https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#.23REDIRECT_redirects_to_an_older_version_if_not_logged_in [23:24:09] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 631s [23:24:54] PROBLEM - MySQL replication status on db1025 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 674s [23:25:15] p858snake|l: I had a suspicion it was redirect-related. [23:25:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.223 seconds [23:30:45] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.415 seconds [23:32:25] waiting for one more commit before getting started [23:36:40] maplebed: does rewrite.py also get killed when the client disconnects? [23:36:54] not sure. my guess is yes. [23:37:02] so it isn't just doing the self.copyconn.close() line [23:37:49] hooray! my bad-thumb-finder-and-purger works! [23:38:22] honestly I'm not sure when rewrite.py dies. [23:38:30] or how much control we have after the client disconnects. [23:38:47] * AaronSchulz really really hates this code [23:38:51] (which is why I love the etag thing - it offloads all that post-put logic into swift rather than in rewrite.) 
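The "etag thing" works as follows. For a plain (non-manifest) object, Swift's ETag is the MD5 hex digest of the body. If a PUT carries an ETag header that does not match the data Swift receives, Swift rejects the write with 422 Unprocessable Entity. The check therefore runs inside Swift rather than as post-PUT logic in rewrite.py. The sketch below shows the hashing and both options discussed in this log: sending the ETag up front, or comparing the ETag in the PUT response and deleting the object on a mismatch. Whether ms5 can supply the hash before the thumbnail is streamed is the open question above. The helper names are hypothetical.

```python
import hashlib


def swift_etag(body: bytes) -> str:
    """MD5 hex digest of an object body, which Swift reports as the
    ETag of a plain (non-manifest) object."""
    return hashlib.md5(body).hexdigest()


def put_headers(body: bytes) -> dict:
    """Headers for a PUT that asks Swift to verify the upload.

    If the supplied ETag does not match the MD5 of the data Swift
    actually receives, the PUT fails (422) instead of storing a
    truncated object.
    """
    return {"ETag": swift_etag(body), "Content-Length": str(len(body))}


def etag_matches(body: bytes, response_etag: str) -> bool:
    """Fallback check after the PUT: compare the ETag from the response
    with the local MD5 (response ETags may be wrapped in quotes).
    On False the caller would delete the object it just wrote."""
    return response_etag.strip('"').lower() == swift_etag(body)
```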
[23:38:51] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:38:51] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:40:01] AaronSchulz: you can see client aborts in the logs on ms-fe1 [23:40:39] RECOVERY - MySQL replication status on db1025 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [23:41:09] I'm not sure what the full data path is, but probably a subclass of urllib2.HTTPError gets thrown [23:41:20] when I tested server aborts, that's what happened [23:42:19] there's a try/except block wrapping the whole request handler that will catch any uncaught exceptions, but I didn't see any log entries from that [23:42:26] so it's probably being caught somewhere else in the call stack [23:44:51] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [23:45:27] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [23:47:37] !log re-enabled fundraising queue consumption after exorcism and rain dance [23:47:39] Logged the message, Master [23:50:51] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.032 seconds [23:50:51] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.043 seconds [23:52:03] PROBLEM - RAID on search10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:52:18] !log reedy synchronized php-1.19/resources/mediawiki/mediawiki.js 'r111599' [23:52:20] Logged the message, Master [23:53:20] !log reedy synchronized php-1.19/includes/OutputPage.php 'r111599' [23:53:22] Logged the message, Master [23:54:36] problem is gone [23:56:02] alright, time for mediawiki.org? [23:56:23] TimStarling, Reedy, AaronSchulz? 
[23:56:29] yes [23:56:42] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [23:56:43] i changed it in wikiversions.dat earlier [23:56:45] Reedy is master button presser right? [23:56:49] didn't sync it, obviously [23:56:49] haha [23:57:18] RECOVERY - RAID on search10 is OK: OK: 1 logical device(s) checked [23:57:23] * Reedy hits enter [23:57:27] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:57:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:57:45] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [23:57:56] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: mw.org to 1.19wmf1 [23:57:58] Logged the message, Master [23:59:55] Feb 15 23:59:34 10.0.2.191 apache2[2032]: PHP Warning: filemtime() [function.filemtime]: stat failed for /usr/local/apache/common-local/php-1.17/extensions/WikiEditor/modules/./images/toc/grip.png in /usr/local/apache/common-local/php-1.18/includes/resourceloader/ResourceLoaderFileModule.php on line 380
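The repeated `filemtime(): stat failed` warnings come from 1.18 code that stats files under the removed php-1.17 tree. RoanKattouw attributes this to stale entries in ResourceLoader's dependency cache and plans to remove the old file references from the DB after 1.19 is fully deployed. The sketch below, in Python rather than PHP, shows one way to compute a "newest mtime" over such a dependency list. It collects missing paths so that they can be logged once and pruned, instead of emitting a warning on every request. It is hypothetical and is not ResourceLoaderFileModule's code.

```python
import os


def newest_mtime(paths):
    """Return (max_mtime, missing_paths) for a list of dependency files.

    Paths that no longer exist (e.g. references to a removed php-1.17
    tree) are collected instead of raising, so the caller can log them
    once and drop them from the stored dependency list. max_mtime is
    0.0 when no listed file exists.
    """
    newest = 0.0
    missing = []
    for path in paths:
        try:
            newest = max(newest, os.stat(path).st_mtime)
        except FileNotFoundError:
            missing.append(path)
    return newest, missing
```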