[00:00:20] * anomie notes that CodeEditor does not seem to like Monobook [00:00:56] New review: Legoktm; "Patch Set 2:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49704 [00:02:06] New patchset: Legoktm; "(bug 45083) Enable AbuseFilter IRC notifications on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49704 [00:16:30] anomie: Really? I haven't had any troubles on mediawiki.org using Monobook and CodeEditor. [00:16:34] What problems are you seeing? [00:17:09] Susan- Maybe it's something else then. "TypeError: $(...).data(...) is undefined" [00:17:57] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [00:18:57] https://en.wikipedia.org/wiki/Module:Bananas [00:19:01] (o: [00:20:06] TimStarling: Congrats on the deployment. :-) [00:20:16] thank you [00:21:26] https://en.wikipedia.org/w/index.php?title=Special%3AAllPages&from=&to=&namespace=828 [00:21:35] I guess that's "first!" [00:21:48] TimStarling: There was a request on WP:VPT about enabling importing from test2wiki. [00:21:51] Not sure if you saw it. [00:21:57] fp lol wtf bbq [00:22:13] https://en.wikipedia.org/wiki/Wikipedia:VPT#Import_from_test2.wikipedia.org [00:23:26] we should probably allow transwiki import from anywhere to anywhere [00:24:41] Yes, eventually. I think there are still some lurking transwiki bugs. [00:25:08] before centralauth, it made sense to restrict it [00:25:18] because it allowed you to forge your identity [00:25:28] Susan- I found the problem. One of my userscripts was blowing up on the lack of a Preview button in the Module namespace. Worked in Vector since I don't have that script there. D'oh. [00:26:03] Yeah, the lack of a preview button is a bit trippy. [00:26:13] And the tab key behavior is still driving me a little wild. [00:26:23] you mean it makes a tab? [00:26:25] There's a bug about making the escape key at least remove the focus from the textarea. [00:26:29] Yes. [00:26:36] that was the main reason I wanted CodeEditor, so that you could make tabs [00:26:50] Right. I have some deep muscle memory for using the tab key to switch from the textarea to the submit buttons, though. [00:26:59] Or the edit summary field. [00:27:14] So when I'm ready to move forward, I end up flailing. And eventually reaching for my mouse. [00:27:30] And usually cursing in the process. [00:27:53] https://bugzilla.wikimedia.org/show_bug.cgi?id=39649 is the relevant bug. 
[00:29:41] New patchset: Tim Starling; "Allow import from test2 to enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49794 [00:30:06] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [00:30:07] New review: Tim Starling; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49794 [00:30:08] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49794 [00:30:50] !log tstarling synchronized wmf-config/InitialiseSettings.php [00:30:51] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds [00:30:52] Logged the message, Master [00:31:00] RECOVERY - Puppet freshness on labstore2 is OK: puppet ran at Tue Feb 19 00:30:50 UTC 2013 [00:39:07] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 24 seconds [00:42:28] https://test2.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=protocols - this is fixed in master, does it need to be backported and deployed? https://gerrit.wikimedia.org/r/#/c/49744/ [00:42:54] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [00:42:56] Logged the message, Master [00:44:42] TimStarling: Can you add srv193 back to the mediawiki-installation group please? Guessing it got removed when it's siblings were decomissioned [00:45:47] done [00:46:08] thanks [00:46:09] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [00:46:12] * Reedy waits for sync-common to run [00:46:26] Krenair- It would need to be, yes [00:50:30] Cherry-picked to 1.21wmf10 - https://gerrit.wikimedia.org/r/49795 [00:52:09] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [00:53:48] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms [00:57:33] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [00:59:45] anomie, ? [00:59:57] Krenair- ? [01:02:28] anomie, https://gerrit.wikimedia.org/r/49795 - I guess it needs someone with shell access to deploy it now [01:06:31] Krenair- Yeah. I can do that. 
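The siprop=protocols breakage discussed above is easy to spot-check from the command line once the backport is synced. The sketch below is illustrative only and not part of the log: the API URL is the one Krenair quotes, while the JSON handling and the expectation that a fixed response carries a `query.protocols` list are assumptions.

```python
#!/usr/bin/env python3
"""Probe the siteinfo 'protocols' property discussed above.

Illustrative sketch: the endpoint is the one quoted in the log; the
format=json parameter and the success criterion are assumptions.
"""
import json
import sys
from urllib.request import urlopen

API = "https://test2.wikipedia.org/w/api.php"
QUERY = "?action=query&meta=siteinfo&siprop=protocols&format=json"

def main():
    with urlopen(API + QUERY) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    # A healthy response lists the allowed external-link protocols under
    # query.protocols; the broken revision rendered this differently
    # (the log only calls it "ugly").
    protocols = data.get("query", {}).get("protocols")
    if protocols:
        print("looks fixed, protocols advertised:", ", ".join(protocols))
        return 0
    print("still looks broken, response was:", json.dumps(data)[:200])
    return 1

if __name__ == "__main__":
    sys.exit(main())
```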
[01:12:39] !log anomie synchronized php-1.21wmf10/includes/api/ApiQuerySiteinfo.php 'Backport fix for ugly API bug' [01:12:43] Logged the message, Master [01:13:37] Krenair- There you go [01:26:30] PROBLEM - Puppet freshness on sq41 is CRITICAL: Puppet has not run in the last 10 hours [01:28:54] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:40:46] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [01:42:34] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds [01:43:55] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 223 seconds [01:44:58] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [01:50:49] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [01:57:16] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [01:58:19] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [02:09:16] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [02:16:01] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [02:27:25] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:28:26] !log LocalisationUpdate completed (1.21wmf9) at Tue Feb 19 02:28:25 UTC 2013 [02:28:28] Logged the message, Master [02:28:55] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:32:31] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [02:33:49] robla: around? [02:34:30] TimStarling: or you? [02:34:35] Hi Danny_B. [02:34:49] yes [02:34:54] hi always different girl [02:35:22] max [02:35:57] Danny_B: We thought it curious that most of the cs wikis wanted Scribunto, but not cs.wikipedia. [02:36:22] well, like I said, it's obvious enough why that was [02:37:05] i've sent czech namespace names to robla [02:37:15] but he perhaps forgot [02:37:15] I replied by email [02:37:27] ah... [02:37:29] to you and robla [02:37:54] * Danny_B felt asleep unfortunatelly because of being sick, just woke up now at 3:30am [02:38:53] re cswiki: they are very very conservative, they don't even like wikidata, most of the active users keep monobook etc... [02:40:01] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [02:41:53] TimStarling: how about that "script" name? various ppl asked why it is not that since module is sort of ambiguous and not descriptive for them (quoting) [02:42:54] you mean it's ambiguous in czech? or it's ambiguous in english? [02:43:58] I think the issue with the term "script" is that they aren't scripts [02:47:19] makes a sense, i'll try to forward that [02:49:02] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [02:49:33] !log LocalisationUpdate completed (1.21wmf10) at Tue Feb 19 02:49:32 UTC 2013 [02:49:35] Logged the message, Master [02:53:06] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [03:02:01] New review: Danny B.; "Patch Set 1: Code-Review+1" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/49681 [03:02:36] can somebody sync ^^ pls? [03:04:18] weird [03:06:04] it's already the same, isn't it? 
[03:06:23] i still see the old one [03:06:45] which is same as wp [03:06:55] = undistinguishable [03:07:18] yes, same as wikipedia [03:07:30] enwikt has different [03:08:14] (i actually think all wikts should have it, but someone has to decide that) [03:10:02] it seems ridiculous to me, to file a separate bug for each wiki and to list every single wiktionary in that file [03:10:57] same to me. but as i said above - someone else has to decide on central basis [03:11:25] well, I'm not going to push that out [03:12:21] # Test for making favicon.ico a script [03:12:21] RewriteRule ^/test-favicon\.ico$ /w/favicon.php [L] [03:12:34] argh [03:13:17] I had forgotten that that project was incomplete [03:13:39] incomplete = ? [03:23:59] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [03:32:57] TimStarling: what reason i should forward for not being set? wontfix / later / other? [03:37:27] I just have better things to do at the moment, maybe someone else will deploy it [03:38:50] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 194 seconds [03:39:08] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 206 seconds [03:39:53] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 239 seconds [03:40:02] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 245 seconds [03:42:26] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 7 seconds [03:42:44] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 2 seconds [03:43:29] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 29 seconds [03:43:38] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 30 seconds [03:43:44] TimStarling: ah ok, no prob. thought you actually rejected it. 
thx [03:52:42] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 182 seconds [03:53:00] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 192 seconds [03:58:06] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [03:59:18] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 190 seconds [03:59:18] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 190 seconds [04:01:42] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [04:07:24] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [04:33:20] New patchset: Tim Starling; "Move all favicons to bits" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49802 [04:34:24] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [04:37:33] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [04:38:27] PROBLEM - Puppet freshness on sq79 is CRITICAL: Puppet has not run in the last 10 hours [04:39:30] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [04:42:29] New patchset: Tim Starling; "(bug 45113) Set cswiktionary favicon to the same as enwiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49681 [04:43:10] New review: Tim Starling; "Patch Set 2:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49681 [04:46:51] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [04:48:48] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [05:19:42] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [06:04:44] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 23 seconds [06:06:32] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [06:19:17] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 8 seconds [06:19:44] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [07:33:37] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: CRIT replication delay 185 seconds [07:33:46] PROBLEM - MySQL Slave Delay on db46 is CRITICAL: CRIT replication delay 191 seconds [07:39:01] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay 0 seconds [07:39:19] RECOVERY - MySQL Slave Delay on db46 is OK: OK replication delay 0 seconds [08:05:01] PROBLEM - Puppet freshness on mw48 is CRITICAL: Puppet has not run in the last 10 hours [08:13:16] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [08:16:43] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [08:25:08] hello [08:28:22] hey :) [08:32:57] paravoid: I noticed your email about some contint module :-D [08:33:14] guess it is a bit too late for your to merge it now -:] but maybe tommorrow [08:33:48] how is your jet lag ? 
:-] [08:57:22] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 212 seconds [08:57:40] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 220 seconds [09:03:04] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 3 seconds [09:03:04] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 2 seconds [09:09:08] New patchset: Hashar; "Jenkins reprepo broken cause of double redirect" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49807 [09:11:43] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49807 [09:11:54] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49807 [09:13:48] aptmethod error receiving 'http://mirrors.jenkins-ci.org/debian-stable/binary/Release.gpg': [09:13:49] '404 Not Found' [09:13:49] There have been errors! [09:13:54] that's from checkupdate [09:13:58] :-( [09:14:09] aptmethod error receiving 'http://mirrors.jenkins-ci.org/debian-stable/binary/Release': [09:14:10] '404 Not Found' [09:14:17] not that either, so I guess http wo't get it [09:14:20] so the mirrors is not a package repository :( [09:14:30] ah maybe not [09:14:56] http://mirrors.jenkins-ci.org/debian-stable/ [09:15:03] yeah that's the dir with stuff in it [09:15:37] yeah it only got the packages [09:15:47] yep [09:15:56] the commit is wrong [09:16:04] New patchset: Hashar; "Revert "Jenkins reprepo broken cause of double redirect"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49808 [09:16:12] the redirects thing is fixed in the reprepro that we're running [09:16:14] apergos: and here is the revert https://gerrit.wikimedia.org/r/#/c/49808/ [09:16:38] you're supposed to review apergos ;) [09:16:44] New review: Hashar; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49808 [09:17:11] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49808 [09:17:22] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49808 [09:17:24] off for the night [09:17:29] bye [09:17:29] ;-D [09:17:34] have a good night Faidon! [09:18:00] apergos: so apparently there is an updated reprepo somewhere :-] [09:18:03] night [09:18:08] ah [09:18:21] on which machine have you tried repress ? [09:18:28] repress? 
[09:18:29] grrr [09:18:39] DAMN YOU APPLE AUTO SPELL CHECKER [09:18:39] I was on brewster (apt.wm.o) [09:19:01] according to Faidon: "the redirects thing is fixed in the reprepro that we're running" [09:19:04] but then hmm [09:22:46] apergos: I guess I will bug that , probably nothing we can do [09:22:57] ok, thanks [09:23:03] I don't have access on brewster to debug that :-D [09:23:38] 40 minutes to figure out the system that is supposed to makes life easier [09:23:42] :-( [09:24:10] root@brewster:~# apt-show-versions reprepro [09:24:11] reprepro/lucid-wikimedia uptodate 4.12.5-1~lucid1 [09:24:20] just so you have that [09:24:20] ah good [09:24:25] will post to ops list [09:24:31] Ibrewster runs lucid) [09:39:52] apergos: mail sent + bug opened to track the issue :-] [09:39:57] apergos: sorry for all the trouble :( [09:40:32] saw it [09:40:34] no worries [09:40:44] (saw the mail that is) [09:41:30] * apergos is trying the 'eat soup and see if it stays in' experiment today after yesterday's disasterous 'eat dinner' failure [09:44:53] will grab a coffee [09:44:55] and relax :-D [10:09:51] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [10:18:51] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [10:34:18] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 187 seconds [10:34:54] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 195 seconds [10:38:30] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [10:39:42] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [10:42:23] New patchset: Hashar; "jenkins: make Git plugin verbose" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49814 [10:42:46] would get that post lunch :-] [10:58:45] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [11:01:09] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 205 seconds [11:01:45] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 219 seconds [11:10:21] out for lunch [11:28:06] PROBLEM - Puppet freshness on sq41 is CRITICAL: Puppet has not run in the last 10 hours [11:29:34] !log installing package upgrads on neon/icinga [11:29:36] Logged the message, Master [12:10:21] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [12:24:36] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [12:25:12] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [12:28:48] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [12:33:00] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [12:33:54] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:44] !og powercycling ms-be7 (unresponsive to ping or from console) [12:38:36] apergos: l [12:38:55] ? [12:39:05] you missed the l [12:39:09] ah [12:39:14] * apergos tries again [12:39:22] !log powercycling ms-be7 (unresponsive to ping or from console) [12:39:23] Logged the message, Master [12:42:09] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [12:48:19] New patchset: Dzahn; "add wikimedia.jp.net and wikimediacommons.jp.net server aliases and redirects (RT-4356 etc.)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49822 [12:57:31] gerrit issue? 
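The reprepro checkupdate failure above boiled down to 404s on Release and Release.gpg: the Jenkins mirror directory only carries packages, not apt repository metadata. A small sanity check along these lines makes that diagnosis quickly; the two URLs are the ones reprepro printed, everything else in the sketch is illustrative.

```python
#!/usr/bin/env python3
"""Check whether a URL prefix actually serves apt repository metadata.

The paths probed are the ones reprepro complained about in the log;
a 404 on Release means "directory of .debs, not an apt repo", which is
the conclusion reached in the discussion above.
"""
from urllib.request import Request, urlopen
from urllib.error import HTTPError

BASE = "http://mirrors.jenkins-ci.org/debian-stable/binary"

def status(url):
    """Return the HTTP status code for a HEAD request to url."""
    try:
        with urlopen(Request(url, method="HEAD")) as resp:
            return resp.status
    except HTTPError as err:
        return err.code

for name in ("Release", "Release.gpg"):
    url = "%s/%s" % (BASE, name)
    print(url, "->", status(url))
```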
[12:58:12] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [13:01:55] works again [13:09:50] New review: Dzahn; "Patch Set 1: Code-Review+2" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49822 [13:10:26] hashar: apache lint gives me Verified +1 but it still needs Verified +2 manually. is that as intended? [13:10:38] mutante: probably not [13:11:05] I guess it should V+2 just in ops/puppet.git [13:11:39] oh wait, it does say Verified+2 in the comment [13:11:48] but Gerrit says "Need Verified" when i submit [13:12:14] doesn't merge unless i do this now: [13:12:23] New review: Dzahn; "Patch Set 1: Verified+2" [operations/apache-config] (master); V: 2 - https://gerrit.wikimedia.org/r/49822 [13:12:24] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49822 [13:12:56] mutante: https://gerrit.wikimedia.org/r/49824 :-D [13:13:07] usually we vote V+1 on lint [13:13:11] and V+2 on unit tests [13:13:55] it's amazing how you always have an existing patch already :) [13:14:05] we did that for ops/puppet :-] [13:14:43] I have deployed the change [13:15:14] ah:) thanks [13:15:24] ideally I would want to run Jeff's integration test on Jenkins [13:16:02] should be possible by loading an apache process listening on port 80 on some loopback address (such as 127.0.0.80) [13:17:14] nods [13:20:09] !log gracefulling eqiad Apaches via dsh to push redirect change for .jp.net domains [13:20:10] Logged the message, Master [13:25:16] dzahn is doing a graceful restart of all apaches [13:25:54] !log dzahn gracefulled all apaches [13:25:56] Logged the message, Master [13:27:48] !log DNS update - adding wikimedia.jp.net [13:27:50] Logged the message, Master [13:29:21] !log wikimedia.jp.net now redirects to wikimedia.org | wikimediacommons.jp.net now redirects to commons.wm [13:29:22] Logged the message, Master [13:52:02] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 191 seconds [13:52:20] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 197 seconds [14:08:19] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [14:08:23] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [14:08:23] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [14:08:32] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [14:10:20] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 189 seconds [14:10:56] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [14:11:05] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 202 seconds [14:13:58] !g Ic166d23a78d9bd8e3d0fdc5017da4aabf28dbca9 [14:13:58] https://gerrit.wikimedia.org/r/#q,Ic166d23a78d9bd8e3d0fdc5017da4aabf28dbca9,n,z [14:20:23] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [14:22:27] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [14:23:11] mutante: there around ? 
:-D [14:23:22] yea [14:23:24] mutante: I got a hack of a hack for you :-] [14:23:25] https://gerrit.wikimedia.org/r/#/c/49708/ [14:23:34] role::cache::upload defaults to using squid [14:23:42] but sometime have varnish instead :-] [14:23:57] the hack would let us get varnish in labs instead of squid :-] [14:24:25] hopefully the A or B or C is not screwed hehe [14:24:48] but I am not sure whether you are willing to +2 a change related to varnish [14:25:00] another trivial one would be https://gerrit.wikimedia.org/r/49814 [14:25:09] which pass some argument to the jenkins service on startup :-] [14:27:49] confirmed that Jenkins options for verbose by quick search ;) [14:28:10] New review: Mark Bergsma; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49708 [14:28:20] ;D [14:28:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49708 [14:28:26] oh and mark is around \O/ [14:28:31] even better:) yay [14:28:36] just in time haha [14:28:56] mark: ideally I would have refactored the hack to something nicer but I am feeling lazy :-] [14:29:01] it's a hack already [14:29:02] New review: Dzahn; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49814 [14:29:04] it's going away soon [14:29:07] \O/ [14:29:10] so why care if you make it 20% worse [14:29:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49814 [14:29:17] I tried to get Lucid installed on a Precise box but that does not work [14:29:19] hashar: we should just watch logfile sizes was my only comment [14:29:23] why bother [14:29:28] upload squids are going away in weeks [14:29:48] yeah so eventually I found out that we have upload caches using varnish [14:29:58] eventually? ;) [14:30:03] there have been many mails about that hehe [14:30:06] <-- did not know [14:30:07] probably blog posts even [14:30:16] yeah I am sure about that [14:30:26] I wasn't sure of the progress [14:31:16] I will "just" have to tweak some lvs::configuration::lvs_service_ips stuff now :-] [14:38:30] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [14:39:33] PROBLEM - Puppet freshness on sq79 is CRITICAL: Puppet has not run in the last 10 hours [14:40:36] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [14:56:22] New patchset: Hashar; "beta: lvs upload IP set to the apache backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49830 [14:56:54] manganese has a huge load :( [14:56:55] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=manganese.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [14:57:01] started roughly 20 minutes ago [14:57:12] makes gerrit a bit slow :-] [14:58:00] hashar: saw that earlier, it was pretty busy for a few minutes, then calmed down again by itself [14:58:03] !log manganese CPU spike ongoing. Started roughly at 14:40 UTC [14:58:05] Logged the message, Master [14:58:13] I am wondering what was running there [14:58:22] it was then working on something in mw core git [14:58:27] hmm [14:58:35] maybe a garbage collection [14:58:46] and shortly after a bunch of reviews appeared at once in #mediawiki via gerrit-wm [14:59:38] mark: if you are still around, I could use a hack to the lvs::configuration hash for labs. We don't have LVS there so would need to specify the backends directly .. 
https://gerrit.wikimedia.org/r/#/c/49830/1/manifests/lvs.pp,unified [14:59:51] hashar: 05:10 < gerrit-wm> New review: Parent5446; "Patch Set 3: Code-Review-1" [mediawiki/core] (master) C: -1; - https://gerrit.wikimedia.org/r/45651 [14:59:54] 05:10 < gerrit-wm> New review: Parent5446; "Patch Set 9: Code-Review-1" [mediawiki/core] (master) C: -1; - https://gerrit.wikimedia.org/r/27022 [15:00:23] they are just reviews, should not cause any troubles [15:00:25] hashar: there were a whole bunch all with the same "Parent5446" at once, and it seemed like it took a while then finished them at once [15:00:27] though nobody knows :-] [15:00:30] <^demon> hashar: There's no garbage collection. [15:00:32] maybe ^demon was working on manganese [15:00:38] <^demon> Nor was I logged in. [15:00:44] okkk [15:00:57] so we just had a 20min CPU spike on manganese. [15:01:00] PROBLEM - MySQL Slave Delay on db44 is CRITICAL: CRIT replication delay 186 seconds [15:01:03] see gerrit-wm around 05:10 [15:01:10] might not be that much of an issue [15:01:28] but if that occurs often we will want to find out the root cause. [15:01:36] PROBLEM - MySQL Slave Delay on db1005 is CRITICAL: CRIT replication delay 196 seconds [15:01:38] I just hope it is not Jenkins putting too much pressure on the host [15:02:12] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 237 seconds [15:02:13] !log restarting jenkins to enable some startup setting [15:02:15] Logged the message, Master [15:02:16] all i saw it was java and high load that went down again.. yep [15:02:48] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: CRIT replication delay 240 seconds [15:04:08] <^demon> What the fuck? [15:04:15] <^demon> The hell is this. [15:05:30] New review: Mark Bergsma; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49830 [15:05:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49830 [15:06:06] \O/ [15:06:25] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 3 seconds [15:07:00] RECOVERY - MySQL Slave Delay on db1005 is OK: OK replication delay 1 seconds [15:07:32] <^demon> [2013-02-19 14:55:54,775] WARN org.eclipse.jetty.io.nio : Dispatched Failed! SCEP@151bbc99{l(/127.0.0.1:47351)<->r(/127.0.0.1:8080),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}-{AsyncHttpConnection@4ffaeba2,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0} to org.eclipse.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@3699d37e [15:07:48] <^demon> ^ Spammed the logs massively up until about 12m ago. [15:08:31] <^demon> qchris: Does that error make *any* bit of sense to you? [15:08:53] Nope. [15:09:25] What's on port 47351? [15:09:34] a random client port ? [15:09:56] hashar: :-) [15:09:57] I am wondering if it could be caused by some bot spamming the json api [15:09:59] <^demon> qchris: Nothing I've setup for gerrit, that's for sure... [15:10:15] Can it be apache forwarding the request to gerrit? [15:11:11] <^demon> [2013-02-19 12:55:22,464] was the timestamp of the first instance. [15:11:20] <^demon> [2013-02-19 14:55:54,775] is when it stopped. [15:11:46] It stopped on its own, or you killed some process/connection? [15:12:27] <^demon> I didn't kill anything, it had stopped already by the time I started tailing the log. 
[15:13:49] https://groups.google.com/forum/?fromgroups=#!topic/repo-discuss/TVhBSBHEJ9E [15:13:56] ^ Suggests to switch away from Jetty [15:16:34] <^demon> Can of worms I'd rather not open :) [15:16:45] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [15:17:21] PROBLEM - MySQL Slave Delay on db44 is CRITICAL: CRIT replication delay 184 seconds [15:17:45] However in that thread, Shawn Pearce says "This is a known problem with Jetty" [15:18:33] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 191 seconds [15:18:53] <^demon> qchris: So the alternative is "use Tomcat?" :\ [15:19:01] :-)) [15:19:10] Or to live with the problem. [15:19:23] <^demon> Well, this is the first time this has ever hit us that I know of. [15:19:53] Once a year is not too bad is it? [15:21:27] <^demon> It's just jetty exploding. As long as it doesn't affect jgit and leave the repos in some weird inconsistent state... [15:34:06] New patchset: Hashar; "beta: lvs fake config for upload caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49831 [15:43:36] <^demon> qchris: https://code.google.com/p/gerrit/issues/detail?id=233 is such an annoying exception :\ [15:43:42] <^demon> Too bad there's no easy fix. [15:44:24] Yup. [15:44:44] So just to rule out other causes: It was always the same message [15:45:04] (same AsyncHttpConnection etc) [15:45:50] <^demon> The general error was the same, but some of the sha1s were a bit different. As were the ports. [15:46:08] The Jetty part on the Scaling page of gerrit Wiki even mentions our problem :-( [15:46:20] <^demon> I can pastebin a wider selection of the log (had to manually rotate already, it was 1G and the compression was blocking other processes) [15:46:31] Not necessary. [15:46:48] Thanks nontheless. [15:47:25] <^demon> http://p.defau.lt/?eTUz9C_CtCKxU2ijcUTE5g - here were the last 1000 entries. [15:50:35] <^demon> qchris: We may have very well outgrown the box we're on. We may need more resources and/or use tomcat. [15:50:57] <^demon> It was a beefy box at the time, but our gerrit usage has grown dramatically since then. [15:51:01] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [15:51:15] Do we have some data on the machine's load etc? [15:51:36] <^demon> Yeah, in ganglia. [15:51:50] Ja, sure. Sorry :-( [15:51:57] <^demon> eg: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=manganese.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [15:52:31] mutante: my dad still uses a fax :) [15:54:55] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [15:54:59] jeremyb_: hehe, for a while i had some SIP provider offering me FoIP ,but even that ..never used anyways http://en.wikipedia.org/wiki/Internet_fax#Fax_using_Voice_over_IP [15:56:20] highlights "Not all fax equipment claiming T.38 compliance works reliably;" ..yea, as usual :p [15:56:25] mutante: i assume that's just t.38? [15:56:30] which makes it even more annoying [15:56:40] back later tonight [15:58:27] jeremyb_: yea,that, plus letting you upload PDFs or even use a virtual printer driver [15:58:39] yeah [15:58:44] someone want to comment on RT 2675 ? [15:59:17] btw, just to be clear: unread msgs in RT are unread per user? or marking read is read for everyone? 
[16:02:00] jeremyb_: per user, because it's a setting you can change in your profile [16:02:03] Notify me of unread messages [16:02:10] oh, huh [16:02:11] Yes/No/Default [16:08:42] ^demon: When browsing through ganglia, I do not get the impression that we have outgrown the machine yet :-( [16:09:07] <^demon> Perhaps not. [16:10:44] So should we prepare a tomcat test setup, or do we rather wait and see if it occurs again? [16:11:26] seriously chad [16:11:35] a box like manganese should be AMPLE for something like gerrit [16:20:52] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [16:23:47] <^demon> mark: yeah, nevermind me. [16:24:46] but the same could be said for e.g. puppet... [16:25:07] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [16:47:54] <^demon> qchris: For now, we'll just keep an eye on it. If it becomes a problem, we can start playing with Tomcat. [16:48:05] <^demon> Hopefully this was just a one-off thing. [16:48:09] Ok. Sounds good. [17:28:50] New patchset: Silke Meyer; "Fixing broken dependencies for updating mw-extensions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49836 [17:29:18] Change abandoned: Silke Meyer; "Sorry. I redid that change on a fresh clone..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48474 [18:03:21] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [18:06:12] PROBLEM - Puppet freshness on mw48 is CRITICAL: Puppet has not run in the last 10 hours [18:07:42] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [18:13:54] New review: Andrew Bogott; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49232 [18:14:05] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49232 [18:16:02] New review: Andrew Bogott; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49836 [18:16:12] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49836 [18:17:46] New review: Andrew Bogott; "Patch Set 4: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48979 [18:17:55] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48979 [18:18:00] paravoid, yt? [18:24:50] MaxSem: hey [18:25:20] paravoid, what did the Solr affar end with? I still see the old plugin in Nagios [18:27:21] the "attempt 2" is merged afaik [18:27:33] but puppet not run? 
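^demon bracketed the Jetty "Dispatched Failed!" spam above by reading timestamps out of a log that had grown to roughly 1 GB before he rotated it by hand. A throwaway script like the following does the same bracketing and shows the per-minute rate; it is purely illustrative — the log path and the `[YYYY-MM-DD HH:MM:SS,mmm]` timestamp format are assumptions based on the excerpts pasted in the channel.

```python
#!/usr/bin/env python3
"""Bracket a recurring warning in a Gerrit/Jetty error log.

Assumptions (not from the log itself): the file path, and that lines
begin with a "[YYYY-MM-DD HH:MM:SS,mmm]" timestamp like the pasted
excerpts.
"""
import collections
import re
import sys

LOG = sys.argv[1] if len(sys.argv) > 1 else "error_log"
PATTERN = "Dispatched Failed!"
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}\]")

first = last = None
per_minute = collections.Counter()

with open(LOG, errors="replace") as fh:
    for line in fh:
        if PATTERN not in line:
            continue
        m = STAMP.match(line)
        if not m:
            continue
        minute = m.group(1)
        per_minute[minute] += 1
        first = first or minute
        last = minute

if first is None:
    print("no '%s' entries found" % PATTERN)
else:
    print("first: %s  last: %s  total: %d"
          % (first, last, sum(per_minute.values())))
    # show the last few minutes of activity to see whether it tailed off
    for minute, count in sorted(per_minute.items())[-5:]:
        print("  %s  %d/min" % (minute, count))
```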
[18:28:01] not sure [18:28:06] let me force run puppet on spence [18:28:58] paravoid, don't [18:29:16] I forgot to rerererevert solr.pp [18:29:29] will do that shortly [18:29:45] dammit, it's hard to think at 6AM [18:30:05] :( [18:34:08] New patchset: MaxSem; "Changes forgotten in https://gerrit.wikimedia.org/r/#/c/49372/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49843 [18:35:06] paravoid, Guest54769 ^^ [18:52:15] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay 0 seconds [18:52:42] RECOVERY - MySQL Slave Delay on db44 is OK: OK replication delay 0 seconds [19:03:02] New patchset: Legoktm; "(bug 45165) Create rollbacker group for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49847 [19:12:54] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [19:50:03] paravoid: got any time ahead to merge the contint modules ? :-D [19:50:10] paravoid: though I guess it is almost lunch time in SF [19:53:40] GuidedTour broke again on the wikis that are using 1.21wmf10. [19:53:59] That's currently testwiki, test2wiki, MW, and Wikidata. [19:54:09] I'm going to do a submodule update and syncdir to fix it. [19:57:30] is GuidedTour enabled on wikidata? [19:58:27] duh, no. [19:58:33] ok [19:59:13] Those are just the wikis that have 1.21wmf10. [19:59:27] But we could probably enable it on Wikidata later if that is requested. [19:59:33] !log disabling project storage volumes for all labs projects with no instances [19:59:35] Logged the message, Master [20:04:11] !log mflaschen synchronized . 'Syncing GuidedTour submodule omitted when new branch cut was first cloned.' [20:04:12] Logged the message, Master [20:05:24] New review: Hashar; "Patch Set 1: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/49678 [20:13:58] hrm, when did notpeter's last name get the suffix? [20:14:13] the what? [20:14:19] the cloak? [20:14:20] Youngmeisterarius [20:15:06] oh, hahaha [20:15:13] https://bugzilla.wikimedia.org/show_bug.cgi?id=43663#c5 [20:15:18] I htink I demanded that someone else set up my bz account when i first started [20:15:24] and that was the result [20:15:59] do you get a crown? [20:17:06] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [20:17:33] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 194 seconds [20:19:39] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [20:21:10] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 17 seconds [20:22:31] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:22:39] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [20:26:42] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [20:48:18] New patchset: RobH; "fixing nagios false positive error for tower b in ps1-c3-sdtpa & ps1-d2-sdtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49915 [20:50:34] jenkins bot, why you so slowwwww [20:50:52] * RobH blames and pings ^demon cuz well, he feels like being a jerk. [20:50:59] ;] [20:51:15] RobH: might be zuul / jenkins [20:51:24] it did vote :-] [20:51:27] did everyone hate jenkinsbot outputting to channel? [20:51:36] yep, took awhile, i miss it pinging channel. 
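The GuidedTour breakage above came from a submodule that was never synced when the 1.21wmf10 branch was first cloned. A check like the one below flags submodules that are missing or sitting at the wrong commit, using the standard prefix characters of `git submodule status`; the checkout path is only an example, not a path taken from the log.

```python
#!/usr/bin/env python3
"""Flag submodules that are uninitialized or not at the recorded commit.

`git submodule status` prefixes each line with '-' (not initialized),
'+' (checked-out commit differs from the one recorded in the superproject)
or 'U' (merge conflicts); a leading space means everything matches.
The checkout path below is an illustrative default.
"""
import subprocess
import sys

CHECKOUT = sys.argv[1] if len(sys.argv) > 1 else "php-1.21wmf10"

out = subprocess.check_output(
    ["git", "submodule", "status"], cwd=CHECKOUT, universal_newlines=True
)

problems = [line for line in out.splitlines() if line and line[0] in "-+U"]
if problems:
    print("submodules needing attention:")
    for line in problems:
        print(" ", line)
    sys.exit(1)
print("all submodules match the recorded commits")
```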
[20:51:41] (i know its not done that for awhile now) [20:51:43] <^demon> RobH: Yuppp, jenkinsbot spams lots, so we hacked him out :) [20:51:51] put it bacccccckkkk [20:51:59] <^demon> More people will hate me. [20:52:05] but I will love you [20:52:09] put back for this place ? :-D [20:52:14] my love is worth the hate of millions. [20:52:53] New review: RobH; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49915 [20:53:01] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49915 [20:53:14] gerrit was faster when no one but ops was using it quite yet ;] [20:53:46] <^demon> gerrit sucked back then. [20:53:51] that might be because there are too many jenkins jobs running, you can check on https://integration.mediawiki.org/ci/ [20:53:52] <^demon> it was soooo bad. [20:53:54] <^demon> and ugly. [20:54:05] is LeslieCarr around? [20:54:10] there is six slots right now, if they are filled in, the jobs validation have to wait a bit :-D [20:54:18] matanya: leslie is out sick [20:54:19] New patchset: Andrew Bogott; "Fix a bug with orphan volume tracking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 [20:54:28] hashar: ohh, good link, thx for info [20:54:34] * matanya is sorry to hear [20:54:37] now i will pull up that page and stare at it in hate when waiting on it [20:55:33] mutante, can I get your eyes on https://gerrit.wikimedia.org/r/#/c/47026/ sometime? No rush, it's just a housekeeping patch. [20:56:43] can someone review https://gerrit.wikimedia.org/r/49843 please? [20:59:42] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:04:48] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [21:05:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [21:06:45] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 206 seconds [21:07:30] test.m.wikipedia.org is returning 503's, which is blocking a MobileFrontend deployment [21:07:34] can someone please take a look? [21:09:56] Ryan_Lane, paravoid, mutante, mark, etc? ^ [21:12:22] awjr - the folks here just left for lunch ... [21:12:39] will ask them to look into it when they get back [21:13:17] woosters: well we are in our deployment window right now - is anyone available to look into it who's not in the office? [21:15:44] it is late in Europe [21:16:56] woosters, omg now all ops are in one place - what if a plane falls on them or something?:P [21:17:19] luckily we have maxsem [21:17:21] :-) [21:17:45] * MaxSem hides [21:18:08] how about a volcano??! [21:18:13] woosters: is that a 'no' then? if so we'll reschedule, but i'd like to let the team know asap as people are just waiting around right now [21:18:18] hey [21:18:21] I'm back [21:18:22] looking [21:18:25] hi paravoid, thakns :) [21:18:35] they just walked in [21:19:08] awjr - was there a request to have ops on standby for this release? [21:19:43] woosters: nope, but our staging environment is currently down [21:20:18] no worries ... 
Ops is here :-) [21:20:22] \o/ [21:20:54] we're very close to not needing testwiki anymore - once we have everything finished on betalabs :D [21:24:20] New review: Milimetric; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/49710 [21:29:37] PROBLEM - Puppet freshness on sq41 is CRITICAL: Puppet has not run in the last 10 hours [21:29:50] paravoid: any luck? [21:32:15] paravoid: Hmm, I thought this had been merged and the Nagios warnings fixed already? https://gerrit.wikimedia.org/r/#/c/41819/ [21:32:21] sec. :) [21:32:49] hey... anyone who works on article feedback around? [21:36:13] New patchset: Faidon; "Fix test.m.wikipedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49925 [21:37:36] c'mon jenkins [21:38:01] hmm [21:38:01] that one pass [21:38:05] the tests are completed [21:38:12] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49925 [21:38:12] somehow, Zuul waits somewhere [21:38:20] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49925 [21:38:24] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/41819 [21:38:32] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41819 [21:40:11] RoanKattouw: strange, it was already merged it seems [21:41:49] paravoid: Duplicate maybe? [21:42:26] Oh, submitted Dec 31 [21:42:32] Maybe I submitted a duplicate of it later, I don't know [21:43:32] awjr: works now? [21:43:33] yes [21:43:38] k [21:43:40] paravoid, thanks! [21:43:41] sorry for the trouble. [21:43:56] paravoid: lgtm, thanks for getting that fixed :) [21:49:31] notice: Skipping run of Puppet configuration client; administratively disabled; use 'puppet Puppet configuration client --enable' to re-enable. [21:49:38] wtf. [21:49:44] Why would mw48 have this set? [21:50:34] should just be an application server. [21:50:37] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [21:51:04] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [21:52:39] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49843 [21:52:50] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49843 [21:54:05] !log mw48 was set to puppet agent disabled, re-enabled and fired puppet run [21:54:06] Logged the message, RobH [21:54:33] RECOVERY - Puppet freshness on mw48 is OK: puppet ran at Tue Feb 19 21:54:24 UTC 2013 [21:58:26] oooffff [22:00:04] PROBLEM - Host ms-be12 is DOWN: PING CRITICAL - Packet loss = 100% [22:01:03] hrm [22:01:09] apergos: that you? [22:01:32] or sbernardin [22:02:26] sbernardin [22:02:39] matthiasmullie: hi matthias, I'm really interested in getting this deployed: https://gerrit.wikimedia.org/r/#/c/43833/ what is the next step in the process for that? [22:02:44] ms-be11 is at 0 now also, as requested [22:02:56] thanks. 
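The test.m.wikipedia.org 503s above were only noticed once the deployment window had already opened. A pre-deploy smoke check along these lines turns "is staging even up?" into a one-liner; it is an illustrative sketch — only test.m.wikipedia.org itself comes from the discussion, while the retry count, timeout and User-Agent are assumptions.

```python
#!/usr/bin/env python3
"""Tiny pre-deployment smoke check for a handful of frontend URLs.

Illustrative sketch: the URL list, retries and pause are invented; only
test.m.wikipedia.org is taken from the conversation above.
"""
import time
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

URLS = ["https://test.m.wikipedia.org/"]
ATTEMPTS = 3

def probe(url):
    """Return the HTTP status code, or a short string if unreachable."""
    req = Request(url, headers={"User-Agent": "smoke-check"})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        return err.code
    except URLError as err:
        return "unreachable (%s)" % err.reason

for url in URLS:
    for attempt in range(1, ATTEMPTS + 1):
        code = probe(url)
        print("%s attempt %d -> %s" % (url, attempt, code))
        if code == 200:
            break
        time.sleep(2)
```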
[22:03:30] we currently have a couple of slaves that are broken as a result of this issues, including the research slave, so getting this deployed asap would be excellent [22:03:32] please let me know what else needs to happen (and when I can start putting ms-be11 back in) [22:04:13] * aude wonders when deployments are next week [22:04:30] * aude does timezone math [22:05:10] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:10:07] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 147439 seconds [22:11:18] !log preilly synchronized php-1.21wmf9/extensions/ArticleFeedback 'fix mysql 5.5 / ansi sql incompatibility' [22:11:20] Logged the message, Master [22:11:49] preilly: \o/ [22:11:55] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [22:12:03] AaronSchulz: thanks [22:14:01] RECOVERY - Puppet freshness on sq79 is OK: puppet ran at Tue Feb 19 22:13:47 UTC 2013 [22:14:55] RECOVERY - MySQL Slave Running on db59 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:15:04] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Tue Feb 19 22:14:54 UTC 2013 [22:15:14] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [22:15:21] !log sq79/80 puppet in locked state, manually restarted puppet, removed lock file, ran manually [22:15:22] Logged the message, RobH [22:15:57] !log taking down ms-be12 for hardware replacement per: https://rt.wikimedia.org/Ticket/Display.html?id=4546 [22:15:58] Logged the message, Master [22:17:07] !log sq41 not allowing ssh, drac resetting [22:17:08] Logged the message, RobH [22:17:19] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [22:17:21] Apergos: you need me? [22:17:44] ah sorry, no, just sayin that you were powering off bs-be12, yay and thanks [22:17:44] Apergos: just got back down here [22:17:56] *ms-be12 [22:18:04] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 150093 seconds [22:18:19] Apergos: going to setup drac and you'll be all set [22:18:46] Apergos: will ping you when I'm done [22:19:15] ok (I'll be gone by then but I'll see it tomorrow) [22:19:17] RECOVERY - ps1-d2-sdtpa-infeed-load-tower-A-phase-Y on ps1-d2-sdtpa is OK: ps1-d2-sdtpa-infeed-load-tower-A-phase-Y OK - 1538 [22:19:20] New patchset: Andrew Bogott; "Make manage-volumes more aware of project states." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 [22:20:31] !log sq41 puppet freshness bad, wont accept ssh connections, serial redirection results in no output. will need to reboot it [22:20:33] Logged the message, RobH [22:21:41] TimStarling: hello [22:21:49] hi [22:24:05] PROBLEM - Host sq41 is DOWN: PING CRITICAL - Packet loss = 100% [22:24:46] ya ya ya nagios we konw. [22:25:00] though its serial output doesnt show it coming back up =/ [22:29:41] !log sq41 offline, rt 4550 [22:29:42] Logged the message, RobH [22:31:01] heads up: scapping [22:41:40] !log preilly synchronized php-1.21wmf9/extensions/ArticleFeedback 'fix MAX call with GREATEST' [22:41:41] Logged the message, Master [22:41:53] preilly: thank you! [22:42:01] dberror log looks quiet now [22:47:20] binasher: np [22:47:27] binasher: sweet [22:49:01] !log maxsem Started syncing Wikimedia installation... 
: Weekly mobile deployment [22:49:02] Logged the message, Master [22:49:08] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [22:59:54] MaxSem: [22:59:55] Feb 19 22:59:00 10.64.0.93 apache2[32400]: PHP Fatal error: require() [function.require]: Failed opening required '/usr/local/apache/common-local/php-1.21wmf9/extensions/MobileFrontend/includes/MFCompatCheck.php' (include_path='/usr/local/apache/common-local/php-1.21wmf9/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/usr/local/apache/common-local/php-1. [22:59:55] 21wmf9:/usr/local/lib/php:/usr/share/php') in /usr/local/apache/common-local/php-1.21wmf9/includes/AutoLoader.php on line 1161 [23:00:17] And they're still coming [23:00:45] grrr [23:00:55] out of sync again? [23:01:25] reedy@fenari:/home/wikipedia/common$ sync-file php-1.21wmf9/extensions/MobileFrontend/includes/MFCompatCheck.php [23:01:25] Could not open input file: /home/wikipedia/common/php-1.21wmf9/extensions/MobileFrontend/includes/MFCompatCheck.php [23:01:25] Aborted due to syntax errors [23:01:41] yup, it was removed [23:02:26] Reedy, doing an out-of-band sync [23:03:17] !log maxsem synchronized php-1.21wmf9/extensions/MobileFrontend/ [23:03:18] Logged the message, Master [23:04:01] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [23:04:02] Logged the message, Master [23:05:01] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [23:05:55] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [23:09:01] MaxSem: solr checks are failing... [23:09:04] PROBLEM - Solr on solr1003 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:09:22] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 197 seconds [23:09:31] PROBLEM - Solr on solr1002 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:09:32] duh [23:09:40] PROBLEM - Solr on vanadium is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:09:49] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 210 seconds [23:10:07] PROBLEM - Solr on solr1001 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:10:07] PROBLEM - Solr on solr3 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:10:43] PROBLEM - Solr on solr1 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:10:52] PROBLEM - Solr on solr2 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:12:01] paravoid, the plugin should be there, judging by manifest - any ideas what's wrong? [23:12:55] /usr/bin/env: python2.7: No such file or directory [23:13:12] any particular reason to depend on 2.7? [23:13:33] the XML lib is borked in 2.6 [23:13:50] what does that mean? [23:14:28] originally, the hashbang pointed at generic python [23:14:45] there's no python 2.7 on this system. 
[23:15:09] however, when 2.6 was default, the script ran with errors [23:16:22] PROBLEM - NTP on mw1085 is CRITICAL: NTP CRITICAL: Offset unknown [23:16:25] New patchset: awjrichards; "Set wgMFPhotoUploadWiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49946 [23:17:15] New review: MaxSem; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49946 [23:18:04] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49946 [23:18:24] paravoid, if it's not possible to install 2.7 while keeping 2.6 default to prevent breaking the other stuff, I'll need to rewrite it [23:18:49] spence is lucid, there's no python 2.7 in lucid. [23:19:08] stone age:P [23:19:58] RECOVERY - NTP on mw1085 is OK: NTP OK: Offset -0.0007953643799 secs [23:21:23] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/49946/' [23:21:24] Logged the message, Master [23:21:30] awjr, ^^ [23:25:58] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 14 seconds [23:26:16] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [23:29:20] New patchset: MaxSem; "Don't check average query times for now cause it's broken on Python 2.6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49947 [23:30:27] New patchset: MaxSem; "Don't check average query times for now cause it's broken on Python 2.6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49947 [23:30:36] paravoid, ^ [23:30:41] sec [23:31:26] New patchset: Pyoungmeister; "db-pmtpa.php: commenting out db43 from s6 for dist upgrade/mariadb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49948 [23:36:23] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49948 [23:36:23] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49948 [23:36:57] whee, more mariadb is great [23:38:06] yep! [23:38:59] !log py synchronized wmf-config/db-pmtpa.php 'removing db43 from db-secondary for upgrades and maria' [23:39:00] Logged the message, Master [23:42:57] New patchset: Pyoungmeister; "mariafication of db43" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49950 [23:44:21] New review: Faidon; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49947 [23:44:30] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49947 [23:44:43] PROBLEM - mysqld processes on db43 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:47:11] !log do-release-upgrade-ing on db43 [23:47:13] Logged the message, notpeter [23:51:28] RECOVERY - MySQL disk space on neon is OK: DISK OK
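The Solr check failures at the end of the log came down to a `#!/usr/bin/env python2.7` hashbang on a Lucid host that only has 2.6: env exits with status 127 ("command not found") when the interpreter is missing, which is exactly what the "Return code of 127 is out of bounds" alerts were relaying, and the stop-gap merged above simply drops the average-query-time part of the check. The skeleton below sketches the version-gating idea; it is not the actual plugin (which isn't shown in the log), and the placeholder function, the 100 ms threshold and the output strings are all assumptions.

```python
#!/usr/bin/env python
"""Nagios-style check skeleton that degrades gracefully on older Python.

Not the real plugin from the log: fetch_avg_query_time() is a placeholder
and the 100 ms threshold is invented.  The point is the structure -- use a
hashbang that exists everywhere (so /usr/bin/env never exits 127), and
skip the sub-check that needs Python 2.7 instead of failing the whole run.
"""
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes

def fetch_avg_query_time():
    """Placeholder for the 2.7-only piece (Solr XML stats parsing)."""
    raise NotImplementedError("hypothetical in this sketch")

def main():
    if sys.version_info < (2, 7):
        # Same spirit as the merged stop-gap: skip the part known to
        # misbehave on 2.6 and still report a usable result.
        print("SOLR OK - average query time check skipped on Python %d.%d"
              % sys.version_info[:2])
        return OK
    try:
        avg_ms = fetch_avg_query_time()
    except Exception as err:
        print("SOLR UNKNOWN - %s" % err)
        return UNKNOWN
    if avg_ms > 100:  # invented threshold, for illustration only
        print("SOLR CRITICAL - average query time %.1f ms" % avg_ms)
        return CRITICAL
    print("SOLR OK - average query time %.1f ms" % avg_ms)
    return OK

if __name__ == "__main__":
    sys.exit(main())
```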