[00:00:20] * anomie notes that CodeEditor does not seem to like Monobook [00:00:56] New review: Legoktm; "Patch Set 2:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49704 [00:02:06] New patchset: Legoktm; "(bug 45083) Enable AbuseFilter IRC notifications on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49704 [00:16:30] anomie: Really? I haven't had any troubles on mediawiki.org using Monobook and CodeEditor. [00:16:34] What problems are you seeing? [00:17:09] Susan- Maybe it's something else then. "TypeError: $(...).data(...) is undefined" [00:17:57] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [00:18:57] https://en.wikipedia.org/wiki/Module:Bananas [00:19:01] (o: [00:20:06] TimStarling: Congrats on the deployment. :-) [00:20:16] thank you [00:21:26] https://en.wikipedia.org/w/index.php?title=Special%3AAllPages&from=&to=&namespace=828 [00:21:35] I guess that's "first!" [00:21:48] TimStarling: There was a request on WP:VPT about enabling importing from test2wiki. [00:21:51] Not sure if you saw it. [00:21:57] fp lol wtf bbq [00:22:13] https://en.wikipedia.org/wiki/Wikipedia:VPT#Import_from_test2.wikipedia.org [00:23:26] we should probably allow transwiki import from anywhere to anywhere [00:24:41] Yes, eventually. I think there are still some lurking transwiki bugs. [00:25:08] before centralauth, it made sense to restrict it [00:25:18] because it allowed you to forge your identity [00:25:28] Susan- I found the problem. One of my userscripts was blowing up on the lack of a Preview button in the Module namespace. Worked in Vector since I don't have that script there. D'oh. [00:26:03] Yeah, the lack of a preview button is a bit trippy. [00:26:13] And the tab key behavior is still driving me a little wild. [00:26:23] you mean it makes a tab? [00:26:25] There's a bug about making the escape key at least remove the focus from the textarea. [00:26:29] Yes. [00:26:36] that was the main reason I wanted CodeEditor, so that you could make tabs [00:26:50] Right. I have some deep muscle memory for using the tab key to switch from the textarea to the submit buttons, though. [00:26:59] Or the edit summary field. [00:27:14] So when I'm ready to move forward, I end up flailing. And eventually reaching for my mouse. [00:27:30] And usually cursing in the process. [00:27:53] https://bugzilla.wikimedia.org/show_bug.cgi?id=39649 is the relevant bug. 
[00:29:41] New patchset: Tim Starling; "Allow import from test2 to enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49794 [00:30:06] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [00:30:07] New review: Tim Starling; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49794 [00:30:08] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49794 [00:30:50] !log tstarling synchronized wmf-config/InitialiseSettings.php [00:30:51] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds [00:30:52] Logged the message, Master [00:31:00] RECOVERY - Puppet freshness on labstore2 is OK: puppet ran at Tue Feb 19 00:30:50 UTC 2013 [00:39:07] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 24 seconds [00:42:28] https://test2.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=protocols - this is fixed in master, does it need to be backported and deployed? https://gerrit.wikimedia.org/r/#/c/49744/ [00:42:54] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [00:42:56] Logged the message, Master [00:44:42] TimStarling: Can you add srv193 back to the mediawiki-installation group please? Guessing it got removed when it's siblings were decomissioned [00:45:47] done [00:46:08] thanks [00:46:09] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [00:46:12] * Reedy waits for sync-common to run [00:46:26] Krenair- It would need to be, yes [00:50:30] Cherry-picked to 1.21wmf10 - https://gerrit.wikimedia.org/r/49795 [00:52:09] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [00:53:48] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms [00:57:33] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [00:59:45] anomie, ? [00:59:57] Krenair- ? [01:02:28] anomie, https://gerrit.wikimedia.org/r/49795 - I guess it needs someone with shell access to deploy it now [01:06:31] Krenair- Yeah. I can do that. 
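The siprop=protocols breakage discussed above is easy to spot-check from the command line once the backport is synced. The sketch below is illustrative only and not part of the log: the API URL is the one Krenair quotes, while the JSON handling and the expectation that a fixed response carries a `query.protocols` list are assumptions.

```python
#!/usr/bin/env python3
"""Probe the siteinfo 'protocols' property discussed above.

Illustrative sketch: the endpoint is the one quoted in the log; the
format=json parameter and the success criterion are assumptions.
"""
import json
import sys
from urllib.request import urlopen

API = "https://test2.wikipedia.org/w/api.php"
QUERY = "?action=query&meta=siteinfo&siprop=protocols&format=json"

def main():
    with urlopen(API + QUERY) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    # A healthy response lists the allowed external-link protocols under
    # query.protocols; the broken revision rendered this differently
    # (the log only calls it "ugly").
    protocols = data.get("query", {}).get("protocols")
    if protocols:
        print("looks fixed, protocols advertised:", ", ".join(protocols))
        return 0
    print("still looks broken, response was:", json.dumps(data)[:200])
    return 1

if __name__ == "__main__":
    sys.exit(main())
```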
[01:12:39] !log anomie synchronized php-1.21wmf10/includes/api/ApiQuerySiteinfo.php 'Backport fix for ugly API bug' [01:12:43] Logged the message, Master [01:13:37] Krenair- There you go [01:26:30] PROBLEM - Puppet freshness on sq41 is CRITICAL: Puppet has not run in the last 10 hours [01:28:54] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:40:46] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [01:42:34] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds [01:43:55] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 223 seconds [01:44:58] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [01:50:49] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [01:57:16] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [01:58:19] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [02:09:16] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [02:16:01] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [02:27:25] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:28:26] !log LocalisationUpdate completed (1.21wmf9) at Tue Feb 19 02:28:25 UTC 2013 [02:28:28] Logged the message, Master [02:28:55] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:32:31] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [02:33:49] robla: around? [02:34:30] TimStarling: or you? [02:34:35] Hi Danny_B. [02:34:49] yes [02:34:54] hi always different girl [02:35:22] max [02:35:57] Danny_B: We thought it curious that most of the cs wikis wanted Scribunto, but not cs.wikipedia. [02:36:22] well, like I said, it's obvious enough why that was [02:37:05] i've sent czech namespace names to robla [02:37:15] but he perhaps forgot [02:37:15] I replied by email [02:37:27] ah... [02:37:29] to you and robla [02:37:54] * Danny_B felt asleep unfortunatelly because of being sick, just woke up now at 3:30am [02:38:53] re cswiki: they are very very conservative, they don't even like wikidata, most of the active users keep monobook etc... [02:40:01] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [02:41:53] TimStarling: how about that "script" name? various ppl asked why it is not that since module is sort of ambiguous and not descriptive for them (quoting) [02:42:54] you mean it's ambiguous in czech? or it's ambiguous in english? [02:43:58] I think the issue with the term "script" is that they aren't scripts [02:47:19] makes a sense, i'll try to forward that [02:49:02] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [02:49:33] !log LocalisationUpdate completed (1.21wmf10) at Tue Feb 19 02:49:32 UTC 2013 [02:49:35] Logged the message, Master [02:53:06] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [03:02:01] New review: Danny B.; "Patch Set 1: Code-Review+1" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/49681 [03:02:36] can somebody sync ^^ pls? [03:04:18] weird [03:06:04] it's already the same, isn't it? 
[03:06:23] i still see the old one [03:06:45] which is same as wp [03:06:55] = undistinguishable [03:07:18] yes, same as wikipedia [03:07:30] enwikt has different [03:08:14] (i actually think all wikts should have it, but someone has to decide that) [03:10:02] it seems ridiculous to me, to file a separate bug for each wiki and to list every single wiktionary in that file [03:10:57] same to me. but as i said above - someone else has to decide on central basis [03:11:25] well, I'm not going to push that out [03:12:21] # Test for making favicon.ico a script [03:12:21] RewriteRule ^/test-favicon\.ico$ /w/favicon.php [L] [03:12:34] argh [03:13:17] I had forgotten that that project was incomplete [03:13:39] incomplete = ? [03:23:59] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [03:32:57] TimStarling: what reason i should forward for not being set? wontfix / later / other? [03:37:27] I just have better things to do at the moment, maybe someone else will deploy it [03:38:50] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 194 seconds [03:39:08] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 206 seconds [03:39:53] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 239 seconds [03:40:02] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 245 seconds [03:42:26] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 7 seconds [03:42:44] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 2 seconds [03:43:29] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 29 seconds [03:43:38] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 30 seconds [03:43:44] TimStarling: ah ok, no prob. thought you actually rejected it. 
thx [03:52:42] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 182 seconds [03:53:00] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 192 seconds [03:58:06] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [03:59:18] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 190 seconds [03:59:18] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 190 seconds [04:01:42] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [04:07:24] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [04:33:20] New patchset: Tim Starling; "Move all favicons to bits" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49802 [04:34:24] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [04:37:33] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [04:38:27] PROBLEM - Puppet freshness on sq79 is CRITICAL: Puppet has not run in the last 10 hours [04:39:30] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [04:42:29] New patchset: Tim Starling; "(bug 45113) Set cswiktionary favicon to the same as enwiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49681 [04:43:10] New review: Tim Starling; "Patch Set 2:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49681 [04:46:51] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [04:48:48] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [05:19:42] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [06:04:44] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 23 seconds [06:06:32] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [06:19:17] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 8 seconds [06:19:44] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [07:33:37] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: CRIT replication delay 185 seconds [07:33:46] PROBLEM - MySQL Slave Delay on db46 is CRITICAL: CRIT replication delay 191 seconds [07:39:01] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay 0 seconds [07:39:19] RECOVERY - MySQL Slave Delay on db46 is OK: OK replication delay 0 seconds [08:05:01] PROBLEM - Puppet freshness on mw48 is CRITICAL: Puppet has not run in the last 10 hours [08:13:16] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [08:16:43] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [08:25:08] hello [08:28:22] hey :) [08:32:57] paravoid: I noticed your email about some contint module :-D [08:33:14] guess it is a bit too late for your to merge it now -:] but maybe tommorrow [08:33:48] how is your jet lag ? 
:-] [08:57:22] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 212 seconds [08:57:40] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 220 seconds [09:03:04] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 3 seconds [09:03:04] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 2 seconds [09:09:08] New patchset: Hashar; "Jenkins reprepo broken cause of double redirect" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49807 [09:11:43] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49807 [09:11:54] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49807 [09:13:48] aptmethod error receiving 'http://mirrors.jenkins-ci.org/debian-stable/binary/Release.gpg': [09:13:49] '404 Not Found' [09:13:49] There have been errors! [09:13:54] that's from checkupdate [09:13:58] :-( [09:14:09] aptmethod error receiving 'http://mirrors.jenkins-ci.org/debian-stable/binary/Release': [09:14:10] '404 Not Found' [09:14:17] not that either, so I guess http wo't get it [09:14:20] so the mirrors is not a package repository :( [09:14:30] ah maybe not [09:14:56] http://mirrors.jenkins-ci.org/debian-stable/ [09:15:03] yeah that's the dir with stuff in it [09:15:37] yeah it only got the packages [09:15:47] yep [09:15:56] the commit is wrong [09:16:04] New patchset: Hashar; "Revert "Jenkins reprepo broken cause of double redirect"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49808 [09:16:12] the redirects thing is fixed in the reprepro that we're running [09:16:14] apergos: and here is the revert https://gerrit.wikimedia.org/r/#/c/49808/ [09:16:38] you're supposed to review apergos ;) [09:16:44] New review: Hashar; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49808 [09:17:11] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49808 [09:17:22] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49808 [09:17:24] off for the night [09:17:29] bye [09:17:29] ;-D [09:17:34] have a good night Faidon! [09:18:00] apergos: so apparently there is an updated reprepo somewhere :-] [09:18:03] night [09:18:08] ah [09:18:21] on which machine have you tried repress ? [09:18:28] repress? 
[09:18:29] grrr [09:18:39] DAMN YOU APPLE AUTO SPELL CHECKER [09:18:39] I was on brewster (apt.wm.o) [09:19:01] according to Faidon: "the redirects thing is fixed in the reprepro that we're running" [09:19:04] but then hmm [09:22:46] apergos: I guess I will bug that , probably nothing we can do [09:22:57] ok, thanks [09:23:03] I don't have access on brewster to debug that :-D [09:23:38] 40 minutes to figure out the system that is supposed to makes life easier [09:23:42] :-( [09:24:10] root@brewster:~# apt-show-versions reprepro [09:24:11] reprepro/lucid-wikimedia uptodate 4.12.5-1~lucid1 [09:24:20] just so you have that [09:24:20] ah good [09:24:25] will post to ops list [09:24:31] Ibrewster runs lucid) [09:39:52] apergos: mail sent + bug opened to track the issue :-] [09:39:57] apergos: sorry for all the trouble :( [09:40:32] saw it [09:40:34] no worries [09:40:44] (saw the mail that is) [09:41:30] * apergos is trying the 'eat soup and see if it stays in' experiment today after yesterday's disasterous 'eat dinner' failure [09:44:53] will grab a coffee [09:44:55] and relax :-D [10:09:51] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [10:18:51] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [10:34:18] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 187 seconds [10:34:54] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 195 seconds [10:38:30] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [10:39:42] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [10:42:23] New patchset: Hashar; "jenkins: make Git plugin verbose" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49814 [10:42:46] would get that post lunch :-] [10:58:45] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [11:01:09] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 205 seconds [11:01:45] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 219 seconds [11:10:21] out for lunch [11:28:06] PROBLEM - Puppet freshness on sq41 is CRITICAL: Puppet has not run in the last 10 hours [11:29:34] !log installing package upgrads on neon/icinga [11:29:36] Logged the message, Master [12:10:21] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [12:24:36] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [12:25:12] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [12:28:48] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [12:33:00] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [12:33:54] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:44] !og powercycling ms-be7 (unresponsive to ping or from console) [12:38:36] apergos: l [12:38:55] ? [12:39:05] you missed the l [12:39:09] ah [12:39:14] * apergos tries again [12:39:22] !log powercycling ms-be7 (unresponsive to ping or from console) [12:39:23] Logged the message, Master [12:42:09] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [12:48:19] New patchset: Dzahn; "add wikimedia.jp.net and wikimediacommons.jp.net server aliases and redirects (RT-4356 etc.)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49822 [12:57:31] gerrit issue? 
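The reprepro checkupdate failure above boiled down to 404s on Release and Release.gpg: the Jenkins mirror directory only carries packages, not apt repository metadata. A small sanity check along these lines makes that diagnosis quickly; the two URLs are the ones reprepro printed, everything else in the sketch is illustrative.

```python
#!/usr/bin/env python3
"""Check whether a URL prefix actually serves apt repository metadata.

The paths probed are the ones reprepro complained about in the log;
a 404 on Release means "directory of .debs, not an apt repo", which is
the conclusion reached in the discussion above.
"""
from urllib.request import Request, urlopen
from urllib.error import HTTPError

BASE = "http://mirrors.jenkins-ci.org/debian-stable/binary"

def status(url):
    """Return the HTTP status code for a HEAD request to url."""
    try:
        with urlopen(Request(url, method="HEAD")) as resp:
            return resp.status
    except HTTPError as err:
        return err.code

for name in ("Release", "Release.gpg"):
    url = "%s/%s" % (BASE, name)
    print(url, "->", status(url))
```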
[12:58:12] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [13:01:55] works again [13:09:50] New review: Dzahn; "Patch Set 1: Code-Review+2" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49822 [13:10:26] hashar: apache lint gives me Verified +1 but it still needs Verified +2 manually. is that as intended? [13:10:38] mutante: probably not [13:11:05] I guess it should V+2 just in ops/puppet.git [13:11:39] oh wait, it does say Verified+2 in the comment [13:11:48] but Gerrit says "Need Verified" when i submit [13:12:14] doesn't merge unless i do this now: [13:12:23] New review: Dzahn; "Patch Set 1: Verified+2" [operations/apache-config] (master); V: 2 - https://gerrit.wikimedia.org/r/49822 [13:12:24] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49822 [13:12:56] mutante: https://gerrit.wikimedia.org/r/49824 :-D [13:13:07] usually we vote V+1 on lint [13:13:11] and V+2 on unit tests [13:13:55] it's amazing how you always have an existing patch already :) [13:14:05] we did that for ops/puppet :-] [13:14:43] I have deployed the change [13:15:14] ah:) thanks [13:15:24] ideally I would want to run Jeff's integration test on Jenkins [13:16:02] should be possible by loading an apache process listening on port 80 on some loopback address (such as 127.0.0.80) [13:17:14] nods [13:20:09] !log gracefulling eqiad Apaches via dsh to push redirect change for .jp.net domains [13:20:10] Logged the message, Master [13:25:16] dzahn is doing a graceful restart of all apaches [13:25:54] !log dzahn gracefulled all apaches [13:25:56] Logged the message, Master [13:27:48] !log DNS update - adding wikimedia.jp.net [13:27:50] Logged the message, Master [13:29:21] !log wikimedia.jp.net now redirects to wikimedia.org | wikimediacommons.jp.net now redirects to commons.wm [13:29:22] Logged the message, Master [13:52:02] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 191 seconds [13:52:20] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 197 seconds [14:08:19] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [14:08:23] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [14:08:23] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [14:08:32] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [14:10:20] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 189 seconds [14:10:56] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [14:11:05] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 202 seconds [14:13:58] !g Ic166d23a78d9bd8e3d0fdc5017da4aabf28dbca9 [14:13:58] https://gerrit.wikimedia.org/r/#q,Ic166d23a78d9bd8e3d0fdc5017da4aabf28dbca9,n,z [14:20:23] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [14:22:27] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [14:23:11] mutante: there around ? 
:-D [14:23:22] yea [14:23:24] mutante: I got a hack of a hack for you :-] [14:23:25] https://gerrit.wikimedia.org/r/#/c/49708/ [14:23:34] role::cache::upload defaults to using squid [14:23:42] but sometime have varnish instead :-] [14:23:57] the hack would let us get varnish in labs instead of squid :-] [14:24:25] hopefully the A or B or C is not screwed hehe [14:24:48] but I am not sure whether you are willing to +2 a change related to varnish [14:25:00] another trivial one would be https://gerrit.wikimedia.org/r/49814 [14:25:09] which pass some argument to the jenkins service on startup :-] [14:27:49] confirmed that Jenkins options for verbose by quick search ;) [14:28:10] New review: Mark Bergsma; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49708 [14:28:20] ;D [14:28:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49708 [14:28:26] oh and mark is around \O/ [14:28:31] even better:) yay [14:28:36] just in time haha [14:28:56] mark: ideally I would have refactored the hack to something nicer but I am feeling lazy :-] [14:29:01] it's a hack already [14:29:02] New review: Dzahn; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49814 [14:29:04] it's going away soon [14:29:07] \O/ [14:29:10] so why care if you make it 20% worse [14:29:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49814 [14:29:17] I tried to get Lucid installed on a Precise box but that does not work [14:29:19] hashar: we should just watch logfile sizes was my only comment [14:29:23] why bother [14:29:28] upload squids are going away in weeks [14:29:48] yeah so eventually I found out that we have upload caches using varnish [14:29:58] eventually? ;) [14:30:03] there have been many mails about that hehe [14:30:06] <-- did not know [14:30:07] probably blog posts even [14:30:16] yeah I am sure about that [14:30:26] I wasn't sure of the progress [14:31:16] I will "just" have to tweak some lvs::configuration::lvs_service_ips stuff now :-] [14:38:30] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [14:39:33] PROBLEM - Puppet freshness on sq79 is CRITICAL: Puppet has not run in the last 10 hours [14:40:36] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [14:56:22] New patchset: Hashar; "beta: lvs upload IP set to the apache backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49830 [14:56:54] manganese has a huge load :( [14:56:55] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=manganese.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [14:57:01] started roughly 20 minutes ago [14:57:12] makes gerrit a bit slow :-] [14:58:00] hashar: saw that earlier, it was pretty busy for a few minutes, then calmed down again by itself [14:58:03] !log manganese CPU spike ongoing. Started roughly at 14:40 UTC [14:58:05] Logged the message, Master [14:58:13] I am wondering what was running there [14:58:22] it was then working on something in mw core git [14:58:27] hmm [14:58:35] maybe a garbage collection [14:58:46] and shortly after a bunch of reviews appeared at once in #mediawiki via gerrit-wm [14:59:38] mark: if you are still around, I could use a hack to the lvs::configuration hash for labs. We don't have LVS there so would need to specify the backends directly .. 
https://gerrit.wikimedia.org/r/#/c/49830/1/manifests/lvs.pp,unified [14:59:51] hashar: 05:10 < gerrit-wm> New review: Parent5446; "Patch Set 3: Code-Review-1" [mediawiki/core] (master) C: -1; - https://gerrit.wikimedia.org/r/45651 [14:59:54] 05:10 < gerrit-wm> New review: Parent5446; "Patch Set 9: Code-Review-1" [mediawiki/core] (master) C: -1; - https://gerrit.wikimedia.org/r/27022 [15:00:23] they are just reviews, should not cause any troubles [15:00:25] hashar: there were a whole bunch all with the same "Parent5446" at once, and it seemed like it took a while then finished them at once [15:00:27] though nobody knows :-] [15:00:30] <^demon> hashar: There's no garbage collection. [15:00:32] maybe ^demon was working on manganese [15:00:38] <^demon> Nor was I logged in. [15:00:44] okkk [15:00:57] so we just had a 20min CPU spike on manganese. [15:01:00] PROBLEM - MySQL Slave Delay on db44 is CRITICAL: CRIT replication delay 186 seconds [15:01:03] see gerrit-wm around 05:10 [15:01:10] might not be that much of an issue [15:01:28] but if that occurs often we will want to find out the root cause. [15:01:36] PROBLEM - MySQL Slave Delay on db1005 is CRITICAL: CRIT replication delay 196 seconds [15:01:38] I just hope it is not Jenkins putting too much pressure on the host [15:02:12] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 237 seconds [15:02:13] !log restarting jenkins to enable some startup setting [15:02:15] Logged the message, Master [15:02:16] all i saw it was java and high load that went down again.. yep [15:02:48] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: CRIT replication delay 240 seconds [15:04:08] <^demon> What the fuck? [15:04:15] <^demon> The hell is this. [15:05:30] New review: Mark Bergsma; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49830 [15:05:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49830 [15:06:06] \O/ [15:06:25] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 3 seconds [15:07:00] RECOVERY - MySQL Slave Delay on db1005 is OK: OK replication delay 1 seconds [15:07:32] <^demon> [2013-02-19 14:55:54,775] WARN org.eclipse.jetty.io.nio : Dispatched Failed! SCEP@151bbc99{l(/127.0.0.1:47351)<->r(/127.0.0.1:8080),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}-{AsyncHttpConnection@4ffaeba2,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0} to org.eclipse.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@3699d37e [15:07:48] <^demon> ^ Spammed the logs massively up until about 12m ago. [15:08:31] <^demon> qchris: Does that error make *any* bit of sense to you? [15:08:53] Nope. [15:09:25] What's on port 47351? [15:09:34] a random client port ? [15:09:56] hashar: :-) [15:09:57] I am wondering if it could be caused by some bot spamming the json api [15:09:59] <^demon> qchris: Nothing I've setup for gerrit, that's for sure... [15:10:15] Can it be apache forwarding the request to gerrit? [15:11:11] <^demon> [2013-02-19 12:55:22,464] was the timestamp of the first instance. [15:11:20] <^demon> [2013-02-19 14:55:54,775] is when it stopped. [15:11:46] It stopped on its own, or you killed some process/connection? [15:12:27] <^demon> I didn't kill anything, it had stopped already by the time I started tailing the log. 
[15:13:49] https://groups.google.com/forum/?fromgroups=#!topic/repo-discuss/TVhBSBHEJ9E [15:13:56] ^ Suggests to switch away from Jetty [15:16:34] <^demon> Can of worms I'd rather not open :) [15:16:45] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [15:17:21] PROBLEM - MySQL Slave Delay on db44 is CRITICAL: CRIT replication delay 184 seconds [15:17:45] However in that thread, Shawn Pearce says "This is a known problem with Jetty" [15:18:33] PROBLEM - MySQL Replication Heartbeat on db44 is CRITICAL: CRIT replication delay 191 seconds [15:18:53] <^demon> qchris: So the alternative is "use Tomcat?" :\ [15:19:01] :-)) [15:19:10] Or to live with the problem. [15:19:23] <^demon> Well, this is the first time this has ever hit us that I know of. [15:19:53] Once a year is not too bad is it? [15:21:27] <^demon> It's just jetty exploding. As long as it doesn't affect jgit and leave the repos in some weird inconsistent state... [15:34:06] New patchset: Hashar; "beta: lvs fake config for upload caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49831 [15:43:36] <^demon> qchris: https://code.google.com/p/gerrit/issues/detail?id=233 is such an annoying exception :\ [15:43:42] <^demon> Too bad there's no easy fix. [15:44:24] Yup. [15:44:44] So just to rule out other causes: It was always the same message [15:45:04] (same AsyncHttpConnection etc) [15:45:50] <^demon> The general error was the same, but some of the sha1s were a bit different. As were the ports. [15:46:08] The Jetty part on the Scaling page of gerrit Wiki even mentions our problem :-( [15:46:20] <^demon> I can pastebin a wider selection of the log (had to manually rotate already, it was 1G and the compression was blocking other processes) [15:46:31] Not necessary. [15:46:48] Thanks nontheless. [15:47:25] <^demon> http://p.defau.lt/?eTUz9C_CtCKxU2ijcUTE5g - here were the last 1000 entries. [15:50:35] <^demon> qchris: We may have very well outgrown the box we're on. We may need more resources and/or use tomcat. [15:50:57] <^demon> It was a beefy box at the time, but our gerrit usage has grown dramatically since then. [15:51:01] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [15:51:15] Do we have some data on the machine's load etc? [15:51:36] <^demon> Yeah, in ganglia. [15:51:50] Ja, sure. Sorry :-( [15:51:57] <^demon> eg: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=manganese.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [15:52:31] mutante: my dad still uses a fax :) [15:54:55] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [15:54:59] jeremyb_: hehe, for a while i had some SIP provider offering me FoIP ,but even that ..never used anyways http://en.wikipedia.org/wiki/Internet_fax#Fax_using_Voice_over_IP [15:56:20] highlights "Not all fax equipment claiming T.38 compliance works reliably;" ..yea, as usual :p [15:56:25] mutante: i assume that's just t.38? [15:56:30] which makes it even more annoying [15:56:40] back later tonight [15:58:27] jeremyb_: yea,that, plus letting you upload PDFs or even use a virtual printer driver [15:58:39] yeah [15:58:44] someone want to comment on RT 2675 ? [15:59:17] btw, just to be clear: unread msgs in RT are unread per user? or marking read is read for everyone? 
[16:02:00] jeremyb_: per user, because it's a setting you can change in your profile [16:02:03] Notify me of unread messages [16:02:10] oh, huh [16:02:11] Yes/No/Default [16:08:42] ^demon: When browsing through ganglia, I do not get the impression that we have outgrown the machine yet :-( [16:09:07] <^demon> Perhaps not. [16:10:44] So should we prepare a tomcat test setup, or do we rather wait and see if it occurs again? [16:11:26] seriously chad [16:11:35] a box like manganese should be AMPLE for something like gerrit [16:20:52] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [16:23:47] <^demon> mark: yeah, nevermind me. [16:24:46] but the same could be said for e.g. puppet... [16:25:07] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [16:47:54] <^demon> qchris: For now, we'll just keep an eye on it. If it becomes a problem, we can start playing with Tomcat. [16:48:05] <^demon> Hopefully this was just a one-off thing. [16:48:09] Ok. Sounds good. [17:28:50] New patchset: Silke Meyer; "Fixing broken dependencies for updating mw-extensions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49836 [17:29:18] Change abandoned: Silke Meyer; "Sorry. I redid that change on a fresh clone..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48474 [18:03:21] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [18:06:12] PROBLEM - Puppet freshness on mw48 is CRITICAL: Puppet has not run in the last 10 hours [18:07:42] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [18:13:54] New review: Andrew Bogott; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49232 [18:14:05] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49232 [18:16:02] New review: Andrew Bogott; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49836 [18:16:12] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49836 [18:17:46] New review: Andrew Bogott; "Patch Set 4: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48979 [18:17:55] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48979 [18:18:00] paravoid, yt? [18:24:50] MaxSem: hey [18:25:20] paravoid, what did the Solr affar end with? I still see the old plugin in Nagios [18:27:21] the "attempt 2" is merged afaik [18:27:33] but puppet not run? 
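^demon bracketed the Jetty "Dispatched Failed!" spam above by reading timestamps out of a log that had grown to roughly 1 GB before he rotated it by hand. A throwaway script like the following does the same bracketing and shows the per-minute rate; it is purely illustrative — the log path and the `[YYYY-MM-DD HH:MM:SS,mmm]` timestamp format are assumptions based on the excerpts pasted in the channel.

```python
#!/usr/bin/env python3
"""Bracket a recurring warning in a Gerrit/Jetty error log.

Assumptions (not from the log itself): the file path, and that lines
begin with a "[YYYY-MM-DD HH:MM:SS,mmm]" timestamp like the pasted
excerpts.
"""
import collections
import re
import sys

LOG = sys.argv[1] if len(sys.argv) > 1 else "error_log"
PATTERN = "Dispatched Failed!"
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}\]")

first = last = None
per_minute = collections.Counter()

with open(LOG, errors="replace") as fh:
    for line in fh:
        if PATTERN not in line:
            continue
        m = STAMP.match(line)
        if not m:
            continue
        minute = m.group(1)
        per_minute[minute] += 1
        first = first or minute
        last = minute

if first is None:
    print("no '%s' entries found" % PATTERN)
else:
    print("first: %s  last: %s  total: %d"
          % (first, last, sum(per_minute.values())))
    # show the last few minutes of activity to see whether it tailed off
    for minute, count in sorted(per_minute.items())[-5:]:
        print("  %s  %d/min" % (minute, count))
```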
[18:28:01] not sure [18:28:06] let me force run puppet on spence [18:28:58] paravoid, don't [18:29:16] I forgot to rerererevert solr.pp [18:29:29] will do that shortly [18:29:45] dammit, it's hard to think at 6AM [18:30:05] :( [18:34:08] New patchset: MaxSem; "Changes forgotten in https://gerrit.wikimedia.org/r/#/c/49372/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49843 [18:35:06] paravoid, Guest54769 ^^ [18:52:15] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay 0 seconds [18:52:42] RECOVERY - MySQL Slave Delay on db44 is OK: OK replication delay 0 seconds [19:03:02] New patchset: Legoktm; "(bug 45165) Create rollbacker group for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49847 [19:12:54] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [19:50:03] paravoid: got any time ahead to merge the contint modules ? :-D [19:50:10] paravoid: though I guess it is almost lunch time in SF [19:53:40] GuidedTour broke again on the wikis that are using 1.21wmf10. [19:53:59] That's currently testwiki, test2wiki, MW, and Wikidata. [19:54:09] I'm going to do a submodule update and syncdir to fix it. [19:57:30] is GuidedTour enabled on wikidata? [19:58:27] duh, no. [19:58:33] ok [19:59:13] Those are just the wikis that have 1.21wmf10. [19:59:27] But we could probably enable it on Wikidata later if that is requested. [19:59:33] !log disabling project storage volumes for all labs projects with no instances [19:59:35] Logged the message, Master [20:04:11] !log mflaschen synchronized . 'Syncing GuidedTour submodule omitted when new branch cut was first cloned.' [20:04:12] Logged the message, Master [20:05:24] New review: Hashar; "Patch Set 1: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/49678 [20:13:58] hrm, when did notpeter's last name get the suffix? [20:14:13] the what? [20:14:19] the cloak? [20:14:20] Youngmeisterarius [20:15:06] oh, hahaha [20:15:13] https://bugzilla.wikimedia.org/show_bug.cgi?id=43663#c5 [20:15:18] I htink I demanded that someone else set up my bz account when i first started [20:15:24] and that was the result [20:15:59] do you get a crown? [20:17:06] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [20:17:33] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 194 seconds [20:19:39] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [20:21:10] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 17 seconds [20:22:31] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:22:39] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [20:26:42] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [20:48:18] New patchset: RobH; "fixing nagios false positive error for tower b in ps1-c3-sdtpa & ps1-d2-sdtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49915 [20:50:34] jenkins bot, why you so slowwwww [20:50:52] * RobH blames and pings ^demon cuz well, he feels like being a jerk. [20:50:59] ;] [20:51:15] RobH: might be zuul / jenkins [20:51:24] it did vote :-] [20:51:27] did everyone hate jenkinsbot outputting to channel? [20:51:36] yep, took awhile, i miss it pinging channel. 
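The GuidedTour breakage above came from a submodule that was never synced when the 1.21wmf10 branch was first cloned. A check like the one below flags submodules that are missing or sitting at the wrong commit, using the standard prefix characters of `git submodule status`; the checkout path is only an example, not a path taken from the log.

```python
#!/usr/bin/env python3
"""Flag submodules that are uninitialized or not at the recorded commit.

`git submodule status` prefixes each line with '-' (not initialized),
'+' (checked-out commit differs from the one recorded in the superproject)
or 'U' (merge conflicts); a leading space means everything matches.
The checkout path below is an illustrative default.
"""
import subprocess
import sys

CHECKOUT = sys.argv[1] if len(sys.argv) > 1 else "php-1.21wmf10"

out = subprocess.check_output(
    ["git", "submodule", "status"], cwd=CHECKOUT, universal_newlines=True
)

problems = [line for line in out.splitlines() if line and line[0] in "-+U"]
if problems:
    print("submodules needing attention:")
    for line in problems:
        print(" ", line)
    sys.exit(1)
print("all submodules match the recorded commits")
```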
[20:51:41] (i know its not done that for awhile now) [20:51:43] <^demon> RobH: Yuppp, jenkinsbot spams lots, so we hacked him out :) [20:51:51] put it bacccccckkkk [20:51:59] <^demon> More people will hate me. [20:52:05] but I will love you [20:52:09] put back for this place ? :-D [20:52:14] my love is worth the hate of millions. [20:52:53] New review: RobH; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49915 [20:53:01] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49915 [20:53:14] gerrit was faster when no one but ops was using it quite yet ;] [20:53:46] <^demon> gerrit sucked back then. [20:53:51] that might be because there are too many jenkins jobs running, you can check on https://integration.mediawiki.org/ci/ [20:53:52] <^demon> it was soooo bad. [20:53:54] <^demon> and ugly. [20:54:05] is LeslieCarr around? [20:54:10] there is six slots right now, if they are filled in, the jobs validation have to wait a bit :-D [20:54:18] matanya: leslie is out sick [20:54:19] New patchset: Andrew Bogott; "Fix a bug with orphan volume tracking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 [20:54:28] hashar: ohh, good link, thx for info [20:54:34] * matanya is sorry to hear [20:54:37] now i will pull up that page and stare at it in hate when waiting on it [20:55:33] mutante, can I get your eyes on https://gerrit.wikimedia.org/r/#/c/47026/ sometime? No rush, it's just a housekeeping patch. [20:56:43] can someone review https://gerrit.wikimedia.org/r/49843 please? [20:59:42] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:04:48] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [21:05:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [21:06:45] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 206 seconds [21:07:30] test.m.wikipedia.org is returning 503's, which is blocking a MobileFrontend deployment [21:07:34] can someone please take a look? [21:09:56] Ryan_Lane, paravoid, mutante, mark, etc? ^ [21:12:22] awjr - the folks here just left for lunch ... [21:12:39] will ask them to look into it when they get back [21:13:17] woosters: well we are in our deployment window right now - is anyone available to look into it who's not in the office? [21:15:44] it is late in Europe [21:16:56] woosters, omg now all ops are in one place - what if a plane falls on them or something?:P [21:17:19] luckily we have maxsem [21:17:21] :-) [21:17:45] * MaxSem hides [21:18:08] how about a volcano??! [21:18:13] woosters: is that a 'no' then? if so we'll reschedule, but i'd like to let the team know asap as people are just waiting around right now [21:18:18] hey [21:18:21] I'm back [21:18:22] looking [21:18:25] hi paravoid, thakns :) [21:18:35] they just walked in [21:19:08] awjr - was there a request to have ops on standby for this release? [21:19:43] woosters: nope, but our staging environment is currently down [21:20:18] no worries ... 
Ops is here :-) [21:20:22] \o/ [21:20:54] we're very close to not needing testwiki anymore - once we have everything finished on betalabs :D [21:24:20] New review: Milimetric; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/49710 [21:29:37] PROBLEM - Puppet freshness on sq41 is CRITICAL: Puppet has not run in the last 10 hours [21:29:50] paravoid: any luck? [21:32:15] paravoid: Hmm, I thought this had been merged and the Nagios warnings fixed already? https://gerrit.wikimedia.org/r/#/c/41819/ [21:32:21] sec. :) [21:32:49] hey... anyone who works on article feedback around? [21:36:13] New patchset: Faidon; "Fix test.m.wikipedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49925 [21:37:36] c'mon jenkins [21:38:01] hmm [21:38:01] that one pass [21:38:05] the tests are completed [21:38:12] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49925 [21:38:12] somehow, Zuul waits somewhere [21:38:20] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49925 [21:38:24] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/41819 [21:38:32] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41819 [21:40:11] RoanKattouw: strange, it was already merged it seems [21:41:49] paravoid: Duplicate maybe? [21:42:26] Oh, submitted Dec 31 [21:42:32] Maybe I submitted a duplicate of it later, I don't know [21:43:32] awjr: works now? [21:43:33] yes [21:43:38] k [21:43:40] paravoid, thanks! [21:43:41] sorry for the trouble. [21:43:56] paravoid: lgtm, thanks for getting that fixed :) [21:49:31] notice: Skipping run of Puppet configuration client; administratively disabled; use 'puppet Puppet configuration client --enable' to re-enable. [21:49:38] wtf. [21:49:44] Why would mw48 have this set? [21:50:34] should just be an application server. [21:50:37] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [21:51:04] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [21:52:39] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49843 [21:52:50] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49843 [21:54:05] !log mw48 was set to puppet agent disabled, re-enabled and fired puppet run [21:54:06] Logged the message, RobH [21:54:33] RECOVERY - Puppet freshness on mw48 is OK: puppet ran at Tue Feb 19 21:54:24 UTC 2013 [21:58:26] oooffff [22:00:04] PROBLEM - Host ms-be12 is DOWN: PING CRITICAL - Packet loss = 100% [22:01:03] hrm [22:01:09] apergos: that you? [22:01:32] or sbernardin [22:02:26] sbernardin [22:02:39] matthiasmullie: hi matthias, I'm really interested in getting this deployed: https://gerrit.wikimedia.org/r/#/c/43833/ what is the next step in the process for that? [22:02:44] ms-be11 is at 0 now also, as requested [22:02:56] thanks. 
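The test.m.wikipedia.org 503s above were only noticed once the deployment window had already opened. A pre-deploy smoke check along these lines turns "is staging even up?" into a one-liner; it is an illustrative sketch — only test.m.wikipedia.org itself comes from the discussion, while the retry count, timeout and User-Agent are assumptions.

```python
#!/usr/bin/env python3
"""Tiny pre-deployment smoke check for a handful of frontend URLs.

Illustrative sketch: the URL list, retries and pause are invented; only
test.m.wikipedia.org is taken from the conversation above.
"""
import time
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

URLS = ["https://test.m.wikipedia.org/"]
ATTEMPTS = 3

def probe(url):
    """Return the HTTP status code, or a short string if unreachable."""
    req = Request(url, headers={"User-Agent": "smoke-check"})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        return err.code
    except URLError as err:
        return "unreachable (%s)" % err.reason

for url in URLS:
    for attempt in range(1, ATTEMPTS + 1):
        code = probe(url)
        print("%s attempt %d -> %s" % (url, attempt, code))
        if code == 200:
            break
        time.sleep(2)
```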
[22:03:30] we currently have a couple of slaves that are broken as a result of this issues, including the research slave, so getting this deployed asap would be excellent [22:03:32] please let me know what else needs to happen (and when I can start putting ms-be11 back in) [22:04:13] * aude wonders when deployments are next week [22:04:30] * aude does timezone math [22:05:10] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:10:07] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 147439 seconds [22:11:18] !log preilly synchronized php-1.21wmf9/extensions/ArticleFeedback 'fix mysql 5.5 / ansi sql incompatibility' [22:11:20] Logged the message, Master [22:11:49] preilly: \o/ [22:11:55] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [22:12:03] AaronSchulz: thanks [22:14:01] RECOVERY - Puppet freshness on sq79 is OK: puppet ran at Tue Feb 19 22:13:47 UTC 2013 [22:14:55] RECOVERY - MySQL Slave Running on db59 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:15:04] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Tue Feb 19 22:14:54 UTC 2013 [22:15:14] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [22:15:21] !log sq79/80 puppet in locked state, manually restarted puppet, removed lock file, ran manually [22:15:22] Logged the message, RobH [22:15:57] !log taking down ms-be12 for hardware replacement per: https://rt.wikimedia.org/Ticket/Display.html?id=4546 [22:15:58] Logged the message, Master [22:17:07] !log sq41 not allowing ssh, drac resetting [22:17:08] Logged the message, RobH [22:17:19] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [22:17:21] Apergos: you need me? [22:17:44] ah sorry, no, just sayin that you were powering off bs-be12, yay and thanks [22:17:44] Apergos: just got back down here [22:17:56] *ms-be12 [22:18:04] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 150093 seconds [22:18:19] Apergos: going to setup drac and you'll be all set [22:18:46] Apergos: will ping you when I'm done [22:19:15] ok (I'll be gone by then but I'll see it tomorrow) [22:19:17] RECOVERY - ps1-d2-sdtpa-infeed-load-tower-A-phase-Y on ps1-d2-sdtpa is OK: ps1-d2-sdtpa-infeed-load-tower-A-phase-Y OK - 1538 [22:19:20] New patchset: Andrew Bogott; "Make manage-volumes more aware of project states." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49916 [22:20:31] !log sq41 puppet freshness bad, wont accept ssh connections, serial redirection results in no output. will need to reboot it [22:20:33] Logged the message, RobH [22:21:41] TimStarling: hello [22:21:49] hi [22:24:05] PROBLEM - Host sq41 is DOWN: PING CRITICAL - Packet loss = 100% [22:24:46] ya ya ya nagios we konw. [22:25:00] though its serial output doesnt show it coming back up =/ [22:29:41] !log sq41 offline, rt 4550 [22:29:42] Logged the message, RobH [22:31:01] heads up: scapping [22:41:40] !log preilly synchronized php-1.21wmf9/extensions/ArticleFeedback 'fix MAX call with GREATEST' [22:41:41] Logged the message, Master [22:41:53] preilly: thank you! [22:42:01] dberror log looks quiet now [22:47:20] binasher: np [22:47:27] binasher: sweet [22:49:01] !log maxsem Started syncing Wikimedia installation... 
: Weekly mobile deployment [22:49:02] Logged the message, Master [22:49:08] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [22:59:54] MaxSem: [22:59:55] Feb 19 22:59:00 10.64.0.93 apache2[32400]: PHP Fatal error: require() [function.require]: Failed opening required '/usr/local/apache/common-local/php-1.21wmf9/extensions/MobileFrontend/includes/MFCompatCheck.php' (include_path='/usr/local/apache/common-local/php-1.21wmf9/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/usr/local/apache/common-local/php-1. [22:59:55] 21wmf9:/usr/local/lib/php:/usr/share/php') in /usr/local/apache/common-local/php-1.21wmf9/includes/AutoLoader.php on line 1161 [23:00:17] And they're still coming [23:00:45] grrr [23:00:55] out of sync again? [23:01:25] reedy@fenari:/home/wikipedia/common$ sync-file php-1.21wmf9/extensions/MobileFrontend/includes/MFCompatCheck.php [23:01:25] Could not open input file: /home/wikipedia/common/php-1.21wmf9/extensions/MobileFrontend/includes/MFCompatCheck.php [23:01:25] Aborted due to syntax errors [23:01:41] yup, it was removed [23:02:26] Reedy, doing an out-of-band sync [23:03:17] !log maxsem synchronized php-1.21wmf9/extensions/MobileFrontend/ [23:03:18] Logged the message, Master [23:04:01] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [23:04:02] Logged the message, Master [23:05:01] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [23:05:55] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [23:09:01] MaxSem: solr checks are failing... [23:09:04] PROBLEM - Solr on solr1003 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:09:22] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 197 seconds [23:09:31] PROBLEM - Solr on solr1002 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:09:32] duh [23:09:40] PROBLEM - Solr on vanadium is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:09:49] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 210 seconds [23:10:07] PROBLEM - Solr on solr1001 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:10:07] PROBLEM - Solr on solr3 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:10:43] PROBLEM - Solr on solr1 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:10:52] PROBLEM - Solr on solr2 is CRITICAL: (Return code of 127 is out of bounds - plugin may be missing) [23:12:01] paravoid, the plugin should be there, judging by manifest - any ideas what's wrong? [23:12:55] /usr/bin/env: python2.7: No such file or directory [23:13:12] any particular reason to depend on 2.7? [23:13:33] the XML lib is borked in 2.6 [23:13:50] what does that mean? [23:14:28] originally, the hashbang pointed at generic python [23:14:45] there's no python 2.7 on this system. 
[23:15:09] however, when 2.6 was default, the script ran with errors [23:16:22] PROBLEM - NTP on mw1085 is CRITICAL: NTP CRITICAL: Offset unknown [23:16:25] New patchset: awjrichards; "Set wgMFPhotoUploadWiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49946 [23:17:15] New review: MaxSem; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49946 [23:18:04] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49946 [23:18:24] paravoid, if it's not possible to install 2.7 while keeping 2.6 default to prevent breaking the other stuff, I'll need to rewrite it [23:18:49] spence is lucid, there's no python 2.7 in lucid. [23:19:08] stone age:P [23:19:58] RECOVERY - NTP on mw1085 is OK: NTP OK: Offset -0.0007953643799 secs [23:21:23] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/49946/' [23:21:24] Logged the message, Master [23:21:30] awjr, ^^ [23:25:58] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 14 seconds [23:26:16] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [23:29:20] New patchset: MaxSem; "Don't check average query times for now cause it's broken on Python 2.6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49947 [23:30:27] New patchset: MaxSem; "Don't check average query times for now cause it's broken on Python 2.6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49947 [23:30:36] paravoid, ^ [23:30:41] sec [23:31:26] New patchset: Pyoungmeister; "db-pmtpa.php: commenting out db43 from s6 for dist upgrade/mariadb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49948 [23:36:23] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49948 [23:36:23] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49948 [23:36:57] whee, more mariadb is great [23:38:06] yep! [23:38:59] !log py synchronized wmf-config/db-pmtpa.php 'removing db43 from db-secondary for upgrades and maria' [23:39:00] Logged the message, Master [23:42:57] New patchset: Pyoungmeister; "mariafication of db43" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49950 [23:44:21] New review: Faidon; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49947 [23:44:30] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49947 [23:44:43] PROBLEM - mysqld processes on db43 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:47:11] !log do-release-upgrade-ing on db43 [23:47:13] Logged the message, notpeter [23:51:28] RECOVERY - MySQL disk space on neon is OK: DISK OK
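The Solr check failures at the end of the log came down to a `#!/usr/bin/env python2.7` hashbang on a Lucid host that only has 2.6: env exits with status 127 ("command not found") when the interpreter is missing, which is exactly what the "Return code of 127 is out of bounds" alerts were relaying, and the stop-gap merged above simply drops the average-query-time part of the check. The skeleton below sketches the version-gating idea; it is not the actual plugin (which isn't shown in the log), and the placeholder function, the 100 ms threshold and the output strings are all assumptions.

```python
#!/usr/bin/env python
"""Nagios-style check skeleton that degrades gracefully on older Python.

Not the real plugin from the log: fetch_avg_query_time() is a placeholder
and the 100 ms threshold is invented.  The point is the structure -- use a
hashbang that exists everywhere (so /usr/bin/env never exits 127), and
skip the sub-check that needs Python 2.7 instead of failing the whole run.
"""
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes

def fetch_avg_query_time():
    """Placeholder for the 2.7-only piece (Solr XML stats parsing)."""
    raise NotImplementedError("hypothetical in this sketch")

def main():
    if sys.version_info < (2, 7):
        # Same spirit as the merged stop-gap: skip the part known to
        # misbehave on 2.6 and still report a usable result.
        print("SOLR OK - average query time check skipped on Python %d.%d"
              % sys.version_info[:2])
        return OK
    try:
        avg_ms = fetch_avg_query_time()
    except Exception as err:
        print("SOLR UNKNOWN - %s" % err)
        return UNKNOWN
    if avg_ms > 100:  # invented threshold, for illustration only
        print("SOLR CRITICAL - average query time %.1f ms" % avg_ms)
        return CRITICAL
    print("SOLR OK - average query time %.1f ms" % avg_ms)
    return OK

if __name__ == "__main__":
    sys.exit(main())
```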