[00:00:05] !log mflaschen synchronized wmf-config/extension-list 'Remove E3Experiments and LastModified' [00:00:11] Logged the message, Master [00:01:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:02:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [00:03:36] !log tstarling synchronized wmf-config/ExtensionMessages-1.22wmf1.php [00:03:43] Logged the message, Master [00:06:49] !log all 1.22wmf1 wikis briefly showed error messages due to display_errors being enabled when mergeMessageFileList.php generated ExtensionMessages-1.22wmf1.php. Fixed temporarily. [00:06:56] Logged the message, Master [00:07:14] .... temporarily? [00:08:07] TimStarling: temporarily? That sounds ominous [00:08:47] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 00:08:40 UTC 2013 [00:09:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:09:47] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 00:09:37 UTC 2013 [00:10:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:10:37] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 00:10:32 UTC 2013 [00:11:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:11:27] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 00:11:24 UTC 2013 [00:12:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:12:47] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 00:12:45 UTC 2013 [00:13:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:19:35] !log bsitu Finished syncing Wikimedia installation... 
: Update Echo to master [00:19:42] Logged the message, Master [00:20:29] !log tstarling synchronized wmf-config/ExtensionMessages-1.22wmf1.php [00:20:36] Logged the message, Master [00:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.854 second response time [00:25:07] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [00:29:39] !log on fenari: testing a new version of mw-update-l10n which doesn't use stdout of mergeMessagesFileList.php [00:29:46] Logged the message, Master [00:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [00:32:57] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 00:32:48 UTC 2013 [00:33:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [00:43:27] New patchset: Tim Starling; "Don't use stdout of mergeMessageFileList.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57697 [00:46:01] !log csteipp synchronized php-1.22wmf1/includes/ [00:46:08] Logged the message, Master [00:48:47] !log csteipp synchronized php-1.21wmf12/includes/ [00:48:54] Logged the message, Master [01:06:06] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [01:10:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:11:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [01:32:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:33:06] RECOVERY - Puppet 
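The incident Tim logged above — PHP error output leaking into all 1.22wmf1 wikis because display_errors was on while mergeMessageFileList.php generated ExtensionMessages-1.22wmf1.php — is a classic hazard of building a config file from a script's captured stdout. A hedged sketch of the defensive pattern (validate, then install atomically), in Python for illustration only; the function and file names are not the actual deployment tooling, and Tim's real fix (change 57697) instead stops using the script's stdout entirely:

```python
# Sketch: with display_errors on, any PHP notice printed by the generator
# lands in the same stdout being captured into ExtensionMessages-*.php.
# Validating the text and installing it atomically guards against shipping
# a half-garbage config. All names here are illustrative.
import os
import tempfile

def looks_like_php_config(text):
    # Leaked error output ("Notice: ..." / "Warning: ...") would precede
    # the opening tag, so this single check catches that failure mode.
    return text.lstrip().startswith("<?php")

def install_config(text, dest):
    if not looks_like_php_config(text):
        raise ValueError("refusing to install: output does not start with <?php")
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.replace(tmp, dest)  # atomic rename: readers never see a partial file
```

The "Fixed temporarily" in the !log refers to the quick mitigation; the durable fix is the one reviewed later in this log.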
freshness on xenon is OK: puppet ran at Fri Apr 5 01:32:56 UTC 2013 [01:33:06] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [01:33:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [01:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:53:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.151 second response time [01:57:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:58:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [02:06:29] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [02:12:34] !log LocalisationUpdate completed (1.22wmf1) at Fri Apr 5 02:12:34 UTC 2013 [02:12:43] Logged the message, Master [02:23:11] !log LocalisationUpdate completed (1.21wmf12) at Fri Apr 5 02:23:11 UTC 2013 [02:23:18] Logged the message, Master [02:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.165 second response time [02:57:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:58:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [03:03:49] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [03:21:54] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:31:49] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with 
command name varnishncsa [03:46:12] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:46:12] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:46:12] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:42] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:56:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:57:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [04:00:42] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [04:06:33] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [04:08:23] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 04:08:20 UTC 2013 [04:08:33] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [04:09:23] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 04:09:13 UTC 2013 [04:09:33] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [04:09:53] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 04:09:52 UTC 2013 [04:10:33] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [04:10:33] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 04:10:27 UTC 2013 [04:11:33] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [04:12:20] that....looks wrong [04:16:13] PROBLEM - Squid on brewster is CRITICAL: Connection refused [04:26:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:28:13] RECOVERY - 
Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [04:28:43] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [04:30:33] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [04:32:43] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [04:34:43] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 04:34:35 UTC 2013 [06:05:21] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [06:10:31] RECOVERY - Squid on brewster is OK: TCP OK - 0.027 second response time on port 8080 [06:11:21] PROBLEM - Puppet freshness on db1042 is CRITICAL: No successful Puppet run in the last 10 hours [06:32:51] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 06:32:47 UTC 2013 [06:33:21] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [07:05:36] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [07:32:02] New review: Siebrand; "Consider updating the commit summary to the new convention." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57498 [07:37:03] New patchset: Faidon; "Puppetize supervisor configs for EventLogging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56692 [07:37:40] siebrand: what's the new convention? [07:38:59] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56692 [07:41:50] New patchset: Legoktm; "(bug 46840) Update TTMServer Solr schema" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57498 [07:41:54] paravoid: ^ [07:42:59] oh [07:43:03] it needs a space? 
[07:43:06] New patchset: Faidon; "Fix eventlogging.conf template filename" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57716 [07:43:20] New patchset: Legoktm; "(bug 46840) Update TTMServer Solr schema" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57498 [07:43:20] that doesn't sound very DRY to me [07:43:26] mentioning the bug twice [07:43:28] dry? [07:43:33] i mean [07:43:37] you can remove it from the first line [07:43:51] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57716 [07:44:06] I have no idea what the new convention is :) [07:44:31] fight the system! [07:54:58] ori-l: supervisor + two fixups merged; I'd really like to see the template having more stuff parameterized, like not hardcoding db1047 & vanadium. [08:05:26] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [08:08:06] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 08:08:04 UTC 2013 [08:08:26] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [08:08:36] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 08:08:32 UTC 2013 [08:09:26] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [08:10:45] New review: Aaron Schulz; "(1 comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57697 [08:12:56] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [08:22:52] New patchset: Aaron Schulz; "Lowered fork count to 1 due to the procs being lowered and also not an even number now." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57717 [08:28:45] hashar: hey [08:29:08] hashar: I see a bunch of deb reviews for me, but I'm not sure if you actually want a review or are there for you and ottomata to work on [08:30:07] * Aaron|home gives 57717 to paravoid [08:32:56] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 08:32:48 UTC 2013 [08:33:26] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [08:38:27] paravoid: hey sorry lagging. [08:41:46] New review: Faidon; "(5 comments)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50452 [08:41:59] paravoid: so for python-statsd, I think we got everything covered. I had to update the upstream branch which had an invalid import [08:42:04] (list https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/debs/python-statsd,n,z ) [08:42:34] and for python-voluptuous two changes to support git build-package :-] [08:43:04] New review: Faidon; "Could you be a little more verbose on why this is needed? I can just trust you and merge it but I'm ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57717 [08:43:07] Aaron|home: ^ [08:43:32] using 2 means there are only up to 6 procs instead of the nominal 7 [08:44:45] it also means that there can be only 3 for ~2 min or so worst case (due to a sibling proc finishing but one sibling still going on, both children from the fork) [08:45:29] previously we had 12 nominal procs total, so the lowest at any time was roughly ~6 (except when there are no jobs of course) [08:47:49] !log Jenkins no more receiving ANY events from Gerrit. Effect: no jobs are triggered. [08:47:56] Logged the message, Master [08:49:33] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57717 [08:52:17] New review: Faidon; "No, you shouldn't rely on files/ from a module. You could put it in modules/geoip/files, but I don't..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714 [08:56:10] paravoid: that's excellent, thanks a lot! yeah, i'm on a bit of a puppet kick lately with the vagrant stuff so i'm motivated to improve the eventlogging module too [08:58:25] !log for Jenkins / Gerrit issue see {{bug|46917}} [08:58:32] Logged the message, Master [09:04:01] hashar: hm, yes, nothing on stream-events for me either [09:04:07] i updated a commit just to see [09:04:16] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [09:04:18] I am not sure what happened in Gerrit :/ [09:04:25] I can't even find the Gerrit git repository [09:04:54] it used to be operations/gerrit.git [09:05:44] ssh w/'gerrit query' returns results [09:09:46] so turns out Gerrit has been updated yesterday https://gerrit.wikimedia.org/r/#/c/57508/ which probably introduced a regression [09:12:07] yeah, i'm already downloading the war files to diff :P [09:12:30] at a glance, nothing in the gerrit bug tracker looks relevant [09:13:31] we updated to gerrit to 2.6-rc0-144-gb1dadd2 but I can't find that sha1 in the repo :( [09:14:21] must be in some local repo :( [09:21:34] speaking of gerrit, can someone review https://gerrit.wikimedia.org/r/57647 please? [09:27:46] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [09:32:38] !log restarted Zuul [09:32:46] Logged the message, Master [09:35:35] apergos: paravoid: seems Gerrit is screwed somehow. It is no more emitting events to Zuul since 4am (at least) [09:35:44] uh oh [09:35:49] apergos: paravoid: could one of you please restart gerrit on manganese? 
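The wedged event stream hashar is chasing here can be probed mechanically: open the stream (normally something like `ssh -p 29418 <user>@gerrit.wikimedia.org gerrit stream-events`; host and user are placeholders) and check that at least one event arrives within a timeout. A hedged Python sketch of such a liveness probe — on a site this busy, a healthy stream emits JSON within seconds, so a long silence suggests exactly the wedge seen above:

```python
# Sketch of a Gerrit stream-events liveness probe. `cmd` is the full
# command to run (e.g. the ssh invocation above); this is an illustration,
# not the monitoring that was actually in place.
import json
import select
import subprocess

def probe_stream(cmd, timeout=60.0):
    """Return the first JSON event emitted by `cmd`, or None on timeout."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        # Wait until the pipe has data or the deadline passes (POSIX only).
        ready, _, _ = select.select([proc.stdout], [], [], timeout)
        if not ready:
            return None
        line = proc.stdout.readline()
        return json.loads(line) if line.strip() else None
    finally:
        proc.kill()
        proc.wait()
```

Had something like this been alerting, the stream going silent at ~04:00 would have paged instead of being noticed hours later.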
[09:35:54] sec [09:36:49] I think ^demon upgraded it yesterday [09:36:55] ah you already mentioned that [09:37:02] it's in the middle of restarting [09:37:09] danke [09:37:11] ori-l: yeah, so we have another piece of infrastructure that uses supervisord [09:37:16] ori-l: that's vumi, the mobile SMS stuff [09:37:23] I suspect we have some kind of regression in Gerrit :-( [09:37:45] done [09:38:07] apergos: that worked! :-] [09:38:16] oh yay [09:38:28] when in doubt, reboot it :-P [09:38:41] well, you mentioned ~500 errors in the zuul log [09:38:52] !log Ariel kindly restarted gerrit on manganese which unbroken the stream-events process in Zuul. It again receives events. [09:38:54] i wonder if it kept choking on some input and gerrit ran out of connections [09:38:59] Logged the message, Master [09:39:08] ori-l: yeah gerrit gives out 500 when it restarts [09:39:12] anything else you need while I'm on that box? [09:39:25] apergos: no thanks :-] [09:39:26] apergos: +2 on operations/puppet plz [09:39:35] I guess ^demon will look at it whenever he connects [09:39:37] ..or not [09:39:44] ori-l: can't do that from manganese :-P [09:39:59] hashar: oh, i misunderstood, you meant HTTP 500 [09:40:04] i thought you meant 'five hundred errors' [09:40:24] ah sorry :-] [10:05:57] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [10:09:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:09:41] New patchset: ArielGlenn; "update location of interwiki.cdb link in noc" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57725 [10:10:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [10:13:07] PROBLEM - Puppet freshness on db1043 is CRITICAL: No successful Puppet run in the last 10 hours [10:13:43] New review: JanZerebecki; "Yes for the rule below that comment it is correct as it 
contains %-escaped characters:" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [10:20:59] New review: JanZerebecki; "Sorry that reply was too quickly written. It should read: AFAIK without characters in the rule that ..." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [10:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [10:26:07] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [10:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:31:34] New patchset: Aude; "(bug 45005) Redirect wikidata.org to www.wikidata.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [10:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [10:35:09] New review: Aude; "In the patch, wikidata main.conf is more consistent with other domains like wikivoyage and handles r..." 
[operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [11:04:25] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [11:06:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:07:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [11:07:58] New patchset: Aude; "(bug 45005) Redirect wikidata.org to www.wikidata.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [11:20:18] New patchset: Aude; "(bug 45005) Redirect wikidata.org to www.wikidata.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [11:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [11:30:25] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [11:31:25] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [11:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:53:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [11:56:21] <^demon> apergos: Thanks for restarting gerrit. I'm not entirely sure what was broken here though. [11:56:47] <^demon> Everything seems to WFM now, and I'm not seeing any errors in the log from before the restart that look suspicious. 
[11:56:50] that's a fine question, and I don't know either [11:57:21] <^demon> There's a minor bug with the BZ integration, but that's unrelated (and happening both before & after the restart) [11:57:29] he said zuul was not emitting events from 4 am on or maybe earlier [11:57:50] is it possible that something rotated without restart? (I'm assuming that's 4am utc but don't know) [11:58:55] <^demon> From that timeframe, all I've got is replication failures for one repo (known, also unrelated to stream-events) [11:59:33] weird [11:59:48] well when he's back around maybe you can get a bit more info [12:00:20] in the meantime.... [12:00:25] <^demon> I'm supposed to be leaving for a 3 day weekend soon... [12:00:40] care to sign off on an update on a link for the download of the interwiki cdb file? [12:00:41] :-P [12:00:43] https://gerrit.wikimedia.org/r/#/c/57725/ [12:00:48] oh sounds nice [12:00:51] <^demon> This *is* serious if it's happening, but it's *working* now [12:01:03] it is working and he thought that it was fixed now and done [12:02:16] <^demon> Hmm, well bug still being open and set to High/Major was kind of worrying when I woke up :) [12:03:12] mhm, does anyone know why we aren't using multithreaded localisation updates? are they broken? [12:03:48] ah, I didn't know it had been left open. hm [12:04:59] New review: Demon; "Doesn't that create an nfs dependency since we're linking to /h/w/c/? How about linking to the versi..."
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57725 [12:05:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:08:26] we were already linking to it, some other files in there do the same thing [12:08:37] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 12:08:33 UTC 2013 [12:09:00] it's only for fenari [12:09:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:09:37] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 12:09:31 UTC 2013 [12:10:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:10:27] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 12:10:20 UTC 2013 [12:10:32] dbtree and pybal are both such symlinks (/home/wikipedia/ stuff) in /home/wikipedia/common/docroot/noc ^demon [12:11:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:11:47] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 12:11:40 UTC 2013 [12:12:14] <^demon> I figured some stuff still did, but no point in continuing the trend ;-) [12:12:23] ok sure [12:12:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:12:46] <^demon> Makes it more annoying if we want to setup a noc.wm.o replacement in eqiad without nfs. [12:12:49] <^demon> :) [12:13:00] it will be :-D [12:13:53] <^demon> Ok, I should finish packing. I'll be in and out for a bit.
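The review concern above — symlinks in docroot/noc (interwiki.cdb, dbtree, pybal) resolving into /home/wikipedia/ and thus dragging an NFS dependency into a would-be eqiad noc.wm.o — can be audited mechanically. A hedged Python sketch; the path prefix comes from the conversation, and the helper itself is illustrative, not an existing tool:

```python
# Sketch: list symlinks in a docroot whose targets resolve into the NFS
# mount, i.e. the ones that would break on a host without that mount.
import os

def nfs_dependent_links(docroot, nfs_prefix="/home/wikipedia"):
    """Return (name, resolved_target) for symlinks pointing into the NFS mount."""
    hits = []
    for name in sorted(os.listdir(docroot)):
        path = os.path.join(docroot, name)
        if os.path.islink(path):
            target = os.path.realpath(path)  # resolves even dangling links
            if target.startswith(nfs_prefix + "/"):
                hits.append((name, target))
    return hits
```

^demon's suggested alternative — linking to the version-controlled copy instead — would make this audit come back empty.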
[12:14:00] <^demon> If hashar returns, ping loudly :) [12:16:07] New patchset: ArielGlenn; "update location of interwiki.cdb link in noc" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57725 [12:16:13] q [12:16:22] oops :) [12:17:05] there is a new patchset for the next time you get sick of packing :-) [12:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [12:25:00] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57725 [12:32:57] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 12:32:48 UTC 2013 [12:33:27] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [12:37:20] <^demon> apergos: /msg? [12:39:40] <^demon> Nevermind, actually. [12:40:59] eh? [12:41:25] and yeah whenever you like (sorry was trying to get past this bug before I forget [12:41:26] ) [12:52:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [13:04:23] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [13:07:06] Hey, is there a usable misc server in eqiad I can use to hammer on my NFS server setup? [13:22:24] there are unused labs boxes in eqiad aren't there [13:31:48] New patchset: Mark Bergsma; "Add cp3006 to the pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57739 [13:32:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57739 [13:33:37] New review: Hashar; "Seems like a good workaround. You might also want to report the issue upstream if you have any cont..."
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57426 [13:40:09] mark: I dunno, hence my asking. :-) [13:40:41] mark: Unless you're talking about the labstore boxen -- in which case those are the one I want to test. :-) [13:41:09] And testing NFS on the server kinda defeats the purpose. [13:42:14] there must be more [13:42:20] afaik there is a complete labs cluster in eqiad [13:43:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [13:45:39] mark: Oh! I need to speak to Ryan about that since pmtpa is obsolescent. [13:45:52] yes [13:46:14] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:46:14] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [13:46:14] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:46:23] New patchset: Hashar; "package-builder learned 'cowbuilder'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382 [13:47:06] New review: Hashar; "PS7 now refers to pbuilder instead of builder :-} Would eventually get it a module later on, mayb..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382 [14:04:44] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [14:06:24] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:15:14] New patchset: Mark Bergsma; "Allow dysprosium to be installed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57744 [14:16:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57744 [14:30:44] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [14:31:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [14:32:24] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [14:38:51] hi mark, have you made your mind on the mobile resources issue? what do we need to proceed? [14:53:49] New review: JanZerebecki; "This does not fix all the problems mentioned in my comment from 2013-04-02 22:06." [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/49069 [15:05:36] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [15:21:04] New review: Aude; "without the .org in ServerName is consistent with how most of the other domains are done, like wikiv..." 
[operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [15:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time [15:29:35] New review: Aude; "I am curious why some vhosts are in www.wikipedia.conf and some in main.conf" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [15:30:06] New patchset: Mark Bergsma; "Add configuration for dysprosium (varnish performance test box)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57749 [15:32:24] HELLO JENKINS [15:32:43] hashar mentioned a problem after yesterday's gerrit upgrade [15:32:46] see ops@ [15:33:01] apergos restarted gerrit before and it was fixed, maybe it's broken again [15:33:38] i have a suspicion that puppet won't like my change [15:33:55] oh well [15:34:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57749 [15:36:05] * Damianz watches the world explode [15:36:55] mark: I would put it in a variable if I were you [15:37:07] what? 
[15:37:24] PROBLEM - RAID on dysprosium is CRITICAL: NRPE: Command check_raid not defined [15:37:39] turns out that it does work [15:37:44] ok [15:37:45] good to know [15:37:54] PROBLEM - DPKG on dysprosium is CRITICAL: Timeout while attempting connection [15:38:04] PROBLEM - Disk space on dysprosium is CRITICAL: Timeout while attempting connection [15:39:54] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100% [15:40:24] RECOVERY - RAID on dysprosium is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:40:34] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [15:40:54] RECOVERY - DPKG on dysprosium is OK: All packages OK [15:41:04] RECOVERY - Disk space on dysprosium is OK: DISK OK [16:01:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:02:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [16:05:46] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:08:46] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 16:08:44 UTC 2013 [16:09:46] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:09:56] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 16:09:51 UTC 2013 [16:10:46] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:10:56] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 16:10:52 UTC 2013 [16:11:46] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:11:56] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 16:11:46 UTC 2013 [16:12:06] PROBLEM - Puppet freshness on db1042 is CRITICAL: No successful Puppet run in the last 10 hours [16:12:46] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in 
the last 10 hours [16:13:16] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 16:13:15 UTC 2013 [16:13:46] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:13:46] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 16:13:45 UTC 2013 [16:14:46] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:20:26] so, bikeshedding discussion on where to place a new project in gerrit. it's a varnish vmod. obvious options seem to be things like ops/software/foo, ops/varnish/foo, ops/foo, etc? [16:20:47] thoughts? do whatever makes me happy? [16:22:20] my suggestion would operations/varnish/vmod-foo or operations/vmod-foo [16:22:20] bblack: my varnish fu is kinda weak but i'm thinking maybe we already have one? [16:22:22] mark: you might care [16:22:25] for geoip? [16:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:40] and so you can copy where that lives :) [16:22:51] ideally operations/debs/varnish-vod-foo ;-) [16:22:56] vmod [16:22:58] I voted against operations/debs/ [16:23:01] we have a deb for a modified varnish, but I don't think we have a vmod separately [16:23:03] why? [16:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.141 second response time [16:23:16] because it's the source of a project, a deb is another layer outside of that [16:23:25] it's been used before but I'd prefer using that namespace for debianization of existing sources, not for projects we are upstream [16:23:40] so then we need two repos? [16:23:54] no, we can have a debian branch in the same repo [16:24:07] just for the debian/ subdir, I guess? 
[16:24:08] but operations/debs/ is a bit misleading I think [16:24:09] then operations/software/varnish/vmod-foo [16:24:17] i don't think it's misleading, but whatever [16:24:26] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:24:50] works for me [16:25:14] ok :) [16:25:48] while we're on this, so the vmod maps client ip, or client networks, to headers. I suck at creative naming. I had vmod_ciphdrs . maybe vmod_nethdrs? [16:26:07] or something that's not so ugly [16:26:11] to headers? [16:26:19] are you going to set the header from the vmod? [16:26:27] I was planning to, yes :) [16:27:00] the interface for passing data back and forth to VCL sucks anyways, but whether the header is set from vmod or passed back and set from VCL is relatively minor point [16:27:39] would you rather the vmod just returned an array of strings and the VCL user of the vmod decides what to do with them? [16:27:49] I was thinking something like set req.http.X-CS = carrier.lookup(client.ip) [16:27:59] well, but they want to set multiple headers [16:28:14] the default language one? [16:28:22] X-CS and X-Carrier [16:28:38] why would we set X-Carrier? [16:28:45] this the old one, before X-CS existed [16:29:03] I got a page but icinga-wm is quiet [16:29:04] I figured they were using that so a human could tell what was going on when debugging, because they make more sense than the raw MCC-MNC [16:29:12] mark: it's upload, that you? [16:29:26] that's probably me [16:30:22] paravoid: it would make life simpler if the generic idea of the module was client-ip mapped to a string. It would get messy if they later do want more than 1 header, though. 
[16:30:58] there's no real composite return types, we'd have to stuff them in the string with in-band delimiters or redo the VMOD's API a different way [16:31:15] yeah, we briefly discussed about that on the list [16:31:19] very briefly [16:31:26] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [16:31:33] this was when they asked for x-dfltlang, but we decided to rip all that logic out of varnish [16:32:05] whatever you decide, I have no strong opinion [16:32:14] well, in theory the MCC-MNC is the only thing then, right? anything else would be dependent on that and could be looked up elsewhere? [16:32:31] I think so [16:32:55] I'll ask them when I ask all these other questions later, then [16:33:02] andrewbogott: i wonder what to do about 4853. i have 2 potential guinea pigs to test if greg-g is special in some way. (other people that are brand new and don't yet have wmf group) [16:33:03] yurik might be around :) [16:33:36] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 16:33:33 UTC 2013 [16:33:43] yurik is one of my guinea pigs [16:33:46] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [16:33:48] if the vmod can set headers, the logic is all gone from vcl [16:34:19] anyways, my current test code is set up such that the JSON associates an array of multiple headers with an array of multiple networks, directly in the vmod. [16:34:40] both alternatives are different kinds of generically-powerful (vs just returning a string and dealing with it in VCL) [16:36:07] New patchset: Mark Bergsma; "Add IPv6 address to dysprosium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57751 [16:36:22] yes [16:36:24] jeremyb_ I haven't responded to that ticket because I don't have good ideas how to troubleshoot. If you have specific suggestions I'm happy to help. 
[16:37:38] andrewbogott: I'm here and available to do things as needed :) [16:37:39] andrewbogott: i do. apache error log for one. but also add yurik to wmf and get him to try the same hosts [16:37:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57751 [16:38:17] andrewbogott: unfortunately my LDAP+apache troubleshooting days are like 4 years ago so i don't remember too much [16:38:20] mark: the sort-of downside to headers in the VMOD, though, is it's just powerful enough that eventually people would want more out of it (like, setting all the different types of headers (req, backend, etc), and options for different handling of overlapping networks with different sets of headers, etc..). I can already see that I just have to limit it to what we need and that others will eventually say "Why can't it also do X?". If the vmod j [16:38:21] maps to a string, it's simple and stable and people can mess with the other logic at the VCL layer where it's easy. [16:38:41] > X?". If the vmod j [16:38:52] you were cut off [16:39:04] this client should know better :P [16:39:27] If the vmod just maps to a string, it's simple and stable and people can mess with the other logic at the VCL layer where it's easy. [16:39:34] that's true [16:39:42] of course the vmod doesn't need to care which headers it operates on [16:39:48] exactly [16:39:54] name of headers is configuration [16:40:07] that's my main objection [16:40:11] sure [16:40:14] can perhaps pass req.http or beresp.http or whatever as a param? [16:40:47] the name of the headers, and which of the 4-5 types of headers you mean, and whether you're setting the same 3 headers for each set of networks you match, etc…. is all going to be config and complexity for the vmod, eventually with the vmod setting headers directly. [16:40:59] mark: can pass an equivalent indicator somehow, for sure. [16:41:07] greg-g, jeremyb_, so the issue is logging into web interfaces, not shell access? 
[16:41:44] andrewbogott: haven't tried shell, actually [16:42:28] ok. I turn out to be too hungry to function… will y'all still be here in an hour? [16:43:17] i'll be on and off [16:43:23] andrewbogott: i see no sign of greg in admins.pp [16:43:25] bblack: passing back a string turns into a long list of string comparisons though [16:43:38] so therefore shell shouldn't work anyway [16:43:39] in VCL [16:43:41] mark: what for? [16:43:45] ok, back in a bt [16:43:46] bit [16:43:53] andrewbogott: i also should eat. bon appetit :) [16:44:16] I think mobile agreed that they won't do all that logic anymore in VCL but they'll do it on the backend and vary instead [16:44:32] so all that will be gone and we'll just have set req.http.x-cs =, hopefully [16:44:36] mark: the returned string can be "001-01" for an X-CS value. Or if they want two headers we could be ugly but simple and "001-01|WikiTelecom Inc" and give them two values [16:45:24] yeah that's possible [16:45:45] for context, pastebin of two snips in the current play code: the JSON file that associates multiple headers, and the snip of code somewhere in the middle of the VMOD that sets headers directly: http://pastebin.com/VafUR04F [16:45:50] WikiTelecom Inc! [16:46:10] but if they need to vary on something, we need the varied header available in varnish at request time [16:46:37] so if they want to pass extra stuff in the request, that becomes harder if the vmod returns just a single string [16:47:39] What would be really ideal would be the ability to return an array of strings from the VMOD call, but there doesn't seem to be a really clean way to get that via VCL [16:47:50] indeed [16:48:01] but returning a pipe-delimited string isn't bad, it's not hard to split it on the VCL side into a set of values [16:48:29] is there even such a thing as array in vcl? 
[16:48:58] https://www.varnish-cache.org/docs/trunk/reference/vmod.html#vcl-and-c-data-types [16:49:38] there's STRING_LIST for example, but it's just a macro for a list of char*'s [16:50:01] so it's good as an argument (passing several strings into the VMOD call), but it can't be a retval, and the strings are all const so we can't modify the arguments either :P [16:50:48] the private pointer thing can pass anything back and forth in theory. I'm not sure how ugly it is on the VCL side to extract it though, probably inline C code? [16:50:50] heh [16:50:54] the vmod could offer both of course [16:50:58] but yeah [16:51:02] then it also has all drawbacks [16:51:19] i think a string would work now, we can make it do headers later if really needed [16:57:03] oh so I still have to name this module. vmod_netmapper ? [16:57:20] seems generic enough to not preclude any design changes that might happen :) [16:59:49] mark, have you made up your mind on the mobile resources issue? what do we need to proceed? [17:00:00] MaxSem: what issue specifically? [17:01:01] from where to serve, is anything else needed in addition to the patchsets we committed [17:01:49] I really hope that we will have everything MW-side ready next week [17:02:58] so the question is when we can start experimenting with it (resource variance can be enabled per-wiki, so testwiki sounds like a place to start with) [17:03:34] next week sounds fine to me [17:03:37] I'll take care of the varnish work [17:03:53] greg-g, can you try to log into graphite while I watch the log? 
[17:03:57] i said on the list that moving device detection, however much I hate it, to bits is probably best [17:04:28] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [17:05:14] andrewbogott: yeah, let me know when you're ready (don't know how busy the log is) [17:05:50] ready [17:06:11] just tried [17:06:21] user gjg [17:06:27] OK, now can you try again but use your full name, 'Greg Grossmeier'? [17:06:37] New review: JanZerebecki; "wikivoyage has two vhosts (the other is named wikivoyage.org) because each of them has a different d..." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [17:06:49] aha! [17:07:00] oh, that worked? [17:07:02] translates to "I'm in" [17:07:08] i don't believe it [17:07:14] * andrewbogott puts on sunglasses [17:07:57] jeremyb_, can you close out that ticket while I eat this pancake? [17:08:05] andrewbogott: hehe [17:08:08] http://www.youtube.com/watch?v=6YMPAH67f4o [17:08:27] > ldaps://virt0.wikimedia.org virt1000.wikimedia.org/ou=people,dc=wikimedia,dc=org?cn [17:09:11] i distinctly remember looking at that and seeing "?cn" and thinking that meant to use the cn value as username and also at some point looking at ldaplist -l passwd gjg [17:09:19] maybe they just weren't close enough in time to each other [17:09:21] :-( [17:09:56] shell name vs. username is a source of constant confusion [17:10:03] wee [17:10:24] we should change the realm to state that (or whatever the name of the msg that pops up is) [17:10:38] WISHLIST [17:12:47] grrrr [17:12:57] one of the files is named .org and one is .org.erb ?? [17:14:01] paravoid, jeremyb_ what evil plans have you been conceiving in the darkness of ops channel? [17:14:33] yurik: nvm, fixed [17:15:08] hehe, i like those [17:15:49] jeremyb_, not sure what you meant by adding me to wmf... i thought i was :) [17:16:39] yurik: well you weren't an hour ago.
maybe someone added you [17:17:02] for that matter bblack is also not wmf (but is ops) [17:17:32] ?? jeremyb_ , i'm a bit confused - what is wmf? i have been employed by foundation since mid march [17:17:46] New patchset: Ori.livneh; "Move gerrit-wm from #mediawiki to #mediawiki-feed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [17:18:22] seems like WMF should be an attribute of some kind rather than a part of the username [17:18:29] on the wiki side I mean [17:19:03] bblack, at first i thought i had to add it, but later i saw a doc that said that if i don't contribute much as myself, i can keep the nick [17:19:33] that's why my accounts are all (wmf) [17:19:34] New review: Ori.livneh; "Should be moved with wikibugs. Change for that still a TODO." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/57752 [17:19:59] yeah I have BBlack (WMF), but I don't plan to use it for any "normal" wp editing or anything [17:20:08] just as a common wikiname for me everywhere for anything official [17:21:17] sorry, i meant thats why all my accounts are not (wmf), but they have a disclaimer at the top [17:21:26] with a nice userbox [17:23:11] so, jeremyb_ is referring to an LDAP group called "wmf" which apparently neither of you are a part of yet. 
I was added to it last week to gain access to things like graphite [17:23:25] yurik: bblack ^^ [17:23:59] $ for i in yurik bblack; do groups "$i"; done [17:23:59] yurik : svn project-bastion project-editor-engagement project-deployment-prep project-mobile project-mediawiki-api [17:24:02] bblack : wikidev ops project-bastion project-varnish [17:24:06] greg-g++ [17:24:49] greg-g, gotcha, but strange, because i thought i looked at graphite a while ago [17:25:13] then something else is giving you the perms, I guess [17:25:18] anyway, i saw there was an interesting discussion re IP mapping to ID [17:25:20] I don't have as many groups as ya'll [17:25:37] * yurik is the group hog [17:25:43] apparently :) [17:26:02] there's 2 possible explanations [17:26:06] New patchset: Aude; "(bug 45005) Redirect wikidata.org to www.wikidata.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [17:26:13] there always is [17:26:20] 1) you looked at graphite a while ago and not recently and graphite was more open then (it was) [17:26:24] jeremyb_: would you like to review my patch? :) [17:26:32] or 2) you're talking about gdash not graphite [17:26:42] jeremyb_, link? [17:26:52] to which?
[17:26:55] https://graphite.wikimedia.org/dashboard/ [17:26:57] gdash.wm.o [17:27:01] graphite.wm.o [17:27:11] easily guessed ;) [17:27:32] or you could say well named :) [17:27:44] New patchset: Ori.livneh; "Move bots from #mediawiki to #mediawiki-feed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [17:27:49] jeremyb_: though, I would love g-.wm.o [17:28:02] jeremyb_, i get a password popup that I have used before (remembered pswrd), but it doesn't take it [17:28:15] so i guess #1 [17:28:25] New review: Aude; "okay, using ServerName www.wikidata.org makes sense here :)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [17:28:40] will need it i strong suspect [17:28:44] strongly [17:28:59] yodish speak i [17:33:06] yurik: well... i did say you're not in wmf :) [17:33:12] so therefore no access [17:33:57] can i have that flag? pretty please? i will behave? [17:34:15] bblack, re X-CS, we will need just one value [17:34:26] have you seen my RFC on that? [17:34:42] http://www.mediawiki.org/wiki/Requests_for_comment/Zero_Architecture ? [17:34:51] greg-g: btw, my names are all the same (except maybe case difference). and apache never used to mind lowercase. so never a problem for me :) [17:35:06] at first i thought i need multiple values, but everything other than carrier detection could be done on the php side [17:35:08] ype [17:35:10] yep [17:35:19] ok [17:35:44] yurik: so the other bit I need to discuss with you is file format and how we get the data to the vmod [17:36:24] you don't like the one i proposed in the RT ticket? [17:36:27] would it be ok for your tools to generate a JSON of just MCC-MNC => [list of networks], and then do you want your tools to serve that at some canonical URL for the vmod to pick up from or ?
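Assuming the MCC-MNC => [list of networks] shape proposed here, the generated file might look like the following. This is strictly illustrative: the carrier IDs and networks are placeholders (documentation ranges), and a strict JSON parser requires double-quoted keys and strings.

```json
{
  "001-01": [ "192.0.2.0/24", "2001:DB8::/32" ],
  "001-02": [ "198.51.100.0/24" ]
}
```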
[17:37:32] tbh I've only glanced at the RFC, let me read it more for details [17:37:35] bblack, sure, i could whip out an api module to do something like that, so that you can do a call to meta [17:37:56] i thought it would be much easier for C code to just read a simple text file, one line per IP [17:37:59] New patchset: Jeremyb; "tell users which username to use for LDAP auth" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57754 [17:38:04] greg-g: ^ [17:39:32] bblack, i mostly described the varnish logic in RT ticket, referenced from the bottom of RFC [17:40:35] yurik: wow, you talk about varnish and RFC and i think IETF [17:41:05] aude: idk if i can get to it today :( [17:41:13] yurik: I just like standard formats, I can load a json parser library and be done with it. I guess relative to your RFC what I'm asking about, at the bottom, is instead of having your semi-automated script iterate the ZeroConfig pages and extract to a file that's "uploaded to the varnish servers", have it generate a single JSON file that looks like { '001-01': [ '192.0.2.0/24', '2001:DB8::/32' ], '001-02': …. } and serve it on a canonical URL [17:41:14] with an appropriate Last-Modified header, and the VMOD can pick it up from there. [17:41:53] jeremyb_: ok [17:42:58] bblack, it would be absolutely not an issue to create, probably even easier than a cron job, but are you sure you want to pull that complexity into varnish??? [17:43:23] varnish should be as simple as possible, with as few libs etc as we can [17:43:26] imho [17:43:27] :) [17:43:48] but totally up to you :) [17:44:13] doing a cron job is much harder for me, that's for sure, but i could generate any format you like [17:44:23] yurik: the HTTP fetch is a little annoying due to libs dependency, yes, but it's just a shared lib.
the alternative is we put the logic elsewhere to actually push a file to every varnish server, and that logic then has to know the list of varnish servers and auth to send a file to them, etc... [17:44:56] bblack, shared NFS? [17:45:05] but yeah, HTTP could work too [17:45:25] * bblack hates NFS [17:45:57] i can just see varnish server deciding to get the list, hits canonical URL, which is also varnish server, which needs to download IP map, and ... etc etc etc [17:46:12] anyways, there's a lot of work I can do before the concern of how the data is sent/fetched is a blocker, we can come back to this later [17:46:30] and yes, I was assuming the canonical URL would not involve recursing back into varnish again :) [17:47:14] but in short, sure, i can very easily generate a full list of stuff in either text-only or JSON format. By a very strange twist of architecture, json is even easier :) [17:47:44] for now I'll just have my test code load JSON from a disk file and see if I can get all the other bits working, then come back to this part [17:47:47] if then there is a cron job that pulls that list to each server, or some other magic [17:48:17] hmmm that's not a bad idea either, just put the HTTP fetch in an external cronjob on the varnish instances [17:48:22] bblack, in reality you could just have a local cron that saves that file.
This way there is always a backup [17:48:27] exactly [17:48:31] and make the canonical URL something internal and private that doesn't go through varnish [17:48:50] sure, but doesn't matter really - unless it gets cached by varnish :) [17:48:57] (don't ask me what that means, I know nothing yet) [17:50:12] btw, bblack , https://office.wikimedia.org/wiki/Contact_list wink wink [17:50:30] it has a nice template ;) [17:51:43] bblack, but in reality, i don't think you even want to do all that -- i think it would make more sense not to load the file into varnish, but reuse the geoIP code [17:52:08] yurik: unfortunately re-using geoip isn't a realistic option, it has limitations [17:52:19] and the cron job would simply pull the data from URI and use debian's db generator [17:52:23] like what? [17:52:31] i thought it was a perfect mapping [17:52:53] robh: can you check to make sure port is enabled and for rdb1 and put in private vlan...i don't think i have access to tampa switches [17:52:57] asw-a4-sdtpa port 21 [17:52:58] i mean yes, it would call carried ID - "country" internally, but that doesn't matter, right ? :) [17:53:12] robh: fyi rdb2 was fine [17:53:14] debian's db generator only does country-format, and the binary country-format has a hard data type limit of 256 distinct "countries", which would be MCC-MNC for us. There's currently 1,6xx or so MCC-MNC on en.wp's list of them... [17:53:39] it would work for now, but someday we'd randomly hit the limit and have to change everything [17:53:48] bleh bleh bleh [17:54:09] is that a limitation of geoip db structure or the generator's shortcoming? [17:54:19] the geoip country data structure [17:54:34] geoip city is more flexible, but debian's existing tools don't know how to generate that [17:54:47] hey robh, is there anything you need for the ssl certificate for metrics.wikimedia.org? 
(beyond what's in the ticket https://rt.wikimedia.org/Ticket/Display.html?id=4473) [17:54:49] but i don't think the city's structure is that different, is it? [17:54:58] it is [17:55:31] its just that there is a lot of code that would have to be totally reimplemented just for this otherwise, and tested, and performance optimized [17:55:47] but if we can just generate a proper DB, we can reuse all that [17:56:11] yurik: don't worry about that part :) [17:56:32] heh, true that, i'm just not a big fan of "not made here" ;) [17:57:01] you will never see a faster MCC-MNC lookup than what this vmod ends up doing, probably. that's what I do :) it's easier to do it right and fast than it is to go hack up other existing stuff to make them work for this. [17:57:13] :) [17:57:22] make it a very generic looker uper [17:57:35] so that everyone can do IP magic :) [17:57:49] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:57:51] geoip will still our code :) [17:57:54] it's a client-ip-to-string mapper, according to the last discussion. so yes, others can use it as long as that meets their needs [17:58:06] steal* [17:58:19] sounds good, thanks!~~ [17:58:31] i need a new keyboard. or new fingers. [17:59:12] blame the ether-net. the tokens got corrupted when they fell off the ring. [17:59:24] New review: Jeremyb; "Actually I'd prefer to also change it to use uid (shell name) instead. But that maybe requires more ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57754 [17:59:31] the pipes are clogged up [18:00:50] ok, i am off to write the client extension. lots of moving parts there [18:01:39] jeremyb_, so can i look at grahite now? [18:01:49] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [18:02:10] yurik: you're not in the group AFAICT [18:02:24] yurik: you'd normally ask andrewbogott to do that... 
looks like he's back too [18:02:48] (because he's on RT duty in the /topic. not always him) [18:02:51] jeremyb_, that's just group 'wmf'? [18:02:57] andrewbogott: yah [18:03:06] ok… [18:03:08] drdee: I'm waiting on legal now to sign off on the startssl paperwork [18:03:09] sigh, too many security moving parts, something is bound to get through the cracks :( [18:03:12] until they do, i cannot issue a cert [18:03:19] ok, good to know [18:03:19] i have been bugging them every 48 hours. [18:03:21] ty [18:03:22] =[ [18:03:25] andrewbogott: also, see my gerrit change above :) [18:04:02] i mean there is LDAP, and puppet-based cert store, and ... [18:04:33] i don't follow [18:04:44] puppet is not for humans [18:04:51] if you mean ssl [18:04:55] well, i would think it might be better to have one common storage of all security info [18:04:56] ssh keys are in puppet though [18:04:57] yes [18:05:02] exactly [18:05:08] ssh keys are not certs [18:05:12] sorry [18:05:14] that's what i meant [18:05:19] pub keys [18:05:33] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [18:05:59] idk, i think i like it better with all of prod off ldap. and LDAP seems good for labs [18:06:12] most of the time. :) [18:06:48] wouldn't you want one common security store, with an account and all public keys and all permission groups in one place [18:07:32] remember that nice list someone posted when tfinc's laptop got stolen? it had 20 items to check! [18:07:39] no. labs are prod are isolated (mostly) from each other for a reason [18:07:50] it should have been one "disable account" command :) [18:07:57] labs and prod* [18:09:04] New review: Hashar; "You also need to change the Gerrit hook mapping."
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/57752 [18:09:27] just some rumbling on my part really, as that list looked too long to be practical - security should be a simple on/off switch in an ideal world :) [18:10:09] *grumbling [18:10:10] yurik, whaaaaaa? turn security OFF?!! [18:10:19] lol [18:10:33] yep - but not security - one account ;0 [18:10:38] :) [18:11:04] and yes, in the ideal hipi world, we should trust others! [18:11:26] hippie* i think [18:11:35] * jeremyb_ runs off for a while [18:11:42] * yurik throws a snowball at jeremyb_ [18:12:02] * yurik checks outside if there is much snow left [18:12:26] i thinks not [18:12:59] wa! 14C! [18:13:03] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [18:13:14] what am i doing here?! oh yeah, i work here. dope [18:13:31] yurik: 19C on tuesday [18:13:36] my work life balance just crumbled [18:13:50] are we talking about NYC? [18:14:12] oh yeah!!! [18:14:16] but rainy [18:14:18] but still!!! [18:14:20] yurik, try now? [18:14:22] yei summer [18:15:00] andrewbogott, it works! [18:15:02] thx :) [18:15:27] but veeeeeryyyy slow! [18:15:58] i guess i shouldn't have 10 graphs of the entire year ... :) [18:16:30] then again, it should be optimized for that... strange [18:17:00] yurik: https://en.wikipedia.org/wiki/WP:MEET/NYC [18:18:00] New patchset: Pyoungmeister; "some cleanup for multi_instance mysql" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57755 [18:18:41] wow, just got a real shock from greg-g. 
the youtube didn't load at first and i booted iceweasel and reopened and it started and i didn't know what it was at first [18:18:45] jeremyb_, can't do april, flying away to another conf [18:18:54] yurik: :( [18:19:02] * jeremyb_ really runs away now [18:22:27] !log reedy synchronized php-1.21wmf12 'Killing active users in 1.21wmf12' [18:22:30] Logged the message, Master [18:22:30] jeremyb, are you confident that AuthName is only used as a GUI string, not referred to internally anyplace? [18:23:08] uh… jeremyb_ ^^ [18:25:50] Reedy_, sounds bloodthirsty! [18:29:04] jeremyb_: lol [18:29:08] jeremyb_: you're welcome :) [18:31:50] mark: still around? do you have an ETA/expected date for the varnish updates you'll be doing next week? [18:42:17] andrewbogott: let's say i'm 92% certain... [18:42:25] Gerrit unavailable again? [18:42:31] !log restarting gerrit [18:42:38] Logged the message, notpeter [18:42:59] Krinkle: there's a current bug that requires lots of restarting :/ [18:43:13] New patchset: Asher; "s1: moving watchlist to db1049, pulling db1043, adding db1056" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57758 [18:43:34] I was writing some comments. [18:43:55] This has been happening a lot lately (like once a week?) perhaps it's time to schedule a window for it instead of whenever operations feels like it [18:43:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57755 [18:44:15] Krinkle: ah, gotcha. I'm very sorry [18:44:23] New review: Andrew Bogott; "Looks good to me, but I'd like someone else to confirm that AuthName isn't used for anything other t..."
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57754 [18:44:27] no worries, just making an observation [18:44:40] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57758 [18:47:41] error: Your local changes to the following files would be overwritten by merge: [18:47:42] docroot/noc/interwiki/interwiki.cdb [18:47:43] Please, commit your changes or stash them before you can merge. [18:49:38] !log increased jvm heap size on silicon (payments activemq) [18:49:45] Logged the message, Master [18:49:53] ewww, java [18:53:52] New patchset: Pyoungmeister; "including cpufrequtils in the sanitarium/labsdb db class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57759 [18:54:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57759 [18:58:26] !log asher synchronized wmf-config/db-eqiad.php 's1: moving watchlist to db1049, pulling db1043, adding db1056' [18:58:33] Logged the message, Master [18:58:39] jeremyb_: :-( [19:00:00] better or worse than erlang? [19:00:34] notpeter: ugh. there is? [19:00:49] ^demon isn't here today [19:00:49] that's not good [19:00:59] the jvm is kinda great [19:01:30] Ryan_Lane: hashar sent out an email about it [19:01:44] binasher: saying that jokingly or seriously? [19:01:58] if he's serious I'm going to rope him in on memory tuning :-P [19:02:53] cmjohnson1: any update on https://rt.wikimedia.org/Ticket/Display.html?id=4734 [19:03:09] and were those going to get some more disks, or something? [19:03:21] just trying to get the full picture of stuffs [19:03:58] notpeter: they're moved and connected to the disk shelves...raid was set up on the shelves raid 10 256k, No Read Ahead [19:04:21] cmjohnson1: awesome! [19:04:25] Ryan_Lane: scala is a great language, lots of things written to run on the jvm are great, and it can perform extremely well [19:04:34] so, I can go ahead and image them? or are they already imaged?
[19:05:11] binasher: except during periods of leap seconds. [19:05:33] notpeter: yes you can [19:05:50] they are not imaged [19:05:57] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [19:06:30] notpeter: if only it had chronology protection [19:06:46] cmjohnson1: thanks! want to close the ticket? or i can as well [19:06:54] binasher: someone should get on that [19:07:17] i will close it. [19:08:39] Jeff_Green: errr, erlang is unknown and i've heard a few interesting things. i think. java is known! [19:09:15] jeremyb_: ya, i was just thinking back to our failed attempt with rabbitmq in a previous life [19:09:51] * jeremyb_ wonders if php has a celery equivalent :P [19:10:59] New patchset: Asher; "converting db1050 (s1 snapshot) to mariadb and file-per-table" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57760 [19:11:19] cmjohnson1: woo! thanks! [19:11:41] New patchset: BBlack; "Initial code upload, WIP" [operations/software/varnish/libvmod-netmapper] (master) - https://gerrit.wikimedia.org/r/57761 [19:12:01] so long since i looked into message queus [19:12:01] queues [19:12:09] (not that that really needs review at this point, I just wanted to check that I had gerrit+git setup right for it) [19:12:16] notpeter: yep...fyi..all is done as far as setup...dns changes, network, etc [19:13:01] bblack: why so many .gitignores? [19:13:05] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57760 [19:13:30] cmjohnson1: cool. eeeeexcellent [19:14:37] PROBLEM - DPKG on db1050 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:16:24] New review: JanZerebecki; "How did you test it?"
[operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/49069 [19:17:37] RECOVERY - DPKG on db1050 is OK: All packages OK [19:21:28] New patchset: Lcarr; "Fix paths to getJobQueueLengths.php from g37441" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57668 [19:21:36] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57668 [19:22:59] New patchset: Ori.livneh; "Move bots from #mediawiki to #mediawiki-feed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [19:25:18] New review: Ori.livneh; "Hashar: thanks for the review! I updated the tests that were checking for mediawiki.log to check for..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752 [19:27:57] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [19:38:03] New patchset: ArielGlenn; "alt ip for mirror at your.org for dump rsync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57764 [19:40:04] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57764 [19:44:16] I pushed out someone's job queue script path change as well [19:46:06] apergos: that seems to be LeslieCarr [19:46:21] weird but ok [19:49:58] oh sorry [19:49:59] that was me [19:50:05] for a job queue check [19:50:08] thanks apergos [19:50:22] yw [19:50:33] yeah I looked at it, seemed like a reasonable change so [19:53:10] !log streaming backup of db1056 to db1050 (converting s1 snapshot slave to file_per_table) [19:53:17] Logged the message, Master [19:56:31] New patchset: Asher; "converting db1021 to mariadb, adding db1058 to s1 for testing only" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57765 [20:00:32] New patchset: Asher; "converting db1021 to mariadb, adding db1058 to s1 for testing only" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57765 [20:01:42] New patchset: Asher;
"converting db1021 to mariadb, adding db1058 to s1 for testing only" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57765 [20:01:52] Change merged: BBlack; [operations/software/varnish/libvmod-netmapper] (master) - https://gerrit.wikimedia.org/r/57761 [20:02:00] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57765 [20:03:18] New patchset: Asher; "pulling db1021 for upgrade" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57766 [20:03:49] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57766 [20:05:06] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1021 for upgrade' [20:05:14] Logged the message, Master [20:05:35] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:08:45] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 20:08:35 UTC 2013 [20:09:35] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:09:45] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 20:09:39 UTC 2013 [20:10:25] RECOVERY - Puppet freshness on db1042 is OK: puppet ran at Fri Apr 5 20:10:16 UTC 2013 [20:10:25] RECOVERY - Puppet freshness on db1043 is OK: puppet ran at Fri Apr 5 20:10:17 UTC 2013 [20:10:35] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:10:35] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 20:10:34 UTC 2013 [20:11:35] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:12:15] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 20:12:05 UTC 2013 [20:12:19] Waiting for 10.64.32.26: 25 seconds lagged [20:12:21] on enwiki [20:12:35] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:12:45] RECOVERY - Puppet freshness on xenon is OK: 
puppet ran at Fri Apr 5 20:12:43 UTC 2013 [20:13:35] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [20:13:56] back to normal :) [20:16:28] New review: Hashar; "Wonderful \O/" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/57752 [20:18:24] RobH: what's the status of the hume replacement? [20:18:45] PROBLEM - Host labstore1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:18:45] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:18:46] LeslieCarr: the flagging icinga for xenon notifications look funky. i wonder if that's because the OK-case is satisfied that there is a puppet run currently in progress, whereas the failure check requires a completed run. [20:18:59] oh it's because it's in decom [20:19:17] being re-purposed [20:19:24] pining for the fjords [20:19:24] and naggen runs so that icinga thinks "oh noes, xenon is bad" - then the cleaning out decom script runs so icinga goes "oh it doesn't matter" [20:19:41] LeslieCarr: ahhhhh, makes sense [20:20:47] the old way didn't have this specific problem, but it did basically max out a core on the puppetmaster for 45 minutes, so a few spurious pages is worth the extra performance gain [20:20:56] leslicarr: can we get icinga to stop reporting it though? [20:27:05] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [20:27:12] AaronSchulz: its online and its maint scripts are now setup for both centers [20:27:17] AaronSchulz: afaik its ready to go [20:27:24] i closed off ops part of the ticket [20:27:34] i mean, hum eis still doing the scripts [20:27:49] but terbium should be good to replace whenever we are ready to turn off hume [20:28:05] RobH: can you comment on https://bugzilla.wikimedia.org/show_bug.cgi?id=46428 ? 
[20:28:06] it should be as simple as swapping the true/false settings in site.pp
[20:28:23] I'd like to see that stuff moved over and then ban scripts from running in the wrong place
[20:28:28] oh, i have no idea if terbium is aware of mw servers in both sites
[20:28:47] i just know the server is online, puppet runs, and its good for the scripts to be worked on via gerrit
[20:31:02] New patchset: Ori.livneh; "Disable ClickTracking extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57769
[20:32:12] robh which rt queue should get an SSL certificate purchase request? procurement?
[20:32:30] yep, though i warn you they are halted for a week now pending legal review
[20:32:57] ie: startssl organization verification needs someone on the org paperwork (like erik) to sign off on some documents
[20:33:01] ohrly? what are they reviewing looking for in the review?
[20:33:05] that contracts folks are still reviewing
[20:33:09] ah ok
[20:33:17] New patchset: Ori.livneh; "Disable ClickTracking extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57769
[20:33:25] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Fri Apr 5 20:33:16 UTC 2013
[20:33:31] cuz its some document that verifies that i can, as an agent of wikimedia, purchase certifications with organizational validation set
[20:33:35] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[20:33:42] so once its done, i can run off all the certs we need
[20:33:43] for no added expense =]
[20:34:01] you can even just hand me CSRs in the ticket if thats easier for you
[20:34:08] since i imagine its fundraising related and i dont need the private key.
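[Annotation: the "swapping the true/false settings in site.pp" step mentioned at 20:28:06 might look roughly like the sketch below. The node names are the hosts from the conversation, but the class name and parameter are purely illustrative; the actual manifests are not shown in this log.]

```puppet
# Illustrative only -- flip which host runs the maintenance scripts.
node 'hume.wikimedia.org' {
  class { 'misc::maintenance': enabled => false }  # retiring Tampa host
}
node 'terbium.eqiad.wmnet' {
  class { 'misc::maintenance': enabled => true }   # eqiad replacement
}
```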
[20:34:16] (most folks just have me generate both for them)
[20:34:28] just make them minimum 2048bit
[20:34:32] that will be fine--this is for the public reporting server
[20:34:34] (i figured you would, but just saying)
[20:34:40] cool
[20:34:41] i'm not really sure why we even care, but apparently we care
[20:34:43] yea, procurement
[20:34:57] im shocked how much folks actually give a shit about certs
[20:34:59] ok. I need to double-check the hostname but I'll put it in soon
[20:35:03] yeah
[20:35:41] dentist time
[20:36:19] LeslieCarr: there was no lunch
[20:36:22] lunch was the ballgame
[20:36:25] RobH: looks like mwscript has issues with /home there
[20:36:27] and i declined that so no vendor lunch ;]
[20:36:52] AaronSchulz: Yea, I closed the ticket saying the puppet runs were complete, but the mediawiki related script stuff is still not working
[20:37:02] but i assumed the updating of those would fall to dev, no ops
[20:37:05] oh, no /usr stuff either, probably missing some conf somewhere
[20:37:08] (not trying to pass the buck, just wasnt sure)
[20:37:11] =[
[20:37:38] RobH: is it in the dsh group used by scap?
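[Annotation: handing RobH a CSR "minimum 2048bit", as discussed above, can be done with a standard openssl one-liner. The hostname and subject fields below are placeholders -- the real reporting-server name was still being confirmed in the conversation.]

```shell
# Placeholder hostname; substitute the real one before filing the RT ticket.
HOST=reporting.example.org

# Generate a fresh 2048-bit RSA key and a CSR in one step.
# -nodes leaves the private key unencrypted (keep it on the server only;
# as noted above, only the CSR goes in the ticket).
openssl req -new -newkey rsa:2048 -nodes \
  -keyout "${HOST}.key" -out "${HOST}.csr" \
  -subj "/C=US/O=Example Org/CN=${HOST}"

# Sanity-check the request before handing it over.
openssl req -in "${HOST}.csr" -noout -subject
```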
[20:38:21] PROBLEM - mysqld processes on db1058 is CRITICAL: NRPE: Command check_mysqld not defined
[20:38:21] PROBLEM - MySQL Slave Delay on db1058 is CRITICAL: NRPE: Command check_mysql_slave_delay not defined
[20:38:21] PROBLEM - DPKG on db1058 is CRITICAL: NRPE: Command check_dpkg not defined
[20:38:31] PROBLEM - Disk space on db1058 is CRITICAL: NRPE: Command check_disk_space not defined
[20:38:31] PROBLEM - MySQL Slave Running on db1058 is CRITICAL: NRPE: Command check_mysql_slave_running not defined
[20:38:41] PROBLEM - Full LVS Snapshot on db1058 is CRITICAL: NRPE: Command check_lvs not defined
[20:38:41] PROBLEM - MySQL disk space on db1058 is CRITICAL: NRPE: Command check_disk_6_3 not defined
[20:38:51] PROBLEM - MySQL Idle Transactions on db1058 is CRITICAL: NRPE: Command check_mysql_idle_transactions not defined
[20:39:01] PROBLEM - MySQL Recent Restart on db1058 is CRITICAL: NRPE: Command check_mysql_recent_restart not defined
[20:39:01] PROBLEM - RAID on db1058 is CRITICAL: NRPE: Command check_raid not defined
[20:39:04] well sync-common give perm errors anyway, no parent dirs
[20:39:11] PROBLEM - MySQL Replication Heartbeat on db1058 is CRITICAL: NRPE: Command check_mysql_slave_heartbeat not defined
[20:39:33] Ryan_Lane: tried building a new instance but its state is error in the instance list.. is that normal during build?
[20:39:44] no
[20:39:55] let me see why its erroring
[20:39:58] which project is this?
[20:40:32] ah
[20:40:34] testlabs
[20:40:51] fault | {u'message': u'RemoteError', u'code': 500, u'created': u'2013-04-05T2 ....
[20:40:54] AaronSchulz: so it was in? im not sure, will check
[20:41:19] AaronSchulz: its in none of them =p
[20:41:22] so will have to add today
[20:41:27] InstanceNotFound: Instance i-0000066b could not be found.
[20:41:28] wtf
[20:41:59] RemoteError: Remote error: QuotaError Quota exceeded: code=FixedAddressLimitExceeded
[20:42:02] double wtf
[20:43:39] where in the hell would that be set?
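[Annotation: the answer to "where in the hell would that be set?" -- and the fix Ryan later pushed as "Set the fixed_ips quota to something high" -- lives in nova.conf. The sketch below assumes the Grizzly-era option name `quota_fixed_ips`; the value is illustrative, not the one from change 57775.]

```ini
# nova.conf sketch (option name assumed from the Grizzly release;
# the low upstream default for fixed IPs per project is what appears
# to have tripped FixedAddressLimitExceeded above).
[DEFAULT]
quota_fixed_ips = 1000
```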
[20:49:25] binasher: https://bugs.launchpad.net/horizon/+bug/1157950
[20:50:32] wtf
[20:50:41] PROBLEM - Disk space on cp1041 is CRITICAL: Timeout while attempting connection
[20:51:20] notpeter: how does /usr/local/apache/common-local get initially created on new servers?
[20:51:37] binasher: ok. it'll work now
[20:51:47] binasher: delete/recreate
[20:52:07] that bug doesn't even fucking affect us, but now I have to deal with the change
[20:52:20] I'm just going to up that default to something really high
[20:52:23] haha
[20:52:31] RECOVERY - Disk space on cp1041 is OK: DISK OK
[20:52:35] we limit that by instances anyway
[20:53:43] Ryan_Lane: ok, recreated instance is in building state. thanks!
[20:53:47] yw
[20:54:05] I didn't see it in the list of quotas because the controller hadn't applied the patch
[20:54:51] New patchset: MaxSem; "Rebuild LocalisationCache to tmpfs, appears helluva faster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57774
[20:55:05] New patchset: Ryan Lane; "Set the fixed_ips quota to something high" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57775
[20:55:38] what I really love is that this errors out *after* the instance builds
[20:55:44] * Ryan_Lane grumbles
[20:55:54] well, after the api says "yes, I'll build this for you"
[20:56:05] New patchset: Ryan Lane; "Set the fixed_ips quota to something high" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57775
[20:56:26] heh. puppet repo being ff-only is a pain in the ass
[21:06:35] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[21:08:44] AaronSchulz: you there? check out #wikimedia-tech
[21:08:54] flaggedrevs are utterly broken on plwikisource
[21:10:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57775
[21:11:50] New patchset: Asher; "Revert "pulling db1021 for upgrade"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57776
[21:12:04] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57776
[21:13:15] !log asher synchronized wmf-config/db-eqiad.php 'returning db1021 at a low weight'
[21:13:23] Logged the message, Master
[21:13:44] AaronSchulz: so yea, it should make those, im tossing in node lists and attempting manual script runs
[21:14:00] sweet
[21:14:25] death to tampa servers \o/
[21:14:27] New patchset: Andrew Bogott; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026
[21:18:25] New patchset: Asher; "pulling db1042" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57777
[21:19:32] RobH: I see files :)
[21:21:03] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57777
[21:22:25] PROBLEM - Host db1058 is DOWN: PING CRITICAL - Packet loss = 100%
[21:26:21] !log asher synchronized wmf-config/db-eqiad.php 'returning db1021 to full weight, pulling db1023'
[21:26:58] Logged the message, Master
[21:27:04] AaronSchulz: still running, very slow
[21:27:07] =P
[21:27:19] this is why i manually run it to see verbose output, damn you scripts!
[21:27:45] RECOVERY - Host db1058 is UP: PING WARNING - Packet loss = 64%, RTA = 0.43 ms
[21:30:05] PROBLEM - SSH on db1058 is CRITICAL: Connection refused
[21:33:44] AaronSchulz: do you still need an answer?
[21:33:59] it might be nice, though probably not
[21:34:25] there's an exec that runs and does sync common whenever puppet runs and apache isn't running
[21:34:44] this was to get around nodes getting turned back on and running horribly out-dated versions of mediawiki
[21:34:51] but it also covers the case of "just installed"
[21:37:11] RECOVERY - SSH on db1058 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:38:39] New patchset: Asher; "removing replaced hosts from s1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57778
[21:39:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57778
[21:41:41] PROBLEM - NTP on db1058 is CRITICAL: NTP CRITICAL: Offset unknown
[21:43:41] RECOVERY - NTP on db1058 is OK: NTP OK: Offset -0.07375764847 secs
[21:48:04] uhh
[21:48:09] who has edited pdns
[21:48:13] and added a bunch of srv entries?
[21:48:15] (wtf?!?!
[21:48:16] )
[21:48:26] i should have svn diff before i touched it
[21:49:08] cmjohnson1: was this you?
[21:49:12] (i doubt it, but dunno)
[21:49:46] not me
[21:49:56] now its all fubar, i have no idea who did this
[21:50:04] oh sorry robh
[21:50:04] it was me
[21:50:05] ...why?
[21:50:05] i forgot to check in
[21:50:11] we dont use srv.
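[Annotation: the guard described at 21:34:25 -- re-sync the MediaWiki tree on any puppet run where apache is not running, covering both "just installed" and "powered back on stale" -- could be sketched in Puppet roughly as below. The resource name, command path, and service check are all hypothetical, not the actual manifest.]

```puppet
# Hypothetical sketch only: sync-common runs when the 'unless' check
# fails, i.e. whenever apache is not running at puppet time.
exec { 'mediawiki-sync-on-install':
  command => '/usr/bin/sync-common',
  unless  => '/usr/sbin/service apache2 status',
  timeout => 1800,  # a full tree sync can take a while
}
```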
[21:51:57] New patchset: Pyoungmeister; "updating labsdb role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57780
[21:54:26] New patchset: Pyoungmeister; "updating labsdb role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57780
[21:55:03] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57780
[21:55:24] the new dell srv's
[21:59:36] New patchset: MaxSem; "Rebuild LocalisationCache to tmpfs, appears helluva faster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57774
[22:06:36] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[22:35:14] New review: MZMcBride; "In terms of functionality, this seems fine." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649
[22:47:08] New review: MZMcBride; "I'm trying to assume good faith, but it's difficult to see how you could link three bugs in the comm..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752
[23:04:51] New review: Ori.livneh; "For the record:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752
[23:05:27] New review: Ori.livneh; "God damn line breaks argh blah" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57752
[23:05:59] New review: Matmarex; "Personally I believe that this is the right path forward, and consensus be damned (even though it se..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57752
[23:06:12] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours
[23:10:44] !log running iwlinks index migration on s4 (commons)
[23:10:51] Logged the message, Master
[23:42:19] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[23:46:19] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[23:46:19] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[23:46:19] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[23:47:15] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57310