[00:00:06] adding the cname now [00:00:25] then need to check out the rsyslogd setup in puppet [00:03:04] New patchset: Reedy; "Move wgDebugLogGroups from CommonSettings.php to InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46651 [00:03:06] AaronSchulz: ^ You're welcome ;) [00:03:51] you remembered ;) [00:04:21] New patchset: Reedy; "Move wgDebugLogGroups from CommonSettings.php to InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46651 [00:05:20] RoanKattouw_away: i changed my user agent to fake Internet Explorer 8 and the VisualEditor appeared [00:05:24] Reedy: it's added but will take some time to propagate (syslog.eqiad.wmnet is an alias for fenari.wikimedia.org.) [00:06:01] Great, thanks. It shouldn't be a major rush now [00:06:02] mutante: Well there you go, then it's a bug with us not recognizing Iceweasel [00:06:27] want Bugzilla? [00:06:32] ^demon: I'm trying to get the strace stuff pushed for you [00:06:37] New patchset: RobH; "sudo for strace for demon in role appservers (RT-4066)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42791 [00:06:39] who knows if folks will approve [00:09:32] notpeter: wanna take a gander at the above patchset and see if its sane? [00:09:49] daniel listed you as a reviewer on it (dunno why) [00:09:50] New patchset: Reedy; "Move wgDebugLogGroups from CommonSettings.php to InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46651 [00:09:59] i assume cuz you deal with apaches a lot recently. [00:10:36] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46651 [00:10:37] RobH: sounds legit [00:11:34] hrmm, i wonder if i should push it or just poke mark about it tomorrow am [00:11:48] notpeter: it seems legit enough that i can just merge it dontcha think [00:11:49] ? 
[00:11:55] * RobH wants other folks on the firing line with him [00:12:16] the linked patchset looks like it will accomplish the task stated in that rt ticket [00:13:03] https://gerrit.wikimedia.org/r/#/c/42791/3 [00:13:28] rfaulkner: https://gerrit.wikimedia.org/r/#/q/status:open+project:sartoris,n,z [00:13:55] !log reedy synchronized wmf-config/ [00:13:56] Logged the message, Master [00:15:28] New patchset: RobH; "sudo for strace for demon in role appservers (RT-4066)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42791 [00:17:56] wtf: host syslog.eqiad.wmnet recursor0 => "not found: 3(NXDOMAIN)" vs. host syslog.eqiad.wmnet ns0 => "syslog.eqiad.wmnet is an alias for fenari.wikimedia.org" [00:18:04] recursor0 and ns0 both == dobson [00:19:13] !log restarted pdns-recursor on dobson [00:19:14] Logged the message, Master [00:20:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:22:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:22:57] New review: RobH; "I made the changes requested by Faidon, and this now looks legit to me. However, as it touches ever..." 
[operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/42791 [00:23:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.141 second response time [00:23:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.458 seconds [00:29:21] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [00:30:00] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [00:34:47] AaronSchulz: I have seen that, yes [00:36:03] New patchset: Dzahn; "remove setting $wgBlockDisablesLogin to true for foundationwiki (RT-690) (bug 44473)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46653 [00:42:58] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46653 [00:43:56] !log sync InitialiseSettings.php for bug 44473 [00:43:57] Logged the message, Master [00:44:05] !log dzahn synchronized ./wmf-config/InitialiseSettings.php [00:44:06] Logged the message, Master [00:44:15] ty :) [00:44:23] New patchset: Pyoungmeister; "debug: adding some notifies for troubleshooting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46654 [00:45:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46654 [00:48:30] New patchset: Pyoungmeister; "Revert "debug: adding some notifies for troubleshooting"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46655 [00:49:09] rfaulkner: ping [00:49:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46655 [00:56:34] Ryan_Lane: I need your help on salt when your crisis is over [00:56:45] paravoid: what do you need help with? [00:56:49] [WARNING ] Setting up the Salt Minion "ms-be1003.eqiad.wmnet" [00:56:49] [CRITICAL] The Salt Master has rejected this minion's public key! 
[00:56:53] what I'm doing is a boring long shitty manual process [00:56:55] the box was reprovisioned [00:56:57] ah [00:57:03] so I'm guessing we don't revoke salt keys or something [00:57:10] salt-key -d ms-be1003.eqiad.wmnet [00:57:16] on sockpuppet [00:57:16] can we automate this? [00:57:21] yes [00:57:28] and sync states between puppet/salt? [00:57:38] I'd like to change our bootstrapping to install salt first [00:57:48] okay, if I do salt-key -d will puppet recreate the key then? [00:57:49] then to have salt run puppet on the server [00:57:52] yep [00:58:12] we can use salt reactors to automatically sign puppet certs based on salt certs [00:58:21] same with deleting them when salt keys are deleted [00:59:09] I started working on that last night for labs. it should be easy to rework it for production too [00:59:53] New patchset: Pyoungmeister; "some var cleanup per faidon's suggestion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46658 [00:59:57] can't we just make puppet sync keys no matter what? [01:00:19] what do you mean? [01:00:20] for salt? 
[01:00:25] yeah [01:00:41] salt isn't x509 [01:01:01] so the pub/private keys can't be the same [01:02:04] with proper bootstraping we can actually avoid needing the new_install ssh key [01:03:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46658 [01:05:47] we could do exported resources for the pub key, so that the master knows the salt key is valid [01:05:47] that'll be slower and slightly less useful, though [01:27:06] New patchset: Pyoungmeister; "debug: more notifies for debugging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46663 [01:28:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46663 [01:28:45] stupid SVN..arggggg [01:29:13] "unexpectedly changed special status" thing ...annoy [01:48:07] New patchset: Pyoungmeister; "using primary_site = "none" instead of primary_site = false for replication topology" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46667 [01:56:17] Change abandoned: Asher; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46667 [01:57:58] New patchset: Asher; "trying to fix slave monitoring logic for read-only masterless instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46670 [01:58:28] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46670 [02:02:10] oh hey [02:02:33] binasher: I was chatting with peter about that, did he leave for the day? [02:02:46] yeah, it was a simple thing to fix.. 
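The salt re-enrollment exchange above (around 00:57) boils down to deleting the stale minion key on the master before the reprovisioned box can re-submit a new one. A dry-run sketch of that flow follows; `salt-key -d` and the master host "sockpuppet" come from the conversation, but the wrapper function is hypothetical, and the accept step (`salt-key -a`) is an assumption — the log only shows the delete:

```shell
#!/usr/bin/env bash
# Sketch of the re-enrollment flow discussed above: when a box is
# reprovisioned, its old salt key must be deleted on the master
# before the new minion's key can be accepted. This helper only
# PRINTS the commands (dry run); on a real master they would be
# executed directly. Hypothetical wrapper, not a WMF tool.

reenroll_minion() {
    local minion="$1"
    # Delete the stale key left over from the old installation
    echo "salt-key -d ${minion}"
    # Accept the freshly generated key once the minion reconnects
    # (assumption: the master is not configured to auto-accept)
    echo "salt-key -a ${minion}"
}

reenroll_minion ms-be1003.eqiad.wmnet
```

The salt reactor approach Ryan mentions would replace the manual accept step entirely, triggering puppet cert signing from the salt key event.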
[02:03:02] yeah didn't initially see it either though [02:03:13] string interpolation into boolean [02:03:56] nah, was trying to treat something undef as a bool false [02:05:02] New patchset: Asher; "cleaning up debug notifies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46671 [02:05:02] err [02:05:08] I don't see it [02:05:12] thought I did [02:05:20] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46671 [02:05:52] $dict { 'key1' => false, 'key2' => string } [02:06:38] $dict['key1'] -> doesn't actually get defined as a bool, it doesn't exist [02:06:59] puppet. [02:08:04] wtf [02:08:06] so you can't test if $dict['key1'] == false in the above case [02:08:07] didn't know that [02:08:07] http://projects.puppetlabs.com/issues/18234 [02:08:16] fixed in 3.0.2 it seems [02:08:54] hah, awesome. fixing that of course breaks everything based around years of puppets flawed logic [02:09:30] seems like a thing to change in a major version release only [02:09:48] oh wait, not merged yet [02:09:55] In Topic Branch Pending Review [02:10:23] i'd prefer it wait til 3.1.. 
but not enough to post [02:11:02] anyway [02:11:05] thanks for that [02:11:24] no prob [02:11:32] I didn't know that, although it does seem something that would have bitten me before in the past [02:14:59] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [02:16:01] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [02:16:15] that would be me, ignore [02:17:21] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [02:17:32] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [02:19:01] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [02:21:19] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [02:21:22] PROBLEM - SSH on labstore4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:12] RECOVERY - SSH on labstore4 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [02:28:23] !log LocalisationUpdate completed (1.21wmf8) at Wed Jan 30 02:28:22 UTC 2013 [02:28:25] Logged the message, Master [02:38:43] dzahn is doing a graceful restart of all apaches [02:39:24] !log dzahn gracefulled all apaches [02:39:25] Logged the message, Master [02:39:26] dzahn: https://bugzilla.wikimedia.org/show_bug.cgi?id=44395 too [02:52:03] !log LocalisationUpdate completed (1.21wmf7) at Wed Jan 30 02:52:03 UTC 2013 [02:52:04] Logged the message, Master [02:54:52] catrope is doing a graceful restart of all apaches [02:55:05] !log catrope gracefulled all apaches [02:55:06] Logged the message, Master [02:56:34] catrope is doing a graceful restart of all apaches [03:06:49] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46011 [03:07:25] !log reloading Apaches in eqiad using manual dsh (because apache-graceful-all sanity check has a bug) [03:07:26] Logged the message, Master [03:09:13] that's lots of gracefulling [03:11:16] !log wikimania.asia redirect now 
works, it did not before because of a bug in apache-graceful-all not actually restarting Apaches in eqiad [03:11:17] Logged the message, Master [03:11:41] jeremyb: yea, manual gracefulling bypassing the sanity check, heh [03:11:56] because that check had a bug, thanks to Roan for finding it [03:12:58] mutante: oh, he wasn't trying to graceful for his own thing he was just debugging? [03:13:08] One of those was me typing less `apache-graceful-all` instead of less which `apache-graceful-all` [03:13:21] yea [03:13:21] Before that, I ran it once for real to see the output [03:13:25] neither of those [03:13:33] less `which apache-graceful-all` [03:13:50] Sorry yes, the latter [03:13:58] :) [03:14:26] * jeremyb wonders what wikimania.asia even means [03:15:01] wikimania2013 [03:15:37] jeremyb: try it in your browser [03:16:32] Works for me. That's neat. [03:16:49] ohhhhhhhhhhh [03:16:52] asia is a TLD [03:17:02] it is, just like .museum and .xxx :) [03:17:44] (didn't try it yet. you said try it so i just went straight to wikimania2013.wikimedia.org) [03:17:44] yea, this was about redirecting one to the other [03:17:44] now, tried and it does seem to work [03:17:46] right [03:17:53] i just thought it was a fake internal name [03:17:56] like smnet [03:17:58] wmnet* [03:18:23] Nope [03:18:27] heh.. i see.yeah. i expect we will be asked to redirect this one in every year Wikimania is actually in Asia:) [03:18:39] We just had a lot of fun because http://wikimania.asia is the first new redirect we set up after the eqiad migration [03:18:54] we should do Istanbul just to discuss whether it's Europe or Asia..haha [03:19:27] http://en.wikipedia.org/wiki/.asia [03:19:28] we could just redirect both europe and asia to istanbul [03:19:49] mutante: we already did a controversy: gdanzing [03:19:54] gdanzig* [03:20:09] hmm.. there is .eu but not .europe [03:20:40] oh, i can see that discussion yeah.. 
Gdansk [03:20:57] i think maybe there's an eu.org or something too [03:21:08] hey, another thing.. we have those US based state or city chapters, right [03:21:21] like Wikimedia NYC, and California etc [03:21:27] i think there's no city chapters [03:21:46] there is a plan to make a .nyc tld [03:21:46] there are [03:21:48] Wikimedia Boston [03:21:51] no [03:22:00] wikimedia new england not boston [03:22:06] and NYC is really not NYC [03:22:26] DC is maybe the closest to being just a city but it's also multiple states really [03:22:32] http://meta.wikimedia.org/wiki/Wikimedia_chapters [03:22:46] that table still lists it as Boston [03:22:54] but i see it's redirected..ok [03:23:35] well, state or city is secondary.. here's the suggestion though [03:23:49] that page needs updates! [03:23:58] it says 2012 in future tense [03:24:00] we have "pa.us.wikimedia" and stuff for those [03:24:08] pa is defunct [03:24:11] and the problem is that they can't support HTTPS [03:24:25] because we cant get a *.*.wikimedia.org certificate [03:24:31] i know [03:24:33] mobile [03:24:35] and at the same time we have wikimedia.us the domain [03:24:39] and it doesnt redirect anywhere [03:25:15] so my suggestion is to solve those both at once [03:25:15] by declaring wikimedia.us the domain for those US chapters [03:25:15] and use pa.wikimedia.us etc [03:25:19] yeah, except there's no PA activity [03:25:20] have nicer URLs, dont have a certificate problem AND have a use for that domain .. and it actually matches [03:25:32] pa is within the jurisdiction of NYC [03:25:33] it is wikimedia ...in the us [03:26:02] anyway, i can suggest it. i don't have a strong opinion [03:26:04] yea, it is not just that one.. it is about all of them in *.us.wikimedia [03:26:17] what are there? [03:26:22] nyc is nyc.wikimedia.org [03:26:27] not .us.wikimedia.org [03:26:43] pa is defunct [03:27:29] hmmm.. 
interesting ..maybe you are right and we can just close that ticket ..also nice:) [03:27:38] looking at DNS zone for all sub.sub domains [03:28:03] de.labs, flaggedrevs.labs, www.commons, noboard.chapters ... [03:28:19] but indeed no other US chapters as i expected [03:28:49] http://noboard.chapters.wikimedia.org/ ??? [03:28:54] what is that:) [03:29:33] mutante: norway? [03:29:37] rings a bell [03:31:17] yeah, Norwegian. nice skin though,hah [03:32:28] mutante: Looks like the "standard" private wiki skin from a while back. [03:32:40] compare with otrs-wiki [03:32:46] Yup. [03:33:24] ok, it's the only one that has *.chapters.wikimedia.org [03:33:29] let's clean that up some time [03:33:54] mutante: Ideally none of the wikis for chapters should be hosted by WMF, of course. [03:33:54] i hope they are fine moving to something else.. if its still used [03:34:16] James_F: that is different with every single domain name :p [03:34:22] mutante: I know. [03:34:33] mutante: But it's actively-problematic legally, for instance. [03:34:38] * James_F sighs. [03:34:45] all domain transfers go through legal ..shrug [03:35:17] Yeah. [03:35:27] sometimes the chapters own domains and have their nameservers [03:35:31] 'Cos they have to worry about trademarks, and have to worry about costs. [03:35:35] sometimes they own the domain but point to our NSes [03:35:42] sometimes we own it but redirect to them [03:35:47] Yeah. [03:35:48] and so on..you will find any combination [03:36:09] thats why we have a whole queue just for domains..sugh [03:36:10] I was the original owner of wiki{p,m}edia.{co,org}.uk. [03:36:29] That's interesting syntax. [03:36:35] sometimes it's owned by a complete stranger but redirects to a sensible place so no one has managed to get it transferred (not quite sure the entire story. thinking of wikimania.org) [03:36:45] I was going to transfer them to WMF, but WMF wasn't keen at the time, and WMUK (which I founded) then wanted them, so in the end WMUK got them. 
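The "interesting syntax" in that domain list is plain bash brace expansion: each comma group alternates, and multiple groups expand as a cross-product, so one pattern names all four domains. A quick demo:

```shell
# Bash brace expansion: {p,m} and {co,org} expand as a cross-product,
# so one pattern covers all four wiki[pm]edia UK domains.
echo wiki{p,m}edia.{co,org}.uk.
# -> wikipedia.co.uk. wikipedia.org.uk. wikimedia.co.uk. wikimedia.org.uk.

# Numeric ranges expand too:
echo {0..5}
# -> 0 1 2 3 4 5
```

Note this is a bash (and zsh/ksh) feature, not POSIX sh, and it happens before filename expansion, so it works on names that don't exist on disk.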
[03:37:00] Susan: try echo wiki{p,m}edia.{co,org}.uk. in bash [03:37:24] Susan: RISC OS PRM house style. Sticks so much, 20 years later. [03:37:56] Interesting. [03:38:05] I thought it was just mangled regex, heh. [03:38:08] no [03:38:18] Susan: echo {0..5} [03:38:28] I hate bash. [03:38:41] Susan: Bash is lovely, you. [03:41:07] James_F: well, if you wanna try again.. mpaulson :) [03:41:24] mutante: I no longer have the domains. WMUK does. :-) [03:41:32] mutante: As of 2005. [03:41:50] mutante: Sorry, should have been more clear. [03:44:26] alright, as long the chapter and legal is happy:) ehmm.. but wikipedia.co.uk is broken .. [03:44:50] * James_F sighs. [03:44:50] while i see remnants of it in our DNS .. why is this not surprising anymore..hrmm [03:44:56] Will ping them an e-mail. [03:45:41] Oh, wait. [03:45:47] wikipedia.co.uk and .org.uk went to WMF. [03:45:51] Or, rather, back then, to Bomis. [03:45:54] Who still have it. [03:46:02] Which is a bit of a problem. :-) [03:46:21] * James_F sighs. [03:46:24] Someone else can deal. [03:46:33] this is fun! [03:46:43] legal :) [03:48:13] oh, and it still says aude is president on the page mutante linked! [03:48:16] way out of date [03:48:21] (as of marchish last year) [03:49:14] NYC is also wrong [03:51:14] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46558 [03:51:44] * jeremyb runs away [03:51:48] Faster. [03:52:16] yep, me too. i'll be back from European timezone... 
cu ..out [03:59:21] New patchset: Spage; "Add a logbot class for #wikimedia-e3 channel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46672 [04:34:14] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:35:14] PROBLEM - HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:00] PROBLEM - HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:09] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:47:24] PROBLEM - NTP on kaulen is CRITICAL: NTP CRITICAL: No response from NTP server [04:58:00] PROBLEM - NTP on kaulen is CRITICAL: NTP CRITICAL: No response from NTP server [04:59:19] Bugzilla is down. [05:01:24] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [05:34:27] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [05:34:38] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 184 seconds [05:34:39] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 184 seconds [05:35:30] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 203 seconds [05:36:15] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [05:36:39] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [05:36:39] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [05:37:18] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [05:39:06] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [05:39:06] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [05:39:06] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [05:39:06] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the 
last 10 hours [05:39:07] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [05:39:07] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:41:03] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [05:43:49] PROBLEM - Host kaulen is DOWN: PING CRITICAL - Packet loss = 100% [05:47:57] PROBLEM - Host kaulen is DOWN: CRITICAL - Host Unreachable (208.80.152.149) [05:52:51] I'm looking at kaulen [05:53:48] So am I, but you're a lot better at looking at things [05:53:56] I'm mostly just going "oh no". [05:57:18] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [05:57:28] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 26.93 ms [05:57:39] RECOVERY - HTTP on kaulen is OK: HTTP OK: HTTP/1.1 302 Found - 489 bytes in 0.055 second response time [05:57:40] you don't need to be smart to press the reset button [05:57:49] you do need the password though, which I assume you don't have ;) [05:58:03] RECOVERY - HTTP on kaulen is OK: HTTP OK - HTTP/1.1 302 Found - 0.005 second response time [05:58:03] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [05:58:12] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [05:58:28] yup. 
[06:05:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:09:25] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [06:25:24] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100% [06:25:55] PROBLEM - Host mw1041 is DOWN: PING CRITICAL - Packet loss = 100% [06:28:24] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 06:28:22 UTC 2013 [06:28:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:28:44] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 06:28:36 UTC 2013 [06:29:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:33:54] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 06:33:44 UTC 2013 [06:34:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [07:36:46] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours [07:43:05] PROBLEM - Puppet freshness on labstore3 is CRITICAL: Puppet has not run in the last 10 hours [07:45:20] New patchset: Ryan Lane; "Changes for newer glusterfs packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46680 [07:46:03] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46680 [07:51:02] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Puppet has not run in the last 10 hours [07:52:59] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [08:07:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 08:07:42 UTC 2013 [08:08:31] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:09:49] New patchset: Ryan Lane; "Split gluster into gluster-client/gluster-server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46681 [08:10:28] Change merged: Ryan Lane; 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/46681 [08:11:44] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Jan 30 08:11:13 UTC 2013 [08:12:38] RECOVERY - Puppet freshness on labstore3 is OK: puppet ran at Wed Jan 30 08:12:22 UTC 2013 [08:19:14] RECOVERY - Puppet freshness on labstore1 is OK: puppet ran at Wed Jan 30 08:18:51 UTC 2013 [08:25:19] New patchset: Ryan Lane; "Pin labsconsole memcache to ubuntu version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46684 [08:29:08] RECOVERY - Puppet freshness on labstore2 is OK: puppet ran at Wed Jan 30 08:28:51 UTC 2013 [09:05:16] New patchset: Silke Meyer; "Add a variable to enable/disable experimental Wikidata features in labsconsole" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46536 [09:16:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [09:24:45] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [09:26:17] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [09:26:39] mutante, paravoid, Ryan_Lane can you check https://bugzilla.wikimedia.org/show_bug.cgi?id=44499 [09:26:49] labsconsole is down [09:30:45] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [09:31:32] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.032 second response time on port 11000 [10:09:23] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours [10:38:30] paravoid or apergos could you review https://gerrit.wikimedia.org/r/#/c/46548/ please?:) [10:42:32] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46548 [10:47:41] done, puppet ran ok [10:47:47] back in a while (lunch) [10:54:11] thanks! 
[11:01:04] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [11:19:29] New patchset: Hashar; "(bug 44424) wikiversions.cdb for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46726 [11:19:46] New review: Hashar; "Added back with https://gerrit.wikimedia.org/r/#/c/46726/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46240 [12:07:38] RECOVERY - Puppet freshness on labstore2 is OK: puppet ran at Wed Jan 30 12:07:36 UTC 2013 [12:09:13] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:14:58] New patchset: Mark Bergsma; "Remove double slashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44076 [12:14:58] New patchset: Mark Bergsma; "Set CORS header in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44078 [12:15:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44076 [12:19:03] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [12:22:19] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [12:22:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44078 [12:23:30] ololo, someone's attempting an SQL injection attack on us:) [12:23:42] Exception from line 549 of /usr/local/apache/common-local/php-1.21wmf8/includes/content/ContentHandler.php: Format text/x-wiki%' aNd 4795=4796-1 aNd 'VNwC'!=' is not supported for content model wikitext [12:24:06] lots of similar crap in exception logs [12:24:17] shall we ban this fucker? [12:25:24] will it stop them? ;) [12:25:58] if we ban on a Squid level, it will. 
for a time, if they're persistent:) [12:26:28] Format text/x-wiki'/**//**/> is not supported [12:26:42] well they can just change to a different ip [12:46:02] New patchset: Mark Bergsma; "Use $::mw_primary for bits app servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46738 [12:47:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46738 [12:50:53] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [12:51:54] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [12:52:48] !log maxsem synchronized php-1.21wmf8/extensions/GeoData [12:52:49] Logged the message, Master [12:54:56] !log maxsem synchronized php-1.21wmf7/extensions/GeoData [12:54:56] Logged the message, Master [13:11:14] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out [13:13:14] RECOVERY - Lucene on search1015 is OK: TCP OK - 3.003 second response time on port 8123 [13:57:25] New patchset: Mark Bergsma; "Add backends hash to role::cache::configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46745 [13:57:25] New patchset: Mark Bergsma; "Replace some backend definitions by $::role::cache::configuration hash counterparts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46746 [13:59:27] New patchset: Hashar; "(bug 44506) raise throttle for an Israel editthon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46747 [13:59:50] New patchset: Mark Bergsma; "Replace some backend definitions by $::role::cache::configuration hash counterparts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46746 [13:59:50] New patchset: Mark Bergsma; "Add backends hash to role::cache::configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46745 [14:00:34] Change merged: Mark Bergsma; [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/46745 [14:00:55] !! [14:00:56] ! [14:01:26] there's much more to come [14:01:38] that is the poor man Hiera [14:01:47] that is going to make stuff a bit cleaner indeed :-] [14:02:00] i have to do this slowly, step by step [14:03:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46746 [14:06:04] New patchset: Mark Bergsma; "include role::cache::configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46749 [14:06:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46749 [14:07:13] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:08:12] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.004 second response time on port 8123 [14:10:43] New patchset: Mark Bergsma; "Define test backends again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46750 [14:11:12] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:11:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46750 [14:12:03] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [14:18:08] New patchset: Mark Bergsma; "Remove unnecessary $multiple_backends intermediary hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46751 [14:20:37] New patchset: Mark Bergsma; "Fix test_wikipedia in beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46752 [14:28:40] New patchset: Mark Bergsma; "Remove unnecessary $multiple_backends intermediary hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46751 [14:28:40] New patchset: Mark Bergsma; "Fix test_wikipedia in beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46752 [14:31:32] New patchset: Mark Bergsma; "Remove unnecessary 
$multiple_backends intermediary hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46751 [14:32:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46751 [14:33:25] New patchset: Mark Bergsma; "Fix test_wikipedia in beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46752 [14:34:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46752 [14:36:32] New patchset: Mark Bergsma; "Remove unused backend "test"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46755 [14:37:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46755 [14:40:58] so hashar [14:41:10] we now have the new variable $::mw_primary which is set to the active mediawiki cluster [14:41:31] what is its content ? [14:41:33] for labs, which doesn't even have a second site yet, shall we just always set that to equal $::site ? [14:41:36] now it's 'eqiad' [14:41:44] but it can flip back to pmtpa when we switch back [14:41:57] I have no idea where the labs machine are [14:41:59] I guess in eqiad [14:42:04] no they're in pmtpa [14:42:06] oh [14:42:07] ;-D [14:42:42] so yeah just always set it to $::site [14:42:44] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 181 seconds [14:42:49] since beta does not have a cluster in eqiad [14:43:01] and it is unlikely we will have set it up one day with multiple datacenters [14:43:14] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds [14:44:08] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 201 seconds [14:44:09] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 201 seconds [14:44:59] when we do, it's easy to change [14:45:09] and it allows me to simplify stuff now [14:45:55] New patchset: Mark Bergsma; "Simplify $test_hostname definition" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/46756 [14:45:56] New patchset: Mark Bergsma; "Set $::mw_primary to 'eqiad' in realm production, to $::site elsewhere" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46757 [14:47:30] i keep forgetting selectors don't support multiple values [14:48:55] New patchset: Mark Bergsma; "Simplify $test_hostname definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46756 [14:48:56] New patchset: Mark Bergsma; "Set $::mw_primary to 'eqiad' in realm production, to $::site elsewhere" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46757 [14:49:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46756 [14:50:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46757 [14:59:18] New patchset: Mark Bergsma; "Change to case statements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46758 [15:00:43] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:01:13] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [15:01:41] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [15:01:50] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:02:03] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [15:04:27] New patchset: Mark Bergsma; "Change to case statements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46758 [15:06:06] New patchset: Mark Bergsma; "Change to case statements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46758 [15:08:29] New patchset: Mark Bergsma; "Change to case statements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46758 [15:09:11] Change merged: Mark Bergsma; [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/46758 [15:15:45] New patchset: Mark Bergsma; "$::mw_primary can now be used in realm labs as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46761 [15:15:46] New patchset: Mark Bergsma; "Remove backends/director backends duplication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46762 [15:17:46] New patchset: Mark Bergsma; "Remove backends/director backends duplication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46762 [15:19:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46762 [15:19:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46761 [15:22:34] ah oh [15:23:00] my bits labs ends up with a duplicate definition of the ipv4_10_4_0_166 backend [15:23:01] :D [15:23:54] New patchset: Mark Bergsma; "Cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46763 [15:23:54] New patchset: Mark Bergsma; "Move the logging into its own sub class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46764 [15:24:36] right [15:24:46] because the test server is the same as a normal bits app server [15:25:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46763 [15:25:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46764 [15:31:26] New patchset: Mark Bergsma; "Use stdlib values() to calculate varnish backends from directors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46765 [15:32:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46765 [15:34:22] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46747 [15:34:41] New patchset: Mark Bergsma; "Remove duplicate backends" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/46766 [15:35:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46766 [15:37:30] !log hashar synchronized wmf-config/throttle.php '(bug 44506) raise throttle for an Israel editthon {{gerrit|46747}}' [15:37:31] Logged the message, Master [15:37:32] New review: Hashar; "I have deployed the change." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46011 [15:38:24] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 44395) Allow bureaucrats to remove the translateadmin group on wikidatawiki {{gerrit|46011}}' [15:38:25] Logged the message, Master [15:38:45] New review: Hashar; "deployed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46747 [15:38:49] New review: Alex Monk; "recheck" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/46547 [15:40:13] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [15:40:13] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [15:40:13] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [15:40:13] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:40:13] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [15:40:14] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:40:32] New patchset: Mark Bergsma; "Remove duplicate cluster_options definitions using stdlib merge()" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46767 [15:41:21] yeah that is starting up again :-) [15:41:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46767 [15:42:01] role::cache::bits looks a lot better now :) [15:42:08] there's still room for improvement, but i'm gonna look at the other 
clusters first [15:42:19] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [15:43:05] I will have my patch rebased later tonight [15:43:23] that's mobile, haven't touched that yet [15:43:24] but [15:43:34] $cluster_options['test_server'], is that used anywhere? [15:43:37] it seems not [15:48:13] New patchset: Mark Bergsma; "Remove $cluster_options[test_server], always set [test_hostname] as it shouldn't hurt" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46768 [15:49:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46768 [15:56:49] mark: I am leaving now. If you ever want to test out on labs, the bits instance is deployment-cache-bits03.pmtpa.wmflabs [15:58:10] ok [15:58:12] *wave* [15:58:23] bye! [16:03:57] New patchset: Mark Bergsma; "Calculate the backends parameter from values($directors) if not explicitly given" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46770 [16:04:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46770 [16:04:51] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:07:42] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.003 second response time on port 8123 [16:08:01] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 16:07:55 UTC 2013 [16:08:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:08:52] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 16:08:42 UTC 2013 [16:09:22] New patchset: Mark Bergsma; "Remove superfluous backends parameters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46771 [16:09:41] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:09:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 16:09:49 UTC 2013 [16:10:11] 
PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [16:10:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46771 [16:10:41] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:10:42] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 16:10:36 UTC 2013 [16:11:41] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:13:08] New patchset: Mark Bergsma; "Define the API backend for mobile the proper way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46773 [16:13:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46773 [16:21:02] New patchset: Mark Bergsma; "Apparently mobile frontend contacts test directly, redefine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46774 [16:21:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46774 [16:27:24] New patchset: Mark Bergsma; "Simplify mobile cache backend configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46775 [16:28:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46775 [16:30:05] New patchset: Mark Bergsma; "$::role::cache::configuration::active_nodes is not per-realm yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46776 [16:30:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46776 [16:34:58] !log mw1072 going to down to check h/w state currently failed rt4381 [16:34:59] Logged the message, Master [16:36:50] New review: Nemo bis; "Reedy tested it: https://bugzilla.wikimedia.org/show_bug.cgi?id=15434#c59" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/33713 [16:39:58] Reedy: I think now ops should not be too worried about merging it 
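The stdlib-based refactors named in the patchsets above ("Use stdlib values() to calculate varnish backends from directors", "Remove duplicate cluster_options definitions using stdlib merge()") follow a pattern that can be sketched roughly like this; the variable names and option values are illustrative, not the actual role::cache manifests, and both functions come from puppetlabs-stdlib:

```puppet
# Rough sketch only; names and values are illustrative, not the real
# role::cache manifests. values() and merge() are puppetlabs-stdlib functions.

$directors = {
    'backend' => 'appservers.svc.eqiad.wmnet',
    'api'     => 'api.svc.eqiad.wmnet',
}

# Derive the backends list from the directors hash instead of keeping a
# second, hand-maintained copy of the same hosts in sync:
$backends = values($directors)

# Merge shared defaults with per-cluster overrides instead of repeating
# whole option hashes for every cluster:
$default_options = { 'retry503' => 1, 'test_hostname' => 'test.example.org' }
$cluster_options = merge($default_options, { 'retry503' => 4 })
```

The win in both cases is the same: one authoritative data structure, with the derived or overridden values computed rather than duplicated.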
[16:40:11] As long as it's run on a pmtpa host.. [16:41:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 188 seconds [16:41:46] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds [16:42:12] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 202 seconds [16:42:32] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 203 seconds [16:44:26] New patchset: Mark Bergsma; "Use lvs::configuration::service_ips for role::cache::upload backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46779 [16:45:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46779 [16:46:12] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [16:46:41] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [16:47:01] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [16:47:37] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [16:50:19] New patchset: Mark Bergsma; "backends parameter is now unneeded for upload-backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46780 [16:51:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46780 [16:51:50] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:54:57] New patchset: Reedy; "Add a sqldump script wrapper around mysqldump" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43844 [16:55:55] !log Created sites/site_identifier tables on hewiki and itwiki [16:55:57] Logged the message, Master [16:58:26] New patchset: Reedy; "Wikibase Client config for hewiki and itwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46782 [17:01:46] New patchset: Reedy; "Wikibase Client config for hewiki and itwiki" 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46782 [17:02:53] New patchset: Mark Bergsma; "Remove unused probes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46783 [17:03:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46783 [17:10:48] New patchset: Reedy; "Add wikidata poll for changes for itwiki and hewiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46784 [17:11:43] New patchset: Reedy; "Add wikidata poll for changes for itwiki and hewiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46784 [17:13:19] New patchset: Mark Bergsma; "Use the varnish probe for 2nd tier bits servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46785 [17:13:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46785 [17:16:58] hi [17:17:04] I'm looking for RobH [17:17:08] anyone know where he is ? [17:17:33] robh probably wont be in until 1p EST [17:18:42] !log Created wb_changes_dispatch on wikidatawiki [17:18:42] Logged the message, Master [17:24:12] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [17:33:20] PROBLEM - Varnish HTTP bits on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:00] PROBLEM - SSH on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:30] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:40] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:50] RECOVERY - SSH on strontium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:35:11] RECOVERY - Varnish HTTP bits on strontium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time [17:35:44] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:09] New 
patchset: Mark Bergsma; "500ms is a bit tight at times" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46787 [17:37:23] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:01] New patchset: Mark Bergsma; "esams doesn't use the bis probe anymore, 500ms is a bit tight at times" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46787 [17:39:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46787 [17:40:50] cmjohnson1: Ok, lets work in here ;] [17:41:18] k [17:41:38] so i am on the asw-c [17:42:42] sorry, got pinged for rt triage, back [17:43:03] Some basic juniper info gathering commands [17:43:18] show interface descriptions [17:43:30] you can tab complete most juniper commands [17:43:44] yeah i know that [17:44:50] ok, so run that command and you can see all the ports already labeled [17:45:02] yep [17:45:04] you'll notice it doesn't show ports that are blank/brandnew/neversetup [17:45:11] can also show vlans [17:46:05] so we are putting these in c6 correct? [17:46:25] correct [17:46:36] so the cool part is these are all in a single range of ports [17:46:42] so adding to the proper vlan is a single command [17:46:54] the not so cool part, im not sure we can use shell logic to label the ports [17:47:02] so you may have to just label each one individually [17:47:06] (not that big a deal) [17:47:11] nah [17:47:11] so lets go over labels first [17:47:41] first command: configure [17:47:49] second command: edit interfaces [17:48:04] networking gear is all about scope =P [17:48:20] so once you are in edit interfaces, you can edit them directly, in this case [17:49:09] ahh, these are also disabled, interesting (easy to fix) [17:49:16] so edit ge-6/0/0 [17:49:19] then: show [17:49:27] and you'll see it has no name, but is disabled. [17:49:36] We'll name them all first, then put into proper vlan, then enable. 
[17:49:44] New patchset: Ottomata; "Adding kafka module for review." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46618 [17:49:55] so once you do the show, you see the 'disabled' part [17:50:10] yep [17:50:20] you can: set description "name" or name without " if its a single line of characters [17:51:01] we are at mw1161 [17:51:07] for the first of this batch [17:51:17] correct [17:51:38] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:51:39] hrmm, port shows enabled now when i run show [17:51:49] did you enable it, or did it do that automatically when you set a name? [17:51:53] yes [17:51:59] no i enabled it [17:52:11] and set description to mw1161 [17:52:16] nothing is plugged into it yet right? [17:52:20] no [17:52:29] ok, if there was would be best to not enable it until it was in proper vlan [17:52:32] but since its not, you are fine [17:52:43] since there isnt (anything plugged in) i mean [17:53:00] So you have the port naming down, drop back to edit interfaces [17:53:07] (can type in up) [17:53:20] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:53:25] since you have that part done, we'll do vlan for them all now, then you can work on the labels for the rest [17:53:30] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [17:53:35] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [17:53:36] then i can review it before you do first commit [17:53:41] then you'll be set to do this when needed [17:54:08] so we assume we can load up all power plugs with servers [17:54:17] so thats 40 servers. 
(since we lose 2 pair for switches) [17:54:49] so ge-6/0/0 through ge-6/0/39 [17:55:42] k [17:55:55] hrmm [17:56:02] i have not added via range, trying to tab out the command [17:56:15] oh thats right, not setup, wont tab, heh [17:56:28] so i think [17:56:28] set interface-range vlan-private1-c-eqiad member-range ge-6/0/0 to ge-6/0/39 [17:56:35] do that, then we can show | compare [17:56:37] to see your changes [17:57:51] lemme know when you do it so i can check [17:58:19] there is no way to add descriptions as a group....:-\ [17:58:39] well, there prolly is with a for loop. [17:59:25] but meh. [18:00:32] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [18:00:49] cmjohnson1: so when you finish them all, you can show | compare [18:00:57] and see changes (lemme see em too before you do first commit) [18:01:02] so ping me when done [18:01:03] k [18:01:05] k [18:01:06] (or if you have questions) [18:08:31] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [18:10:48] robh: okay...check it [18:11:47] checking [18:12:14] cmjohnson1: I don't see a vlan change [18:12:50] oops..ok [18:12:52] now [18:12:59] cmjohnson1: So you should go ahead and do the vlan change and enable the ports [18:13:03] then commit it all together [18:13:37] you may be able to enable in a range [18:13:39] i would try. [18:13:47] going to [18:14:10] i was mistaken about what order to do it [18:14:15] since its all committed at the same time it doesnt matter. 
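Pulled together, the sequence walked through above amounts to something like the following Junos configuration session. This is a simplified sketch, not a verbatim capture: the prompt is invented, output is elided, and only the port range and interface-range name come from the discussion itself.

```
user@asw-c> configure
user@asw-c# edit interfaces
[edit interfaces]
user@asw-c# set ge-6/0/0 description mw1161
user@asw-c# delete ge-6/0/0 disable
user@asw-c# set interface-range vlan-private1-c-eqiad member-range ge-6/0/0 to ge-6/0/39
user@asw-c# show | compare
user@asw-c# commit
```

Nothing takes effect until `commit`, which is why the order of the name/vlan/enable steps doesn't matter: `show | compare` shows the whole pending candidate configuration, and the commit applies it atomically.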
[18:14:27] (the name, enable, vlan, blah) sorry dude [18:18:42] cmjohnson1: if not sure how to do as a range [18:18:49] we could always beg LeslieCarr for a hint ;] [18:18:57] not yet [18:19:14] hah [18:28:23] preilly: https://gerrit.wikimedia.org/r/#/c/46678/ from last night [18:28:36] i think your last two commits need to be merged into that [18:29:56] in sartoris.py all that remains unimplemented is the revert() method [18:30:05] aiming to have that finished today. Although testing can move forward on the other pieces [18:30:06] https://gerrit.wikimedia.org/r/#/c/42791/ [18:32:44] ok lesliecarr: how do i enable multiple interfaces at one time [18:33:18] (robh) ^ [18:34:13] New review: Asher; "Not going to work as is:" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/42791 [18:36:21] cmjohnson1: good question, trying to figure it out now [18:36:25] i hope its possible! =P [18:38:32] New patchset: Reedy; "Fixup usages of constants in wgNamespacesWithSubpages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46792 [18:39:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46792 [18:40:32] who knows about how the blog is set up? e.g. i'm wondering if you use hyperdb. looks like hyperdb isn't mentioned in the latest production puppet (just did a pull) [18:40:42] i guess maybe binasher [18:41:35] cmjohnson1: So how did you set the ones enabled that are there now? [18:41:52] one at a time [18:41:57] with what command? [18:42:04] i wanna make sure im doing same and see if i can loop it. [18:42:19] (though i duno if it will let me in juniper cli) [18:42:40] set enable ge-6/0/0 [18:43:09] hrmm [18:43:41] cmjohnson1: in what scope? 
[18:43:48] cuz that doesnt work for me in interfaces scope [18:47:26] robh: sorry this is the command in edit set interfaces ge-6/0/2 enable [18:47:45] rfaulkner: you need to rebase https://gerrit.wikimedia.org/r/#/c/46678/ [18:47:46] ahh, ok [18:48:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46684 [18:48:41] !log downgrading memcached to ubuntu version on virt0 [18:48:42] Logged the message, Master [18:53:10] cmjohnson1: Soooo there is no range command like on vlans [18:53:20] leslie is in office now so i ambushed her with questions the second she walked in. [18:53:25] gotta for loop it [18:53:34] awesome [18:53:41] New patchset: Reedy; "Remove setting of NS_SPECAIAL to have no subpages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46798 [18:54:19] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [18:54:30] cmjohnson1: arghhhhh, ok now i get it [18:54:36] so LeslieCarr does a for loop output into bash [18:54:41] to generate a huge ass list of commands [18:54:47] which she then cuts and pastes into the juniper cli. [18:54:51] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46798 [18:55:03] this makes more sense, as i have spent the past 5 minutes trying to make it run in juniper cli. 
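A loop of the kind described, run in a regular shell with its output pasted into the Junos CLI, might look like this for the c6 batch. The port range is from the example above; the mw1161-onward numbering of the descriptions is an assumption for illustration.

```shell
#!/bin/bash
# Sketch: generate one set of Junos commands per port, for pasting into
# the CLI. Assumes ports ge-6/0/0..39 map to servers mw1161..mw1200.
for i in {0..39}; do
    echo "set interfaces ge-6/0/$i description mw$((1161 + i))"
    echo "delete interfaces ge-6/0/$i disable"
done
```

Brace expansion (`{0..39}`) is bash-native and avoids the subshell that backtick-`seq` forks, which is the same tip given later in the channel for the `seq 0 47` variant.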
[18:55:10] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [18:55:56] robh: ok [18:56:02] !log reedy synchronized wmf-config/InitialiseSettings.php [18:56:04] Logged the message, Master [18:58:54] hey cmjohnson1 [18:59:02] hey lesliecarr [18:59:10] for i in `seq 0 47` ; do echo "delete ge-1/0/$i disable" ; done [18:59:14] New patchset: Reedy; "Remove setting of NS_MAIN to false in $wgNamespacesWithSubpages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46799 [18:59:18] would be the loop i'd use to generate the 5 million delete commands [18:59:22] rob just told me about today [18:59:49] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46799 [19:00:31] preilly: https://gerrit.wikimedia.org/r/46800 [19:01:17] LeslieCarr: you don't need to fork a sub shell to run the seq command. bash native would be: for i in {0..47} [19:01:20] rfaulkner: Change has been successfully merged into the git repository. [19:01:32] awesome thanks. [19:01:51] rfaulkner: did you fix https://gerrit.wikimedia.org/r/#/c/46678/ ? [19:02:23] rfaulkner: you need to rebase it [19:02:26] hmm. that shouldn't have happened [19:02:41] rfaulkner: Project policy requires all submissions to be a fast-forward. [19:02:41] Please rebase the change locally and upload again for review. [19:02:55] will do [19:06:06] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else to 1.21wmf8 [19:06:07] Logged the message, Master [19:06:23] rebased, looks like a pep8 violation creeped in. 
submitted for review: https://gerrit.wikimedia.org/r/46801 [19:06:36] preilly ^ [19:07:19] ottomata: can you deploy the latest from https://gerrit.wikimedia.org/r/gitweb?p=analytics/E3Analysis.git;a=summary to stat1001 [19:07:26] rfaulkner: what's the deal with: https://gerrit.wikimedia.org/r/#/c/46800/1/sartoris/sartoris.py [19:08:21] rfaulkner: I mean https://gerrit.wikimedia.org/r/#/c/46678/2/sartoris/sartoris.py [19:09:31] ah right. so log_deploys() requires that the last "n" deployments be logged [19:09:50] this option is a way of specifying "n" via commandline [19:10:05] rfaulkner: you still need to rebase: https://gerrit.wikimedia.org/r/#/c/46678/2 as well [19:10:11] New patchset: Reedy; "Everything else over to 1.21wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46802 [19:10:11] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [19:10:19] rfaulkner: and submit a new change set [19:10:28] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46802 [19:10:28] rfaulkner: with a git commit --amend [19:11:11] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [19:12:11] New patchset: Aude; "Enable WikibaseClient on hewiki and itwiki, update settings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46803 [19:12:47] !log reedy synchronized php-1.21wmf8/extensions/Wikibase/ [19:12:48] Logged the message, Master [19:14:23] rfaulkner: nevermind I fixed https://gerrit.wikimedia.org/r/#/c/46678/ for you ;-) [19:14:55] rfaulkner: zero open change-sets https://gerrit.wikimedia.org/r/#/q/status:open+project:sartoris,n,z [19:14:56] preilly: thanks. sorry get pulled in several directions here. 
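The fix-up flow preilly describes (amend the commit so the pep8 fix stays in the same Gerrit change-set, then rebase so the upload is a fast-forward) is, in plain git terms, roughly the following. This is a sketch played out in a throwaway repo; the branch names, files, and commit messages are invented, and in real Gerrit usage it is the Change-Id footer in the commit message that ties the amended commit back to the same change-set.

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.name dev; git config user.email dev@example.org
trunk=$(git symbolic-ref --short HEAD)      # 'master' or 'main', per git version

echo base > app.py
git add app.py; git commit -q -m "base"     # upstream history

git checkout -q -b my-change                # the original patchset
echo "x=1" >> app.py
git add app.py; git commit -q -m "my change"

git checkout -q "$trunk"                    # trunk moves on while in review
echo other > other.py
git add other.py; git commit -q -m "upstream change"

git checkout -q my-change
echo "x = 1" > fixed.py                     # the pep8 fix...
git add fixed.py
git commit -q --amend --no-edit             # ...folded into the same commit
git rebase -q "$trunk"                      # now a fast-forward on top of trunk
git log --format=%s                         # newest first
```

After the rebase the branch is one amended commit on top of the current trunk, which is exactly what a "fast-forward only" Gerrit project will accept as a new patchset.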
[19:15:02] New patchset: Reedy; "Remove unneeded wgNamespacesWithSubpages entries from CommonSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46804 [19:15:10] wfh right now, about to head [19:15:11] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46804 [19:15:11] in [19:15:35] preilly: will come to chat with you in the early afternoon [19:17:28] rfaulkner: okay sounds good [19:20:05] New patchset: Reedy; "Wikibase Client config for hewiki and itwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46782 [19:20:07] Change abandoned: Reedy; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46782 [19:22:02] !log reedy synchronized wmf-config/ [19:22:03] Logged the message, Master [19:22:37] New patchset: Reedy; "Remove stray sitenotice comment" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46806 [19:23:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46806 [19:24:04] New patchset: Aude; "Enable dispatch changes script for wikidata, disable pollforchanges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46807 [19:24:30] Change abandoned: Reedy; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46784 [19:27:11] New review: Reedy; "You want to make sure the old crons are removed. 
Need something like" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/46807 [19:34:50] New patchset: Reedy; "Enable WikibaseClient on hewiki and itwiki, update settings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46803 [19:35:53] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 186 seconds [19:36:26] New patchset: Aude; "Enable dispatch changes script for wikidata, disable pollforchanges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46807 [19:36:27] silly pmtpa slave [19:37:41] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [19:38:31] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46803 [19:39:25] binasher: what ext is calling PopulateFundraisingStatistics::updateDays ? [19:39:28] New patchset: Aude; "Enable dispatch changes script for wikidata, disable pollforchanges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46807 [19:39:34] seems to be on a cron [19:41:44] adding crap to dberrors.log [19:42:23] New patchset: Silke Meyer; "Definition of a function that gets MW extensions with less code" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46809 [19:42:43] !log added updated udplog_1.8-5 .debs to apt.wikimedia.org. No real changes, these now include the packet-loss program in the package. [19:42:44] Logged the message, Master [19:42:59] New patchset: Asher; "unmodified from mha4mysql-node, by Yoshinori Matsunobu (here to be modified to meet a security requirement)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46810 [19:43:09] AaronSchulz: ContributionReporting [19:43:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46810 [19:45:09] New review: Silke Meyer; "This is not yet working. mediawiki.pp is okay but something is missing in wikidata.pp to "see" the n..." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/46809 [19:45:12] !log reedy synchronized wmf-config/ 'Enabling wikidata client on itwiki and hewiki' [19:45:13] Logged the message, Master [19:46:05] New patchset: Reedy; "(bug 44411) Added new import sources to dewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46187 [19:48:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46187 [19:51:06] New patchset: Asher; "build dsn from a root mysql defaults file instead of from cli options." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46815 [19:51:22] Is anyone about to review and deploy https://gerrit.wikimedia.org/r/#/c/46807 for the wikidata deployment? [19:51:32] Please [19:52:23] New patchset: Asher; "gdash graphs for the api" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46816 [19:52:29] AaronSchulz: i'll take a look [19:53:41] AaronSchulz: about the job duplicate insert stat, were you looking into that stats call never being passed a positive int? [19:54:23] oh yeah, PopulateFundraisingStatistics::updateDays is a cron [19:54:35] wtf [19:54:42] please :) [19:54:52] we're ready for the new dispatcher script [19:57:08] AaronSchulz: extensions/ContributionReporting -- the cron has been in place for over a year [19:57:50] * AaronSchulz filed a bug [19:58:16] doesn't look like the extension has been modified lately either [20:01:11] i wonder if the schema was changed on the public_reporting_days table on the fr db [20:01:17] Jeff_Green: ^^ [20:02:32] certainly possible. fr-tech was converting some tables to innodb. checking... [20:02:40] Jeff_Green: see hume:/tmp/PopulateFundraisingStatistics-updatedays.log - this job is failing and spamming the dberror log every 15min [20:02:46] k [20:04:32] PopulateFundraisingStatistics::updateDays makes an insert for many days with their fr totals, etc but day is the primary key, so fail. 
doesn't look like it would have ever worked without changes on the code (or using insert ignore) or with a different schema [20:05:37] Jeff_Green: binasher: we didn't touch it AFAIK. unfortunately adam last touched that a month or so ago and he is … not here [20:05:54] He'll be here soon. [20:06:39] k, it might be easiest to have him take a look when he gets in [20:06:49] maybe its been broken for a while but something else broken prevented the query errors from making it to our central dberror.log [20:07:13] entirely possible [20:07:39] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 20:07:33 UTC 2013 [20:07:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:07:49] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 20:07:46 UTC 2013 [20:08:14] Heja ops, is anybody willing to investigate further HTCP cache purging issues, or are you already tired? ;) [20:08:15] "htcp cache purges for images do not seem to clear europe upload squid caches" https://bugzilla.wikimedia.org/show_bug.cgi?id=44508 which has a great initial analysis, no ranting mob attached (yet), and a very helpful reporter, and I can reproduce the problem (here in Europe). [20:08:15] And if you're all tired of investigating this I should probably fire an RT ticket, or an email to ops@? [20:08:39] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:08:49] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 30 20:08:40 UTC 2013 [20:09:06] !log reedy synchronized wmf-config/InitialiseSettings.php [20:09:06] Logged the message, Master [20:09:17] andre__: bawolff is a treasure, remember ;) [20:09:35] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46815 [20:09:39] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:09:44] Nemo_bis: yeah. Still it's lovely how he tracked it down. 
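The updateDays failure mode analyzed above (a cron repeatedly re-inserting rows whose primary key is the day) can be illustrated with a minimal, hypothetical MySQL schema; the real public_reporting_days columns may well differ:

```sql
-- Hypothetical minimal schema for illustration only.
CREATE TABLE public_reporting_days (
  day   INT UNSIGNED  NOT NULL PRIMARY KEY,
  total DECIMAL(12,2) NOT NULL
);

INSERT INTO public_reporting_days (day, total) VALUES (20130130, 1000.00);

-- The cron re-runs and re-inserts the same day: duplicate-key error,
-- which is what spams dberror.log every 15 minutes.
INSERT INTO public_reporting_days (day, total) VALUES (20130130, 1200.00);

-- The workarounds mentioned above: swallow the duplicate...
INSERT IGNORE INTO public_reporting_days (day, total) VALUES (20130130, 1200.00);
-- ...or turn the repeat into an update (upsert):
INSERT INTO public_reporting_days (day, total) VALUES (20130130, 1200.00)
  ON DUPLICATE KEY UPDATE total = VALUES(total);
```

For a stats-refresh cron the upsert form is usually what's wanted, since later runs carry newer totals that should replace the earlier row rather than be discarded.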
[20:10:28] binasher: I'm confused--I don't even see a public_reporting_days table on the fundraising db's [20:10:33] huh... [20:10:59] New patchset: Aaron Schulz; "Only sleep if no jobs where found." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46819 [20:11:12] New patchset: Reedy; "Enable subpages in NS_PROJECT by default" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46820 [20:11:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46820 [20:11:45] binasher: https://gerrit.wikimedia.org/r/#/c/46819/ [20:13:56] andre__: please to be filing ticket and email to ops does not hurt [20:14:06] and after lunch [20:14:11] LeslieCarr, alright, thanks! [20:15:13] why is the apaches dsh node missing so many apaches =P (also the script references tampa directly for some reason still) [20:16:52] anyone want to help us with https://gerrit.wikimedia.org/r/#/c/46807/ ? [20:16:59] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 183 seconds [20:17:26] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 195 seconds [20:17:36] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 201 seconds [20:17:53] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 209 seconds [20:17:55] New patchset: Reedy; "Make NS_PROJECT have subpages for de/it wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46821 [20:18:14] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46821 [20:18:51] * RobH discovers they are missing cuz they are offline, but have no tickets or references to being offline [20:20:24] !log mw23 was stuck in installer, restarting and setting up. 
[20:20:25] Logged the message, RobH [20:21:32] !log reedy synchronized wmf-config/InitialiseSettings.php [20:21:33] Logged the message, Master [20:25:04] cmjohnson1 & sbernardin : I am pushing a racktables data output into a gdoc spreadsheet [20:25:13] so the three of us can start populating racktables with the missing info. [20:25:22] k [20:25:29] ie: something to do when not onsite, but needs to happen faster than I can do alone. [20:25:43] OK [20:25:44] I'll email you both with the details, but the three of us will need to chip away at this thing [20:26:35] cool [20:32:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46807 [20:32:26] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:32:29] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:32:36] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:32:56] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:32:56] PROBLEM - Host mw1072 is DOWN: PING CRITICAL - Packet loss = 100% [20:33:36] RECOVERY - Host mw1072 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [20:34:27] RobH: osm-db1, osm-db2, and osm-cp1 - osm-cp4 are complete [20:34:50] sbernardin: awesome, i'll take a look at them shortly, thanks! [20:35:17] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [20:35:17] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 182 seconds [20:36:00] PROBLEM - Apache HTTP on mw1072 is CRITICAL: Connection refused [20:36:00] New patchset: Aaron Schulz; "Removed maxdelay hack and tweaked the values." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46824 [20:36:00] New patchset: Aaron Schulz; "Only sleep if no jobs where found." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/46819 [20:36:10] notpeter: https://gerrit.wikimedia.org/r/#/c/46819/ [20:37:25] AaronSchulz: hah, awesome [20:37:29] stachanovism [20:38:17] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds [20:38:17] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [20:38:47] AaronSchulz: should I deploy and restart teh job runnarz? [20:39:03] sure [20:39:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 217 seconds [20:39:59] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 225 seconds [20:41:21] * AaronSchulz gets loads of fingerprint errors on sync [20:41:36] sbernardin: so https://rt.wikimedia.org/Ticket/Display.html?id=4330 is all done? [20:41:50] if so pls resolve ticket with details, thx =] [20:42:13] cmjohnson1: uhhh [20:42:23] Yes [20:42:24] when you guys just changed them around, did the ip addresses for mgmt still link up [20:42:33] sbernardin: thats for you too, that question [20:42:35] Will resolve it now [20:42:41] AaronSchulz: you're a fingerprint error on sink [20:42:41] ie: you guys swapped osm-cp and osm-db around [20:42:43] sync [20:42:44] man [20:42:48] does osm-cp1 still go to osm-cp1? [20:42:50] etc? [20:42:52] jetlag is a hell of a drug...
[20:43:07] * aude thinks osm = openstreetmap [20:43:16] confused but knows the difference here :) [20:43:19] No ...just relabeled in racktables [20:43:27] * AaronSchulz gets food and hopes that stuff works on his return [20:43:27] sbernardin: yea, thats not going to work [20:43:36] so we need to ensure the drac is right [20:43:39] OK [20:43:46] ie: if i ssh into osm-cp1 it goes into the right server [20:43:57] robh: the dracs are going to be wrong [20:44:15] i forgot about that [20:44:39] so the network cfg will need to be fixed on the servers [20:44:48] sbernardin ^ [20:45:06] robh is probably diligently getting the right ip addys now [20:45:06] So I need to update the drac ips [20:45:17] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 194 seconds [20:45:17] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 195 seconds [20:45:18] notpeter: thanks! [20:45:20] yes [20:45:30] im actually logging into drac on them and determining which ones are wrong [20:45:38] compareing service tag in drac to racktables [20:45:56] can you, Reedy or whoever keep an eye on the job runner logs, now that we have wikidata jobs on a few wikipedias [20:45:59] :-] [20:46:00] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46819 [20:47:07] aude: What are they called? 
[20:47:07] New patchset: Reedy; "Use overriding to muchly simplify wgNamespacesWithSubpages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46826 [20:47:11] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 23 seconds [20:47:16] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [20:47:17] Nemo_bis: https://gerrit.wikimedia.org/r/46826 [20:47:17] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [20:47:22] aude: sure [20:47:42] notpeter: when you are done looking at mw1072 can you add back to dsh group plz (assuming all is good) [20:47:56] Reedy: no idea about the job runner, but dispatcher is dispatcher.log [20:48:05] in the wikidata directory (where we had pollofrchanges) [20:48:09] aude: No, the jobs. What class are they? [20:48:13] Reedy: wow, does it work also with those nasty arrays? :) [20:48:17] sbernardin: you are just going to have to reaudit them all, cuz they are all fucked up. [20:48:20] Reedy: the worst are search defaults [20:48:23] aude: pollfor or pollofr? [20:48:25] lemme get ips into the ticket i linked for you [20:48:35] OK [20:48:38] Nemo_bis: That's on my todo list next [20:48:41] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [20:48:57] Reedy: :D and how did you do it? [20:49:04] you mentioned a script? [20:49:15] http://p.defau.lt/?_qfVhVqaMppPh9HNfJjaqQ [20:49:20] Essentially that [20:49:22] Reedy: ChangeNotificationJob [20:49:24] with the arrays populated [20:49:33] heh, if I knew PHP :) [20:50:17] !log authdns-update [20:50:18] Logged the message, RobH [20:50:19] jeremyb: it's a bit late here.... [20:50:31] Achievements: made InitialiseSettings.php prettier. [x] [20:50:44] aude: i think it's nearly 10pm :) [20:50:50] jeremyb: yes [20:51:00] late to still be at the office [20:51:02] aude: remember my clock is on UTC :) [20:51:10] aude: ouch! [20:51:17] aude: No sign of any yet...
[20:51:23] none in the queues either [20:51:24] hmmm [20:51:26] takes 5 minutes [20:51:27] sbernardin: https://rt.wikimedia.org/Ticket/Display.html?id=4330 is updated with IP addresses [20:51:33] for dispatching changes to run [20:51:45] I'm AFK for a 10 minutes or so [20:51:46] has it been 5 minutes yet? [20:51:50] ok [20:51:52] Puppet is slow [20:51:56] yeah [20:52:03] puppet and then 5 min [20:52:52] cmjohnson1: yep [20:57:26] cmjohnson1: mw1072 is happy now [20:57:33] sometimes puppet just needs to run a couple of times [20:57:37] awesome [20:57:38] okay [20:57:47] thx for looking into it for me [20:57:49] there can be ordering issues, but they usually sort themselves out [20:57:55] yep! [20:59:07] !log reedy cleared profiling data [20:59:07] Logged the message, Master [21:01:46] New patchset: Reedy; "Use overriding to muchly simplify wgNamespacesToBeSearchedDefault" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46873 [21:01:51] Nemo_bis: ^ [21:02:57] Around 19KB removed from InitialiseSettings [21:03:15] AaronSchulz: https://noc.wikimedia.org/cgi-bin/report.py Do you know how to fix the /0? [21:03:26] I'm wondering if no data is getting there [21:03:31] * Reedy will look when he returns [21:03:56] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [21:05:48] Dereckson: perfect time to sneak a small config change in, so that Reedy has to do some nasty rebasing [21:07:05] Reedy: I'm wondering, now where do we find a rogue op to deploy the special pages update? 
:) [21:08:11] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [21:08:12] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [21:08:17] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 190 seconds [21:08:17] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 191 seconds [21:09:51] icinga-wm: you look a bit redundant [21:10:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46816 [21:10:16] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 184 seconds [21:10:26] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [21:11:39] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 240 seconds [21:11:56] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 255 seconds [21:18:16] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 219 seconds [21:18:25] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 221 seconds [21:18:50] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 244 seconds [21:18:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 245 seconds [21:19:43] notpeter: Reedy do we know if our cronjob is enabled now? e.g. propagated via puppet? [21:19:48] any issues? [21:20:00] Nemo_bis: now nagios is the redundant one :) [21:20:15] * aude doesn't see anything in recentchanges yet on itwiki [21:22:15] aude: looking [21:22:24] thanks [21:22:37] * aude checks hungarian wikipedia to see if the old cronjob has stopped [21:23:20] or it might actually take some time to catch up, perhaps [21:24:24] aude: the new crons are in place on hume [21:24:33] hmmm, ok [21:24:39] what would the output look like? 
[21:24:49] nothing coming up in a grep of the runJobs log [21:24:54] is there stuff in the wb_changes_dispatch table on wikidatawiki? [21:24:56] for just wikidata [21:25:11] hmmm [21:25:41] aude: should they display on histories too? [21:25:46] Nemo_bis: not yet [21:25:54] ok [21:26:32] on our todo but inserting stuff into revision history can have a lot of unintended side effects if we don't get it 100% right [21:26:44] soon i hope though [21:27:24] multicast is not broken on dobson [21:28:28] awwww toolserver does not have the job table :/ [21:28:38] probably private details in it [21:28:47] aude: that table is populated by 4 rows [21:29:00] notpeter: that's good [21:29:07] yeah, looked right [21:29:18] hewiki, huwiki, itwiki, and test2wiki [21:29:20] then the issue is are the jobs send to the clients [21:29:22] yep [21:29:28] are they just in the queue [21:29:29] ? [21:29:37] any errors in any logs? [21:29:49] do we just have to be patient? is the queue long? [21:30:08] * aude could check the recentchanges table on the toolserver [21:30:09] the queue shouldn't be long [21:30:29] ok [21:31:16] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [21:31:25] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [21:31:36] nothing for itwiki recentchanges yet [21:32:16] the jobrunner log hasn't seen any activity for wikidatawiki in about 1.5 hrs [21:32:21] soooo, I'd say something isn't right [21:32:30] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [21:32:39] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [21:33:48] RobH: drac settings have been updated for the 6 osm servers [21:33:51] hmmm [21:33:55] the jobs should be in the clients [21:34:08] sbernardin: awesome, thanks! 
[21:34:17] itwiki, huwiki, hewiki [21:34:22] ah, ok [21:34:48] it might be going through everything in the wb_changes table, so it's possible it just takes time to catch up [21:35:11] there will be no duplicates inserted in the clients, though [21:35:40] if the jobrunner has not crashed (e.g. other jobs are fine) and the error log is quiet, then i assume we just need to be patient [21:36:22] the script does not impact the links appearing in the sidebar, except pages won't automatically get purged until the script handles it [21:36:26] but manual purges work [21:36:53] multicast is not broken on hooft either [21:37:42] ah, i see wikidata changes in my watchlist on itwiki [21:37:55] looks like it is working :D [21:40:10] the runjobs.log isn't showing any job running in the last 1.5 hrs or so [21:40:20] binasher: am i right, there are no indexes on the recentchanges table? [21:40:26] notpeter: in the clients? [21:40:33] the wikipedias? [21:41:35] notpeter: jobs are running [21:41:43] binasher: notpeter ok [21:41:54] notpeter: no logfile on fluorine has been updated in the last 1.5 hrs [21:42:12] oh, ok [21:42:14] phew [21:42:16] hmmmm.... no logfile updates but jobs are running anyway [21:43:17] hmm log packets are hitting fluorine [21:43:23] !log restarting udp2log on fluorine [21:43:24] Logged the message, Master [21:44:08] csteipp: when's the ganglia bugfixes coming ? [21:44:49] LeslieCarr: Most are merged into master, but they hadn't put together a release last time I checked [21:44:56] ok [21:45:19] oh yeah, someone broke udplog [21:45:33] who added new udplog packages on brewster at 11:40am today? [21:45:45] is there nothing in SAL ? [21:46:40] 19:42 ottomata: added updated udplog_1.8-5 .debs to apt.wikimedia.org. No real changes, these now include the packet-loss program in the package. [21:46:57] ah [21:47:17] no real change, except working [21:47:41] is feature?
[21:49:00] !log restarting varnishhtcpd on cp3010 - it appears to be locked up , netstat -l shows a huge Recv-Q for port 4827 and appears to be the previously discovered issue [21:49:01] Logged the message, Mistress of the network gear. [21:50:32] anyone in europe right now? andre__ ? [21:51:19] * Nemo_bis is [21:51:31] Ditto [21:51:32] LeslieCarr: zeljkof would be if he's around [21:51:48] can you try some sort of image purge like https://bugzilla.wikimedia.org/show_bug.cgi?id=44508 ? [21:52:01] LeslieCarr, chrismcmahon: I am around [21:52:05] !log started udplog with correct mw config on fluorine [21:52:05] Logged the message, Master [21:52:22] h [21:52:22] aude: [21:52:23] 2013-01-30 21:52:07 mw1012 huwiki: ChangeNotification Speciális:ChangeNotificationJob repo= changeIds=Array t=42 error= [21:52:31] it looks like upload purging was only broken on one upload varnish cache - but wanna make sure [21:52:37] Reedy: looks good [21:52:39] no error? [21:52:52] LeslieCarr: what I need to do? [21:53:00] * aude sees stuff in the recentchanges and watchlist on both huwiki and itwiki (and presumably hewiki) [21:53:12] LeslieCarr: how about cp3009 ? it sent me some junk previously [21:53:24] do note that there is a discrepancy (sp?) between what i see in itwiki and huwiki (watching the same article) [21:53:33] reedy@fluorine:/a/mw-log$ tail -n 100000 runJobs.log | grep -c .ChangeNotification [21:53:33] 250 [21:53:36] some changes seem missing in itwiki [21:53:44] hmmmm, good [21:53:51] I can only see huwiki jobs though [21:54:00] huh? [21:54:01] hi? [21:54:05] i do see stuff in itwiki [21:54:25] cp3009 appeared to not be locked up [21:54:44] and it's not like itwiki is just showing me the first three chronologically changes [21:54:48] binasher, sorry was afk for a sec [21:54:55] it seems to have skipped some, but it's all the same article [21:54:56] is puppet installing latest? [21:54:59] udplog? 
[21:55:04] crap checking [21:55:13] X-Cache: cp1030 hit (2), cp3010 hit (2), cp3009 frontend miss (0) [21:55:15] we have a deploy scheduled for tomorrow, am preparing [21:55:22] could be an issue with the dispatching script but if there are no errors,, then might be okay at least for tonight [21:55:32] LeslieCarr: purge didn't seem to work, see above [21:55:35] # make sure the udplog package is installed [21:55:35] package { udplog: [21:55:35] ensure => latest; [21:55:35] } [21:55:35] } [21:55:37] AHHH FOILED [21:55:48] that's a bad way to do it, grrr [21:55:53] Nemo_bis: damn, new purge and it didn't get purged out by 3010 ? [21:56:09] LeslieCarr: indeed [21:56:16] aude: Yeah, no itwiki jobs. I see for hu and he though [21:56:36] lol, AaronSchulz [21:56:37] 2013-01-30 21:53:22 mw1009 hewiki: ChangeNotification מיוחד:ChangeNotificationJob repo= changeIds=Array STARTING [21:56:43] Nemo_bis: grrrrrr [21:56:58] LeslieCarr: now trying clicking purge a couple dozens times – sometimes it works [21:57:01] Reedy: hm? [21:57:06] hmmm [21:57:12] Ryan_Lane: I am attempting at cleaning up tickets. https://rt.wikimedia.org/Ticket/Display.html?id=1839 is for pushing the mediawiki tarballs via CT integeration [21:57:25] binasher says some job runner log is not updating [21:57:28] New patchset: Ottomata; "Ensuring udp2log package is present, not just latest." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46877 [21:57:29] btw, does anyone know why cp3003 and cp3004 aren't in the lvs configs ? [21:57:31] you were the last person to comment, was wondering who you thought should be followed up on to clear this ticket. [21:57:33] aude: i fixed it [21:57:34] but doesn't mean the jobs are not running, i guess? 
[21:57:37] binasher: ok [21:57:48] The hewiki job show the date backwards in runJobs.log [21:57:52] aude: re: recentchange, it does have several indexes [21:58:04] binasher: hmmm [21:58:15] !log mw23 offline, reinstalled with precise and signed into puppet, puppet throwing errors on initial run. [21:58:16] LeslieCarr: maybe it worked now X-Cache: cp1030 miss (0), cp3010 miss (0), cp3010 frontend miss (0) [21:58:17] Logged the message, RobH [21:58:21] whew [21:58:29] though [21:58:33] that's all cp3010 [21:58:42] the receive queue hasn't gone wonky yet [21:58:51] aude: http://p.defau.lt/?ut0nI3655DIn__MYwDC6PQ [21:59:00] hrm though [21:59:02] binasher: i see [21:59:05] actually it looks like it didn't reload [21:59:07] none on rc_type though [21:59:08] oh tim said that [21:59:26] LeslieCarr: but if I try a simpler wget I get X-Cache-Lookup: HIT from sq84.wikimedia.org:3128 [21:59:33] aude: PRIMARY KEY (`rc_id`), KEY `rc_timestamp` (`rc_timestamp`), KEY `rc_namespace_title` (`rc_namespace`,`rc_title`), KEY `rc_cur_id` (`rc_cur_id`), KEY `new_name_timestamp` (`rc_new`,`rc_namespace`,`rc_timestamp`), KEY `rc_ip` (`rc_ip`), KEY `rc_ns_usertext` (`rc_namespace`,`rc_user_text`), KEY `rc_user_text` (`rc_user_text`,`rc_timestamp`) [21:59:53] aude: right, none on rc_type [21:59:54] !log killing varnishhtcpd on cp3009 for realzies this time [21:59:56] Logged the message, Mistress of the network gear. [21:59:56] binasher: ok [22:00:13] also X-Cache: HIT from amssq62.esams.wikimedia.org [22:00:29] binasher: since we are filtering wikidata ('external' changes), might that make sense and how feasible would it be to add an index there? 
[22:00:30] cool, should be happy now :) [22:00:34] LeslieCarr: the result is different at every request, as reported in the bugs [22:00:36] so new installs on apache with puppet result in a shitton of package install errors [22:00:43] wtf happened to the config ;P [22:00:47] rc_type = 5 [22:00:50] well those are for squids and not varnish [22:00:56] i'm specifically concerned about upload [22:00:57] and also would be nice for log entries [22:00:59] since that seemed to be the problem [22:01:01] ? [22:01:51] LeslieCarr: I mean the result of a wget -S 'http://upload.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg' [22:02:10] of course, we should come up with a better way than external type though, since other stuff might use that someday [22:02:45] binasher: Are you familiar with https://noc.wikimedia.org/cgi-bin/report.py ? It's showing a /0 error [22:02:47] LeslieCarr: when there's sq84 or amssq62 in the headers I get junk http://p.defau.lt/?7fEpDhZKokakt6R7OhlZug [22:02:59] but okay for now and maybe makes sense not to have a bazillion filters on recentchanges page [22:03:14] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46877 [22:03:39] neek: it's interesting that you would get those in your headers [22:03:47] because i don't think that they are in upload-lb load balancing [22:03:50] lemme double check [22:04:00] that wasn't a thumb though [22:04:23] change tag perhaps......
[22:04:35] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:04:36] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:06:19] New patchset: Ottomata; "Fixing vodaphone india filter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46879 [22:06:39] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46879 [22:06:42] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:07:54] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:08:01] so that's weird Nemo_bis - i just tried and I got what looks like an old version from that upload [22:08:03] so weird [22:08:26] different md5, from july [22:08:35] ok, binasher, notpeter, LeslieCarr, I just checked up on udp2log things. The only weirdness I could find was a bad filter definition on oxygen [22:08:35] hrm, lemme look at lvs configs [22:08:39] everything else looks ok... [22:08:43] what was broken? [22:08:45] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active [22:08:52] maybe it got upgraded and not restarted properly? [22:09:01] ottomata: i'm guessing it's possible that the updating just required a restart of udp2log ? i dunno, i am looking at varnish [22:09:04] yeah, probably so [22:09:06] i thought i was done with your bug!!! [22:09:14] grrr for unintended upgrades [22:09:22] man, ensure => latest on packages is a baaaaad idea [22:09:25] on dewiki, select count(distinct rc_type) from recentchanges == 3, over 1.2mil rows. hmm.. not sure if an index there would be very useful [22:09:33] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active [22:09:59] hrm, so amssq62 was an upload cache server, but running squid - i thought we were over that...
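[Editor's note: the udp2log breakage above comes from `ensure => latest` in the manifest quoted earlier, which lets apt silently upgrade the package under a running daemon whenever a new .deb lands on the repo. Ottomata's follow-up patch ("Ensuring udp2log package is present, not just latest", r46877) pins the resource; a sketch of the before/after, paraphrased from the snippet in the log rather than copied from the actual change:]

```puppet
# Before: apt upgrades udplog the moment a new .deb appears on
# apt.wikimedia.org, replacing files under the running daemon.
package { 'udplog':
    ensure => latest,
}

# After: install once; upgrades happen only via a deliberate deploy.
package { 'udplog':
    ensure => present,
}
```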
[22:10:02] ottomata: on fluorine, udp2log was running with no config [22:10:09] flourine!? [22:10:11] never even heard of it! [22:10:31] flourine is the newish log server [22:10:37] maybe 6-9 weeks old? [22:10:39] LeslieCarr: I also have hits with sq86, amssq49 [22:10:53] so those hits are correctly redirecting to tampa backends [22:10:57] if they can't find it [22:11:04] but why [22:11:18] ok lvs configs, show me what you got [22:11:31] crazy! its running udp2log? ahhh, ok its not a webrequest udp2log instance [22:11:49] !log aaron synchronized php-1.21wmf8/includes/filerepo/file/OldLocalFile.php 'deployed f884911f5562fc41980664c424e7037bde6ba110' [22:11:50] Logged the message, Master [22:11:58] binasher, is fluorine ok now? [22:12:04] yep [22:12:07] ok phew [22:12:09] phewwwwww [22:12:10] !log aaron synchronized php-1.21wmf8/includes/filerepo/file/LocalFile.php 'deployed f884911f5562fc41980664c424e7037bde6ba110' [22:12:11] Logged the message, Master [22:12:18] oh! [22:12:21] man bad week for me [22:12:25] the squids are still in the rotation [22:12:33] just with a low weight [22:13:13] binasher: would there be technical issues/ would mark murder me if i turned on the two new varnish upload hosts in esams and disabled the squids which seem to be the ones ignoring the cache purge requests ? [22:13:30] LeslieCarr: mark hasn't finished esams uploads to varnish migration due to the servers having h310 controllers [22:13:35] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [22:13:45] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [22:14:22] LeslieCarr: there might be issues.. 
i think it would be better to try to fix the purge handling issue for now, even though those squids will be gone very soon [22:16:00] the varnish hosts are only getting 1/3rd of the eu upload traffic and i don't think they can take the rest [22:16:07] binasher: those squids are only serving junk [22:16:20] Nemo_bis: all of them? [22:16:20] ah h310 controllers [22:16:23] or some of them? [22:16:39] i figured that with the extra 2 it should be not a problem, however h310's, blecch [22:16:40] thanks [22:16:44] binasher: all I encounter when some file should be updated [22:16:47] probably better serving some stale thumbnails than for uploads.wikimedia.org to be down in europe [22:17:04] probably [22:17:09] binasher: https://bugzilla.wikimedia.org/show_bug.cgi?id=42652 should be gone now [22:17:44] Memcached error for key "commonswiki:allpages:ns:6:US_Navy_090731-N-5700G-127_First_lady_Michelle_Obama_delivers_remarks_to_Sailors_and_their_families_at_Naval_Station_Norfolk_during_a_homecoming_celebration_for_the_Dwight_D._Eisenhower_Carrier_Strike_Group.jpg:US_Navy_090810-N-9950J-038_Aviation_Boatswain's_Mate_(Handling)_1st_Class_Michael_Quintos_launches_an_AV-8B_Harrier_jet_aircraft_during_th [22:17:46] e_fly_off_of_Marine_Attack_Squadron_(VMA)_211,_31st_Marine_Expeditionary_Unit_(31st_MEU).jpg" on server ":": A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE [22:17:54] binasher: why do we keep having these? why? :) [22:18:07] hah [22:18:08] looking on amssq62, it's at least receiving plenty of purge messages [22:18:21] AaronSchulz: grrr! [22:19:04] yeah, i'm on amssq47 - it's joined to the group, no big receive queue [22:19:19] which is what i saw on cp3010 (the recv-q) [22:19:26] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [22:19:40] binasher: so someone will fix PopulateFundraisingStatistics::updateDays soonish? 
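[Editor's note: the "BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE" error above is the memcached client rejecting the key itself, most likely on length: the memcached protocol caps keys at 250 bytes and forbids whitespace and control characters, and a key built by concatenating two full-length Commons file titles blows well past the cap. A sketch with a shortened stand-in key (the real key is abbreviated; the validator below is illustrative, not MediaWiki's actual check):]

```python
# memcached's text protocol limits keys to 250 bytes and rejects
# whitespace/control bytes; keys built from long page titles can
# exceed the limit. Stand-in key for illustration only.
MEMCACHED_MAX_KEY = 250

key = ("commonswiki:allpages:ns:6:"
       "US_Navy_090731-N-5700G-127_" + "x" * 300 + ".jpg")

def is_valid_memcached_key(k: str) -> bool:
    data = k.encode("utf-8")
    if len(data) > MEMCACHED_MAX_KEY:
        return False
    # space (0x20), control chars, and DEL are all out of range
    return not any(b <= 32 or b == 127 for b in data)

print(is_valid_memcached_key(key))                  # over-long key -> False
print(is_valid_memcached_key("commonswiki:stats"))  # short key -> True
```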
[22:19:50] * AaronSchulz hates having noise in the logs [22:20:40] New patchset: Reedy; "Update wmgUseTemplateSandbox loader code" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46881 [22:20:58] pgehres: who was going to look at PopulateFundraisingStats again? [22:21:04] maybe i'll just disable the cron job [22:21:05] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46881 [22:21:19] binasher: adam wight, but I guess I can [22:21:47] binasher: any ideas on why squid is being bad ? [22:22:08] do you have a minute to walk through with me or are you busy ? [22:22:21] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [22:22:31] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [22:22:32] Morning TimStarling. Would you have any time to look at/fix the /0 error on https://noc.wikimedia.org/cgi-bin/report.py? Thanks [22:23:12] oh, mark responded to email about it [22:23:15] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [22:23:18] !log tstarling cleared profiling data [22:23:19] Logged the message, Master [22:24:01] LeslieCarr: is there a bugzilla ticket or are there a few example urls around? [22:24:11] Reedy: ok [22:24:11] wget -S --header 'host: upload.wikimedia.org' 'http://upload-lb.esams.wikimedia.org/wikipedia/commons/c/c2/Wappen_Landkreis_Aurich.svg' [22:24:18] LeslieCarr: and have you taken a look at if the tampa squids are getting purges yet? [22:24:43] basically if x-cache is cp3009/10 it servs an new image (see Age header), else serves an old image [22:24:57] good idea to doulbe check they are listening [22:25:44] LeslieCarr: that example.. 
[22:25:45] X-Cache-Lookup: HIT from sq84.wikimedia.org:3128 [22:25:46] Age: 777534 [22:25:48] yep, checked one of the "bad" squids and it is definitely subscribed 4827, seeing the traffic [22:25:58] checked sq84 [22:26:06] the varnishes that are ok use eqiad as their backend [22:26:11] New patchset: Reedy; "Combine CustomData loader code" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46883 [22:26:33] tcpdump -c 10 host 239.128.0.112 and port 4827 is a good way to tell [22:26:43] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46883 [22:26:58] well they're listening, i guess that says nothing about properly acting on requests [22:27:43] notpeter: can you deploy https://gerrit.wikimedia.org/r/#/c/46824/1 and I can monitor ganglia/gdash? [22:30:44] paravoid: are you planning to merge that small swift patch if you get the chance? [22:32:30] LeslieCarr: wait, for the example you gave above (Datei:Wappen_Landkreis_Aurich.svg), the long cached version in a pmtpa squid has X-Object-Meta-Sha1base36: fdy5w7grhgehvqu7gt0yv8c0epe8606 [22:32:42] and the newer version in an eqiad varnish does too [22:33:00] i md5'ed them and they are different file sthough [22:33:09] oh shall I solve the puzzle? [22:33:11] what did the ETag say? [22:33:15] squid.conf.php:htcp_clr_access allow tiertwo [22:33:15] for $500, alex ! [22:33:29] for $500 i'll bet eqiad is not included in acl tiertwo :/ [22:33:45] hah [22:34:11] squid.conf.php:printf("%-50s %s", "acl tiertwo src $subnet", "# $destCluster\n" ); [22:34:19] ah :) [22:34:31] PROBLEM - NTP on mw23 is CRITICAL: NTP CRITICAL: Offset unknown [22:34:41] so is this hand created on sockpuppet ? 
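[Editor's note: the purge puzzle resolves to a one-line ACL gap. Squid gates HTCP CLR (purge) messages with `htcp_clr_access allow tiertwo`, and the generated `acl tiertwo src` list covered the pmtpa private subnet but not eqiad's 10.64.0.0/16, so purges originating from eqiad app servers were silently dropped and the European squids kept serving stale images. The subnet mismatch is easy to check (the specific host addresses below are illustrative):]

```python
import ipaddress

# The squid ACL before the fix: only the pmtpa private subnet.
tiertwo = [ipaddress.ip_network("10.0.0.0/16")]

def acl_allows(src: str) -> bool:
    """Would a purge from this source IP pass the tiertwo ACL?"""
    ip = ipaddress.ip_address(src)
    return any(ip in net for net in tiertwo)

print(acl_allows("10.0.5.1"))    # pmtpa source: purge accepted -> True
print(acl_allows("10.64.0.12"))  # eqiad source: purge dropped -> False

# After adding 10.64.0.0/16 (the fix deployed below), eqiad purges pass:
tiertwo.append(ipaddress.ip_network("10.64.0.0/16"))
print(acl_allows("10.64.0.12"))  # -> True
```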
[22:34:42] acl tiertwo src 10.0.0.0/16
[22:34:43] New patchset: Reedy; "Remove unused $wgDebugLogGroups entries" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46884
[22:34:44] heh
[22:34:56] add 10.64.0.0/16 to that
[22:35:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46884
[22:36:00] sad how _purging_ broke in at least 3 different ways after the eqiad migration ;(
[22:36:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44281
[22:36:30] yeah, i'm adding it
[22:37:12] mark: we're still counting them indeed
[22:37:28] * mark goes offline again
[22:38:28] !log reedy synchronized wmf-config/
[22:38:29] Logged the message, Master
[22:38:50] thanks mark :)
[22:38:51] bye
[22:39:35] mark: thanks
[22:40:25] !log deployed new text/upload squid confs with eqiad in proxySubnets
[22:40:26] Logged the message, Master
[22:40:33] thanks binasher
[22:40:36] thanks mark!
[22:42:13] New patchset: Tim Starling; "Profiling: guard against division by zero" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46886
[22:43:07] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46886
[22:43:40] PROBLEM - Backend Squid HTTP on sq60 is CRITICAL: Connection refused
[22:43:50] PROBLEM - Backend Squid HTTP on cp1010 is CRITICAL: Connection refused
[22:43:50] PROBLEM - Backend Squid HTTP on sq49 is CRITICAL: Connection refused
[22:44:00] PROBLEM - Backend Squid HTTP on sq75 is CRITICAL: Connection refused
[22:44:20] PROBLEM - Backend Squid HTTP on sq71 is CRITICAL: Connection refused
[22:44:21] PROBLEM - Backend Squid HTTP on sq42 is CRITICAL: Connection refused
[22:44:30] PROBLEM - Backend Squid HTTP on sq77 is CRITICAL: Connection refused
[22:44:42] PROBLEM - Backend Squid HTTP on sq60 is CRITICAL: Connection refused
[22:45:09] PROBLEM - Backend Squid HTTP on cp1010 is CRITICAL: Connection refused
[22:45:50] RECOVERY - Backend Squid HTTP on sq49 is OK: HTTP OK: Status line output matched 200 - 495 bytes in 0.053 second response time
[22:45:54] Reedy: there's a temporary fix in place, still working on the root cause
[22:46:02] Thanks
[22:46:20] RECOVERY - Backend Squid HTTP on sq42 is OK: HTTP OK: Status line output matched 200 - 495 bytes in 0.054 second response time
[22:46:21] PROBLEM - Backend Squid HTTP on sq77 is CRITICAL: Connection refused
[22:46:22] PROBLEM - Backend Squid HTTP on sq75 is CRITICAL: Connection refused
[22:46:22] PROBLEM - Backend Squid HTTP on sq71 is CRITICAL: Connection refused
[22:46:32] TimStarling: operations-software/udpprofile/web/extractprofile.py and puppet/files/cgi-bin/noc/extractprofile.py are now rather out of sync
[22:47:51] pity puppet doesn't support git as a file source
[22:48:15] we could use a submodule, I guess...
[22:49:48] PROBLEM - NTP on mw23 is CRITICAL: NTP CRITICAL: Offset unknown
[22:51:39] binasher: AaronSchulz: this should fix the populateFundraiserWhataver bug https://gerrit.wikimedia.org/r/#/c/46887/1
[22:52:50] pgehres: wtf to the left side of the diff, and thank you! for the right side
[22:52:50] TimStarling: supports rsync
[22:53:12] oh, that's an improvement
[22:53:42] binasher: yeah, my thoughts too and all i did was poke adam. if that is proper MW db stuff, then pls merge and push to hume
[22:53:47] pgehres: I don't get that diff
[22:53:58] there is no 'REPLACE' option in db.php
[22:54:07] lol
[22:54:10] sadness
[22:54:13] pgehres: You know there's a replace function?
[22:54:16] dbw->replace
[22:54:24] * pgehres did not write this
[22:54:35] * pgehres wonders if adam is stuck in drupal world
[22:55:19] awight: is that drupal syntax and not MW syntax?
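[Editor's note] The fix being suggested here, using `$dbw->replace` rather than a nonexistent 'REPLACE' option, relies on SQL REPLACE semantics: on a unique-key collision the old row is deleted and the new one inserted, so re-running the report for the same day is safe. A minimal illustration of that semantic using Python's stdlib sqlite3; the table and values are invented for the sketch, and this is not MediaWiki's or the fundraising script's actual code.

```python
import sqlite3

# Illustration (not MediaWiki code) of REPLACE semantics: on a unique-key
# conflict the old row is replaced rather than raising a duplicate-key error.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE totals (day TEXT PRIMARY KEY, amount INTEGER)")
db.execute("INSERT INTO totals VALUES ('2013-01-28', 100)")

# Re-running the report for the same day with updated numbers: REPLACE
# overwrites the stale row, which is why the script wants it (donation
# totals keep changing for 3-40 days after the fact).
db.execute("INSERT OR REPLACE INTO totals VALUES ('2013-01-28', 250)")
row = db.execute("SELECT amount FROM totals WHERE day = '2013-01-28'").fetchone()
print(row[0])  # 250
```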
[22:55:39] supposedly it should be "dbw->replace"
[22:55:54] wgDebugLogGroups
[22:55:55] ok, thanks
[22:55:57] fail
[22:56:00] function replace( $table, $uniqueIndexes, $rows, $fname = 'DatabaseMysql::replace' ) {
[22:56:01] return $this->nativeReplace( $table, $rows, $fname );
[22:56:01] }
[22:56:04] "supposedly"? ;)
[22:56:23] * pgehres has barely touched core
[22:56:27] passing any MySQL options to insert as the 4th param will work as well, but i like the "apized" thing
[22:57:37] does this thing calculate fr totals, etc for every day on every run?
[22:57:50] looks like it might ...
[22:57:56] INSERT REPLACE..
[22:58:04] not doing that would also fix it :)
[22:58:23] sure, but i imagine that was done because values change, a lot
[22:58:39] we have a 3-40 day delay on donations :-(
[22:58:47] ah, makes sense
[22:58:53] replace away!
[22:59:08] i would, however, think that we could limit it to the past 6mo
[22:59:14] or less
[23:01:48] RECOVERY - Backend Squid HTTP on cp1010 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.002 second response time
[23:02:19] RECOVERY - Backend Squid HTTP on sq71 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.057 second response time
[23:02:24] RECOVERY - Backend Squid HTTP on sq71 is OK: HTTP OK HTTP/1.0 200 OK - 1249 bytes in 0.062 seconds
[23:03:00] RECOVERY - Backend Squid HTTP on cp1010 is OK: HTTP OK HTTP/1.0 200 OK - 1258 bytes in 0.055 seconds
[23:03:08] RECOVERY - Backend Squid HTTP on sq75 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.107 second response time
[23:03:18] RECOVERY - Backend Squid HTTP on sq60 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.109 second response time
[23:03:28] RECOVERY - Backend Squid HTTP on sq77 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.107 second response time
[23:03:47] RECOVERY - Backend Squid HTTP on sq77 is OK: HTTP OK HTTP/1.0 200 OK - 1258 bytes in 0.004 seconds
[23:03:54] RECOVERY - Backend Squid HTTP on sq75 is OK: HTTP OK HTTP/1.0 200 OK - 1258 bytes in 0.011 seconds
[23:03:55] RECOVERY - Backend Squid HTTP on sq60 is OK: HTTP OK HTTP/1.0 200 OK - 1258 bytes in 0.004 seconds
[23:09:25] well it looks like icinga is doing well, it's time for phase 2 of kilnagios
[23:17:03] New patchset: Lcarr; "adding in ganglios check to icinga checkcommands" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46889
[23:17:23] LeslieCarr: will icinga notify us that nagios is dead?
[23:17:26] hehe
[23:17:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46889
[23:19:27] hi LeslieCarr
[23:19:38] was looking for you yesterday
[23:19:43] oh ?
[23:19:45] what's up ?
[23:19:48] i was out yesterday
[23:19:51] making up for the weekend a bit
[23:19:54] fine, thanks
[23:20:29] well, I'm adding some code to the nagios part of puppet
[23:20:46] but I'm a nagios expert, not a puppet expert
[23:21:00] wanted to know if you would like to give a hand there
[23:21:17] i can a little bit - however we're moving to icinga
[23:21:24] i'm actually planning on trying to move puppet over
[23:21:32] it is the same for me
[23:21:33] (well, if i hadn't realized that the ganglios module wasn't working)
[23:22:00] icinga is backwards compatible anyway
[23:23:17] yeah
[23:25:09] so do you have time to look into it?
[23:26:33] not really right now - if you send me a changeset i might be able to ? i'm trying to debug ganglios so we can do the switchover
[23:26:40] :(
[23:26:46] that's ok
[23:27:00] I'll try to find a mentor :)
[23:28:26] :)
[23:28:32] have you tried #wikimedia-labs ?
[23:31:07] I did, my instance isn't ready
[23:31:33] the labs are down, due to a gluster issue
[23:31:41] oh doh
[23:33:46] though speaking of needing help - oh python folks --- so the ganglios module is in /usr/share/pyshared on neon, yet for some reason python cannot "find" it ? "ImportError: No module named ganglios.ganglios"
[23:33:55] and of course, on spence, working great
[23:34:01] this is 2.7 versus 2.6
[23:40:14] LeslieCarr: python -c 'import sys;print repr(sys.path)'
[23:40:31] * jeremyb doesn't know offhand what pyshared is
[23:40:41] how did ganglios get there? from a package?
[23:41:04] LeslieCarr: can you replicate in labs?
[23:41:11] from a package
[23:41:20] comparing neon and spence now
[23:41:34] which package?
[23:41:36] weird, it's not in either
[23:41:37] ganglios
[23:42:06] run that line i gave you? :)
[23:42:18] yeah i did - neon ['', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', '/usr/lib/pymodules/python2.7']
[23:42:25] spence - ['', '/usr/lib/python2.6', '/usr/lib/python2.6/plat-linux2', '/usr/lib/python2.6/lib-tk', '/usr/lib/python2.6/lib-old', '/usr/lib/python2.6/lib-dynload', '/usr/lib/python2.6/dist-packages', '/usr/lib/pymodules/python2.6', '/usr/local/lib/python2.6/dist-packages']
[23:43:18] oh wait
[23:43:27] it's in /usr/lib/pymodules/python2.6 on spence
[23:43:30] lemme check on neon
[23:44:26] oh weird, in neon it's in /usr/share/pyshared/ganglios
[23:44:29] how did that happen
[23:44:32] bad package
[23:44:34] no cookie
[23:45:32] thank you jeremyb :)
[23:45:45] idk what i did :)
[23:46:09] your amazing presence ... :P
[23:46:49] * jeremyb decides it's nearly 2am in matanya's land
[23:47:02] it is
[23:47:05] * jeremyb is sleepy here
[23:47:15] bye
[23:47:17] best time for server upgrades
[23:47:21] night jeremyb
[23:53:28] g'night
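[Editor's note] The sys.path diagnostic used above generalizes: Python can only import a package whose directory sits under a sys.path entry, which is why ganglios in /usr/share/pyshared (absent from neon's path) raised ImportError while /usr/lib/pymodules/python2.6 (on spence's path) worked. A small modern sketch of checking this; stdlib json stands in for ganglios, and nothing here is the actual ganglios package.

```python
import sys
import importlib.util

# A module is importable only if it sits under some entry of sys.path;
# importlib can report exactly which file a module would be loaded from.
def locate(name):
    """Return the file a module would load from, or None if not importable."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

print(sys.path[:3])       # the first few search entries, in order
print(locate("json"))     # a path under one of the sys.path entries
print(locate("ganglios")) # None here, unless the package is installed
```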