[00:01:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:11:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.541 seconds [00:33:26] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:33:26] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:47:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:01:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds [01:14:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 320 seconds [01:16:35] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 7 seconds [01:25:53] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [01:28:17] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 66.01 ms [01:34:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:07] jeremyb: yes. it's segfaulting [01:44:15] I need to run it in gdb [01:45:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.680 seconds [01:53:29] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [01:53:29] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [01:53:29] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [01:53:29] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [01:53:29] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [01:53:29] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:04:17] PROBLEM - MySQL disk space on db78 is CRITICAL: DISK CRITICAL - free space: /a 116492 MB (3% inode=99%): [02:20:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:27] !log LocalisationUpdate completed (1.21wmf6) at Wed Jan 2 02:27:26 UTC 2013 [02:27:38] Logged the message, Master [02:36:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [02:43:42] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:55:24] RECOVERY - Puppet freshness on neon is OK: puppet ran at Wed Jan 2 02:55:16 UTC 2013 [03:01:42] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours [03:03:39] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [03:10:15] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [03:13:42] PROBLEM - Puppet freshness on solr1003 is CRITICAL: Puppet has not run in the last 10 hours [03:13:42] PROBLEM - Puppet freshness on solr3 is CRITICAL: Puppet has not run in the last 10 hours [03:14:36] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not run in the last 10 hours [03:20:42] ah [03:21:53] PROBLEM - Varnish HTCP daemon on cp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:23:32] RECOVERY - Varnish HTCP daemon on cp1021 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [03:28:02] RECOVERY - Puppet freshness on sq81 is OK: puppet ran at Wed Jan 2 03:27:50 UTC 2013 [03:39:53] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds [03:40:21] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 214 seconds [03:48:53] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [03:49:02] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [03:54:55] !b 43112 | gerrit-wm needs a boot! [03:54:55] gerrit-wm needs a boot!: https://bugzilla.wikimedia.org/43112 [04:02:59] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [04:27:26] PROBLEM - Puppet freshness on srv191 is CRITICAL: Puppet has not run in the last 10 hours [04:27:26] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:53] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.056 seconds [04:31:56] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 201 seconds [04:32:14] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 206 seconds [04:45:22] New patchset: Ori.livneh; "Enable PostEdit on az, be_x_old, eo, pms, si & uk." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41917 [04:46:05] New patchset: Ori.livneh; "Enable PostEdit on az, be_x_old, eo, pms, si & uk." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41917 [04:56:23] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [05:03:35] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [05:03:53] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [05:05:23] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [05:30:07] * jeremyb waves to gerrit-wm! [05:34:56] hello -- any one knows total number of concurrent requests that wikipedia servers handle? [05:37:48] Many. [05:38:07] any approximate numbers? [05:38:28] Susan: this is being discussed already in #wikimedia-tech [05:38:44] Mahmoud: you were given a guess but ori's not certain if that number is really what he thinks it is [05:38:58] (the 141k from the /topic here) [05:39:48] 09:32 < ori-l> Mahmoud: IIRC the number in the topic of #wikimedia-operations is it [05:40:09] i didn't check the -tech channel since then :) [05:42:26] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [06:29:12] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [07:05:43] apergos: no !log ? ;-) [07:05:49] nope [07:09:57] New review: MZMcBride; "So should this changeset be abandoned, then?" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/13293 [07:10:03] apergos: ^ [07:10:43] apergos: It's curious that the bot keeps going quiet. I wonder if it makes sense to keep the bug open until that's diagnosed. [07:10:51] I didn't close it [07:10:56] I know. [07:11:30] that was deliberate [07:11:47] Heh, just got closed. 
[07:11:52] if I had been a bit more awake I would have read the scrollback here, [07:12:05] seen ryan's comment about the segfault and looked into that first [07:12:07] but I wasn't [07:12:38] apergos: the segfault is on virt0 i think [07:12:41] fwiw [07:12:45] oh, nm then [07:12:48] :-D [07:12:58] in that case leave it closed [07:13:16] * jeremyb is a little confused... why is jenkins doing merges? [07:13:40] !g 41912 | e.g. [07:13:41] e.g.: https://gerrit.wikimedia.org/r/#q,41912,n,z [07:13:59] and I don't know if this changeset should be abandoned, ask the last two people who commented on it [07:14:01] apergos: I re-opened it. [07:14:03] With a comment. [07:14:25] apergos: I was asking them. I was just pointing out that gerrit-wm was working again. [07:14:31] :-) [07:15:01] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [07:15:01] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [07:15:20] I see there was a net split that resolved around the time gerrit wm went silent again [07:15:20] jeremyb: Yeah, not sure what's going on with that. [07:15:41] that's a known issue with ircbot or whatever it's called [07:15:58] affects more than just gerrit-wm and probably won't get resolved anytime soon [07:16:25] Ah, okay. [07:16:44] If there's a bug open about that, the gerrit-wm bug can be closed with a pointer. [07:16:48] so search for that in bz, if there isn't something about one of the bots and netsplits, make one, otherwise you could close up the gerrit wm one [07:16:53] yep [07:16:53] Heh. [07:17:22] hm I typed 'as dup' and it lost that in the line [07:17:23] weird [07:17:25] anyways.... [07:18:52] errr, so why did i not get notifs about the last 2 comments on bug 43112? 
[07:19:51] * jeremyb returns to sleep :) [07:21:42] I would return to sleep if only it weren't 9 am :-D [07:21:53] * apergos goes foraging for some small breakfast-like item [08:01:59] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 188 seconds [08:02:17] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 195 seconds [08:40:12] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [08:40:12] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [08:48:24] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/40795 [09:03:27] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [09:11:24] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [09:11:24] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [09:11:24] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [09:11:24] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [09:15:27] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 1 seconds [09:15:36] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:36:20] !log krinkle synchronized php-1.21wmf6/resources/mediawiki.page/mediawiki.page.watch.ajax.js 'I2b1b34c9' [09:36:31] Logged the message, Master [10:34:21] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:34:21] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:59:38] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [11:26:28] New patchset: Ori.livneh; "Bits VCL for EventLogging: require query to match" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41942 [11:47:38] New patchset: Ori.livneh; "EventLogging varnishncsa: require query to match" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41942 [11:54:59] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [11:54:59] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [11:54:59] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [11:54:59] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [11:54:59] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [11:55:00] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:57:06] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [13:03:07] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours [13:05:03] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [13:15:06] PROBLEM - Puppet freshness on solr1003 is CRITICAL: Puppet has not run in the last 10 hours [13:15:06] PROBLEM - Puppet freshness on solr3 is CRITICAL: Puppet has not run in the last 10 hours [13:16:00] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not run in the last 10 hours [13:24:15] RECOVERY - Puppet freshness on neon is OK: puppet ran at Wed Jan 2 13:23:49 UTC 2013 [13:50:29] ori-l: why is vanadium's listener address on the same line as the title and kraken's on 
the next line? [13:52:12] heh, i see you reworked the commit msg [14:04:09] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [14:11:41] can someone restart labsconsole's memcached which appears to be down again? [14:13:51] yes [14:14:07] done [14:15:33] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.001 second response time on port 11000 [14:21:14] thanks! [14:29:03] PROBLEM - Puppet freshness on srv191 is CRITICAL: Puppet has not run in the last 10 hours [14:34:15] !log MaxSem> can someone restart labsconsole's memcached which appears to be down again? 14.14 paravoid> done [14:34:21] looks worth logging [14:34:25] Logged the message, Master [14:34:36] thanks, although it happens so often that it probably isn't [14:41:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:48:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [14:57:26] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [15:05:10] New patchset: Ottomata; "Setting up .htpasswd file for E3 and metrics-api.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41955 [15:06:00] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41955 [15:06:26] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [15:06:26] cmjohnson1: morning [15:06:33] cmjohnson1: got a minute? [15:07:19] good morning paravoid..yes [15:07:27] happy new year :) [15:07:46] hope you had fun [15:07:53] New patchset: Ottomata; "Fixing 'content' parameter name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41956 [15:08:00] :-] same to you....no, i have kids...asleep b4 midnight [15:08:17] hehe [15:08:22] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41956 [15:08:49] I filed a ticket yesterday (yes, yesterday...) about ms-be1003's cable [15:09:49] k..i see it...i can replace easily [15:10:09] do you want to keep the server on or power down? [15:10:13] on [15:10:18] ethernet cable [15:10:32] okay..i need to go to storage and get the cable...i will ping you b4 i swap it [15:10:41] thanks! [15:10:54] 100mbit is just not enough for ms-bes :) [15:11:13] New patchset: Ottomata; "Need to use single quotes for .htpasswd content." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41957 [15:11:30] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41957 [15:15:18] paravoid: the cable is ready to swap....we could bond eth0/eth1 [15:15:26] no, no reason to [15:15:28] just swap it [15:15:37] k..doing that now [15:15:50] !log swapping ethernet cable for ms-be1003 [15:15:59] Logged the message, Master [15:16:56] paravoid: done [15:17:16] cmjohnson1: perfect, renegotiated at 1000 [15:17:17] thanks [15:17:22] yw [15:20:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:48] paravoid, I've got a volunteer who has problems logging in Gerrit [15:20:59] MaxSem: on IRC? [15:21:07] yes [15:21:21] madcaplaughs, ^^^ [15:21:21] MaxSem: not a very good time now I'm afraid... [15:21:25] although... [15:22:04] madcaplaughs: symptoms? 
[15:22:17] not letting me log in [15:22:37] says cookie must be enabled which is enabled on my browser [15:22:45] so, I've done this only once before, and can't remember the details [15:22:52] how does one go about adding a new domain? [15:22:59] I need to add metrics-api. for E3 [15:23:02] pointed at stat1001 [15:23:15] ottomata: dobson maybe? wildish guess [15:23:19] If there isn't already a page on wikitech I'll write one this time :p [15:23:27] ottomata: there's probably a local repo there that's not on gerrit [15:23:54] madcaplaughs: try clearing all gerrit.wikimedia.org cookies and start again [15:24:11] madcaplaughs: also, tried multiple browsers? [15:24:57] tried on chrome and firefox [15:26:15] username? [15:26:20] now its saying incorrect username and password [15:26:28] username: Debarshiban [15:26:39] thanks jeremyb, i think that's a good guess, but not it. I found an old pdns .bak dirertory there, but not much else [15:26:53] there is a pdns-templates svn repo on sockpuppet [15:27:09] i think that's it, but i'm not sure where I'd update it to get the changes live [15:27:11] maybe sockpuppet then [15:27:12] paravoid? ^ [15:27:27] ottomata: i do have the answer in my irc logs if you really need it ;) [15:27:31] haha [15:27:37] its not an emergency at all [15:27:39] hm? [15:27:49] paravoid: where's the canonical home for DNS? [15:27:53] paravoid: to edit [15:27:53] wondering how to add a new domain name [15:27:55] was going to: [15:28:04] add a CNAME entry in pdns-templates/wikimedia.org [15:28:33] there's quite good doc on the wiki [15:28:37] wikitech [15:28:49] ah! good, i searched but could only find info about setting up DNS as a service etc. [15:30:02] ahhhh wait [15:30:02] i found it [15:30:04] its in that doc [15:30:07] okokok [15:30:09] http://wikitech.wikimedia.org/view/Dns#Changing_records_in_a_zonefile [15:33:10] thanks paravoid. As the doc there says, could you review [15:33:10] /tmp/dns.diff [15:33:10] on sockpuppet for me? [15:33:53] sure [15:34:49] looks good, but use tabs instead of spaces [15:35:05] not that I have anything against spaces, but the rest of the file is like that [15:35:08] so let's be consistent [15:35:41] the resolvers section there is wrong. at least i'm pretty sure lily is dead [15:35:46] oop, yeah ok [15:36:17] ok thanks, with tabs then, i'm going to commit it and do the rest [15:36:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.015 seconds [15:36:43] ottomata: note that we have no checkzones or anything (I've tried, but dobson is... hardy) [15:36:54] so if you make a typo, DNS goes down [15:37:08] just saying, be careful :) [15:37:18] haha, yikes, ok, want to look once more over /tmp/dns.diff then? [15:38:14] +1 [15:38:31] DNS is on the immediate TODO btw [15:38:53] doing what? [15:39:00] refactor it in general [15:39:02] ah, aye [15:39:05] cool [15:39:15] git/gerrit, precise, more flexible scenarios (think ulsfo) [15:39:19] ipv6 [15:39:30] it needs some love [15:40:40] aye [15:40:43] ok cool, looks good! [15:43:20] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [16:10:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:16] paravoid, since you're on duty - how's your Squid-foo? 
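On the metrics-api DNS change discussed above: the actual /tmp/dns.diff is not shown in the log, but the kind of edit being reviewed would look roughly like the sketch below. The record name, TTL and field layout here are assumptions, illustrating only the tab-separated style paravoid asks for; and since, as he notes, there is no automatic zone check, running a BIND-style checker by hand before pushing is a cheap safeguard where the tool is available.

    ; hypothetical addition to pdns-templates/wikimedia.org (tab-separated, note the trailing dot)
    metrics-api	1H	IN	CNAME	stat1001.wikimedia.org.

    # optional local syntax check of the edited zonefile (named-checkzone ships with BIND)
    named-checkzone wikimedia.org wikimedia.org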
[16:14:25] paravoid, can you take a look at https://bugzilla.wikimedia.org/show_bug.cgi?id=35215#c11 [16:20:32] MaxSem: I think I remember a discussion about this [16:20:40] Asher nak'ed iirc [16:21:56] <^demon> Apparently the market for "BZ client apis written in Java" is really small. There's like 3, and they all suck. [16:21:58] <^demon> Who knew. [16:22:02] mmm, http://www.urbandictionary.com/define.php?term=nak [16:22:14] updating ticket [16:22:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.053 seconds [16:22:44] MaxSem: http://en.wikipedia.org/wiki/NAK_%28protocol_message%29 [16:22:50] sorry for the slang :) [16:23:44] meh, why didn't anything reach us? [16:23:52] I'll poke him, thanks [16:24:37] Arthur mailed ops@ and Asher replied to him Cc'ing ops@ [16:24:44] but I updated bz now [16:30:17] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [16:46:33] New patchset: Cmjohnson; "Updating mac address for solr1002. H/W change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41963 [16:47:02] New review: Cmjohnson; "looks good to me" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/41963 [16:47:03] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41963 [16:49:36] ^demon: for gerrit? [16:49:44] <^demon> Yeah. [16:49:48] <^demon> I'm working on a BZ plugin [16:49:50] did madcaplaughs get sorted? [16:49:56] ^demon: to show summary/status? [16:51:46] <^demon> madcaplaughs? [16:51:47] <^demon> huh. [16:54:33] !log pulling sfp uplink module on asw-c-eqiad [16:54:42] Logged the message, Mistress of the network gear. [16:55:14] 02 15:24:57 < madcaplaughs> tried on chrome and firefox [16:55:14] 02 15:26:20 < madcaplaughs> now its saying incorrect username and password [16:55:32] ^demon: can't log in to gerrit. but that question wasn't really for you. unless you have some ideas [16:55:57] yeah i could login [16:56:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:56:08] madcaplaughs: so what's the latest then? [16:56:08] thanks, i cleared the cookies, it worked :) [16:56:18] ok, great [16:56:33] he still can't commit though [16:56:39] i am having a minor problem with my git [16:56:42] trying to sort it out [16:56:46] <^demon> Generally "can't login" should be debugged by "Clear your session & cookies" and "Make sure you're using CN not SN" [16:56:49] well, i have the same question then: symptoms? [16:57:12] its saying permission denied (publickey) [16:57:26] <^demon> jeremyb: Well, the api developer is saying he's going to fix it this weekend :) [16:57:28] madcaplaughs: when you do what? [16:57:44] ^demon: erm, SN===CN in this case [16:57:44] when try ssh username [16:58:00] madcaplaughs: err, what's the whole command? [16:58:04] <^demon> jeremyb: I was speaking generally :) [16:58:10] ^demon: yeah :) [16:58:22] ^demon: 02 15:23:53 < jeremyb> madcaplaughs: try clearing all gerrit.wikimedia.org cookies and start again [16:59:06] ssh username@gerrit.wikimedia.org -p 29418 [16:59:19] and where did you submit your key? [17:00:09] i did a add-ssh with the path to my key [17:00:27] er, huh? [17:00:55] what is this add-ssh you speak of? [17:02:03] ssh-add [17:02:10] oh, locally [17:02:20] mutante's alive! [17:02:29] <^demon> Ok, I take back anything bad I said about this library being bad. 
The author is *awesome* [17:02:47] mutante: freues neues jahr did i butcher that? ;-) [17:02:50] <^demon> He already said he's going to fix my bug by this weekend, and then he e-mailed me to wonder what I was using his library for :) [17:03:04] <^demon> And wanting to know if I had other problems. [17:03:14] jeremyb: thanks. :) s/freues/frohes/g :) [17:03:18] you too [17:03:34] yeah locally [17:03:37] wow, 2 out of 3 [17:04:26] jeremyb: froh = glad (adjective), freuen = to gladden (verb) [17:04:43] * jeremyb also heard glicklickes (sp?) [17:04:46] sorry my bad i meant ssh-add [17:05:02] madcaplaughs: well i'll be back for a short bit in 10ish mins and will see if anyone else has figured it out by then. probably you need to pastebin something like `ssh -vv username@gerrit.wikimedia.org -p 29418 gerrit --help` [17:05:10] * jeremyb runs away [17:05:27] jeremyb: http://en.wiktionary.org/wiki/gl%C3%BCcklich [17:05:45] damn umlauts [17:06:10] hehe, check http://en.wikipedia.org/wiki/Metal_umlaut [17:07:22] ^demon: are you working on gerrit-bugzilla integration? :o [17:07:29] <^demon> Yes. [17:09:20] <^demon> I'm having trouble getting it committed though. [17:10:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [17:10:37] ^demon: you're awesome :-O [17:15:49] <^demon> Nemo_bis: https://gerrit-review.googlesource.com/#/c/40750/ :) [17:16:20] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:16:20] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [17:22:36] hey paravoid [17:22:41] hi [17:22:45] best wishes for 2013! [17:22:52] you too :) [17:23:41] would you maybe have some time to help me and ottomata with deploying some changes to varnish/squid/ log format, in particular we need to add the X_Carrier http header [17:25:13] what do you guys need? [17:25:43] this: [17:25:44] https://gerrit.wikimedia.org/r/#/c/12188/ [17:26:19] I saw that [17:26:23] this has +1 from a number of people [17:27:20] so, what do you need? [17:27:25] yeah, it just needs done, they wanted to wait til after the fundraiser [17:27:37] i don't feel comfortable deploying this one myself [17:27:38] it needs a babysitter [17:27:53] it pushes changes to all varnishes, squids, and nginxes [17:30:50] the changeset is only for varnish & nginx [17:31:00] "only" :-) [17:32:20] right, the squid confs aren't in puppet [17:32:29] the commit message says: [17:32:35] Since the frontend.conf.php squid config template is not checked into [17:32:35] puppet, I cannot include the change to that file as part of this commit. [17:32:35] I've stored the patch in my home directory on fenari: [17:32:35] fenari:/home/otto/frontend.conf.php.accept_language_x_carrier_headers_in_log.patch [17:32:35] This needs to be applied to /home/w/conf/squid/frontend.conf.php. [17:32:56] (I made that patch back in June though, and I don't know if frontend.conf.php has changed since then) [17:35:20] so [17:35:25] how do you want to play this? [17:35:37] want to drive this? I can back you up :) [17:35:55] !log db1014 removing bad hdd from slot11 to replace w/new rt 4039 [17:36:05] Logged the message, Master [17:41:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [17:56:24] hey, sorry, paravoid, yeah sure! 
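Since the squid template lives outside puppet, the squid half of this deploy presumably means applying ottomata's stored patch by hand. A minimal sketch, assuming GNU patch and that the diff was taken directly against frontend.conf.php (the strip level, and therefore any -p option, is unknown); a dry run comes first because the patch dates from June and the template may have drifted since:

    # check whether the June-era patch still applies cleanly before touching the live template
    patch --dry-run /home/w/conf/squid/frontend.conf.php /home/otto/frontend.conf.php.accept_language_x_carrier_headers_in_log.patch

    # if the dry run is clean, apply for real, keeping a backup of the original
    patch -b /home/w/conf/squid/frontend.conf.php /home/otto/frontend.conf.php.accept_language_x_carrier_headers_in_log.patch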
[17:56:42] we have a little meeting now, can we do this in 30 mins or an hour? [17:57:07] sure [17:57:19] let's talk again then [17:57:40] New patchset: Reedy; "Update symlinks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41965 [17:58:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41965 [18:01:38] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90?authuser=1 [18:01:42] oops [18:01:45] wrong chat [18:08:14] New patchset: Aaron Schulz; "Captchas back to swift for testwikis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41966 [18:08:46] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41966 [18:09:11] lesliecarr: can you check to see if the port is enabled on asw-a7-eqiad 0/32 please (solr1002) [18:09:26] sure [18:09:58] !log aaron synchronized wmf-config/CommonSettings.php [18:10:08] so, you can also check yourself -- if you do "show interface description | match 7/32 " on the switch [18:10:08] Logged the message, Master [18:10:31] or actually 7/0/32 [18:10:49] it's up and in the private vlan [18:12:05] !log reedy synchronized php-1.21wmf7/ 'Initial sync' [18:12:10] New patchset: Bouron; "Added Babel category names for Ossetian Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41967 [18:12:14] Logged the message, Master [18:12:45] Can someone please power cycle srv191? [18:12:45] reedy@fenari:/home/wikipedia/common$ ssh srv191 [18:12:45] ssh_exchange_identification: Connection closed by remote host [18:13:14] reedy: i got it [18:13:24] !log reedy synchronized wmf-config/ [18:13:32] Logged the message, Master [18:13:34] Thanks [18:14:03] !log powercycling srv191 [18:14:12] Logged the message, Master [18:14:44] !log reedy synchronized live-1.5/ [18:14:52] Logged the message, Master [18:16:28] RECOVERY - SSH on srv191 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:16:49] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki to 1.21wmf7 before rebuilding message cache [18:16:58] Logged the message, Master [18:17:17] !log Running sync-common on srv191 [18:17:25] Logged the message, Master [18:23:09] !log aaron synchronized php-1.21wmf6/extensions/ConfirmEdit 'deployed 74c5543fead84a8460be51ffa0b16104cfc3abd1' [18:23:18] Logged the message, Master [18:24:25] Can someone also run this on fenari for me please? chown mwdeploy /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php [18:26:15] !log reedy Started syncing Wikimedia installation... : Rebuild message cache for 1.21wmf7 [18:26:23] Logged the message, Master [18:29:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:42] quarterly uptime report shows DNS uptime 100% [18:31:51] i wish this would stop all the vendors from trying to sell me dns services. [18:32:30] robh: i see that mdadm spam ... i was leaving it for paravoid or mark since I know they're working on ms-be1001+ [18:32:46] oh [18:32:47] missed that [18:32:51] sorry [18:32:58] not a real issue thouhg [18:33:59] oh, man, RobH and cmjohnson1 together at the same time! [18:34:08] any word about analytics1007 from Cisco? [18:35:09] ottomata: nothing new...i haven't looked at it much since i got to eqiad. [18:35:29] as in, no word from them at all in over a month? 
[18:36:08] i did not contact cisco...i don't see in the ticket if robh did or not...(he may be able to answer that) [18:36:55] fatal: Not a git repository: /tmp/new-mw-clone-1579598498/mw/.git/modules/extensions/AbuseFilter [18:36:59] Reedy: what is that on about? [18:37:16] Where are you seeing that? [18:37:28] checkoutmediawiki clones to /tmp then moves to NFS [18:38:14] git pull for wmf7 [18:40:15] I'll wait for scap to finish first [18:40:49] uhhh, contact cisco for....... [18:41:01] hrmm, i contected them for 1004 months ago and got it fixed [18:41:02] bad disk [18:41:09] i dont recall calling about analytics1007 [18:41:13] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [18:41:13] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [18:41:58] RECOVERY - Puppet freshness on srv191 is OK: puppet ran at Wed Jan 2 18:41:23 UTC 2013 [18:42:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [18:43:55] RECOVERY - Apache HTTP on srv191 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.079 second response time [18:53:45] notpeter, any chance you could look at Solr monitoring? [18:59:40] sure [19:03:25] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:03:27] paravoid, wanna? [19:04:19] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [19:04:20] I'll have to go for dinner in about 15-20' [19:04:26] but until then, sure :) [19:05:42] haha, hmm, i don't think 15-20 mins is enough time, i'd like someone around to make sure I don't destroy things [19:07:20] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41917 [19:07:36] paravoid: we have spare 320s in eqiad. [19:07:50] we have 24 or so we removed from old swift servers. [19:09:55] ottomata: okay, how about in an hour/hour and a half? [19:10:10] yeah that's good [19:12:16] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [19:12:16] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [19:12:16] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [19:12:51] binasher: if you roll out the shm_reclen change today, it might be a good opportunity for doing this one too: https://gerrit.wikimedia.org/r/#/c/41942/ (but this one is not urgent). it adds '\?.' to the varnishncsa filter regexp so that it only matches URLs that have a nonempty query string. rationale in the commit msg. [19:15:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:46] famous last words, etc. but i expect these will be the last tweaks we'll need for a long time. [19:15:57] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41942 [19:16:06] ori-l: yeah, that looks good [19:16:17] sweet, thanks [19:17:03] ottomata: isn't that conflicting with your change? [19:17:37] (41942) [19:17:41] Reedy: still have that error [19:17:49] AaronSchulz: I haven't tried to fix it [19:17:51] paravoid: hrm? why would it? [19:17:58] [18:40:15] I'll wait for scap to finish first [19:17:59] Reedy: but scap is done right? 
[19:18:02] No [19:18:06] ugh [19:18:06] paravoid, dont' think so [19:18:18] its the same file, but it should be a completely different portion [19:18:19] it's onto the last few hosts based on the dsh group [19:20:11] 8 to go [19:20:31] PROBLEM - Host silicon is DOWN: PING CRITICAL - Packet loss = 100% [19:21:31] <-- expected? [19:21:33] !log reedy Finished syncing Wikimedia installation... : Rebuild message cache for 1.21wmf7 [19:21:43] Logged the message, Master [19:22:29] silicon is not down, but it lost network [19:22:39] AaronSchulz: this looks suspicious [19:22:39] worktree = /tmp/new-mw-clone-1579598498/mw/extensions/AbuseFilter [19:22:50] New patchset: Demon; "Remove extension distributor mess from the Apaches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41976 [19:24:13] Reedy: what are you looking at? [19:24:54] AaronSchulz: /home/wikipedia/common/php-1.21wmf7/.git/modules/extensions/AbuseFilter/config [19:25:44] ffs [19:26:05] Things seem to vary between 1.21wmf5, 1.21wmf6 and 1.21wmf7 [19:26:15] When was fenari reinstalled.. [19:26:18] /upgraded [19:26:32] 17th December [19:26:37] !log changing novaadmin password in labvs [19:26:45] !log make that labs. [19:26:47] Logged the message, Master [19:26:55] Logged the message, Master [19:26:58] After wmf6 was checked out etc [19:26:59] grr [19:27:50] Yay, stackoverflow [19:27:51] "Why are you setting worktree at all? By default, the work tree is where you run your commands from, where the .git directory is" [19:27:55] New review: Demon; "Can't be merged until I9d3c328f is merged and deployed, but at least wanted to get this up for review." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/41976 [19:28:40] Reedy: doesn't the submodule mechanism automatically set worktree? [19:28:55] robla: It's not there on the wmf5 and wmf6 checkouts [19:28:59] Which were done with older versions of git [19:29:32] ahhh....ok [19:29:46] reedy@fenari:/home/wikipedia/common/php-1.21wmf7$ git --version [19:29:47] git version 1.7.9.5 [19:30:01] I believe this is fixed in (at least) the most current version of git (1.7.10.1). I can't seem to find a changelog, so I have no idea when it got fixed. I was able to have git fix the issue by deleting both the submodule and the folder in the .git/modules folder and then redoing git submodule init and git submodule update. [19:30:05] We've a crap version of git [19:30:25] RECOVERY - Host silicon is UP: PING OK - Packet loss = 0%, RTA = 27.29 ms [19:30:33] * Reedy attempts to cleanup [19:30:53] <^demon> There's lots of crap versions of git. We really should pin our version of git we install rather than ensure => latest [19:31:05] ensure latest? why??? [19:31:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [19:31:34] * Reedy smiles at paravoid [19:31:49] <^demon> AaronSchulz: I want to add it to base, then we can pin it there :) [19:33:16] presumably, Precise has been using a crap version of Git for months [19:34:56] mmm [19:35:00] * Reedy deletes more stuff [19:35:30] Yup [19:35:34] For anyones info: [19:35:45] Delete rm -r .git/modules/extensions* [19:35:47] rm -rf extensions [19:35:50] git submodule update --init [19:35:55] "fixes" it [19:37:02] Reedy: that's fixing 1.21wmf7? [19:37:09] yeah [19:37:30] I wonder if core.worktree is set.. 
[19:39:15] I was assuming that "git submodule update --init" was how the tree was created in the first place, so I wonder why it fixes it now [19:40:03] it probably doesn't actually fix it [19:40:09] just sets the working tree to be where it is now [19:40:14] I'll verify when it's finished [19:40:23] ah, ok [19:40:39] Reedy: hmmm? [19:41:08] robla: https://bugzilla.wikimedia.org/show_bug.cgi?id=11057 [19:41:22] robla: that's been in since 2007 and it's a really simple change [19:41:32] any way to get this into the queue? [19:41:33] that bug is too old, I refuse to look at it :-P [19:41:39] :D [19:42:18] 255 chars would likely be ideal [19:42:45] <^demon|lunch> Really only thing we need to do is have binasher perform the change on wmf wikis whenever we want. [19:42:48] robla: yup [19:42:51] worktree = /home/wikipedia/common/php-1.21wmf7/extensions/AbuseFilter [19:42:52] <^demon|lunch> There's nothing preventing us from making the core change. [19:43:03] I already increased it once ;) [19:43:16] what's the size now? [19:43:21] 32 (?) [19:43:27] oh, nm, I see in the comments [19:43:34] <^demon|lunch> Yeah, we should just make it 255. [19:43:50] Ryan_Lane: Don't you want group names OVER 9000!? [19:43:58] I only requested 64 ;) [19:44:07] <^demon|lunch> But anyway, no code changes needed for it. We can go ahead and make the change in master for 3rd parties. [19:44:15] <^demon|lunch> And make the change on wmf wikis whenever we feel like it. [19:44:27] wait.. Ryan_Lane is asking for a schema change across all wmf wikis based on something in active directory. [19:44:30] active directory? [19:44:33] Ryan_Lane: WHO ARE YOU? [19:44:34] :D [19:44:47] binasher: well, to be fair, this was in 2007 [19:44:54] paravoid: We've got a crap version of git in 12.04 it seems :( [19:44:59] <^demon|lunch> We're replacing CentralAuth with ActiveDirectory. [19:45:01] and it's needed for any external auth that syncs groups ;) [19:45:09] <^demon|lunch> Since *anything* would suck less than central auth. [19:45:09] we can do 48, but you'll have to get a majority of the majority to agree [19:45:15] Ryan_Lane: You could've also fdji'd yourself and fixed it :p [19:45:21] Reedy: what do you mean? [19:45:26] Reedy: true [19:45:36] does this matter for wmf at all? or is it more for intellipedia? [19:45:48] I avoid schema changes, generally [19:45:57] <^demon|lunch> binasher: In practical terms, no. Theoretically, it could. [19:45:58] * paravoid guesses it's for labsconsole [19:45:59] binasher: it's for third party users [19:46:11] paravoid: nope. we don't sync groups to mediawiki [19:46:16] paravoid: For some reason it sets an absolute worktree rather than a relative one.. So when we checkout to /tmp, copy that away, and delete the /tmp folder, it gets upset [19:46:29] but, a third party user asked about this [19:46:37] <^demon|lunch> Goforit. [19:46:41] <^demon|lunch> No reason not to. [19:46:44] third party users who want to integrate mediawiki with active directory.. i pitty the admin who has to deal with that! [19:47:06] binasher: it's easy. it's just ldap [19:47:08] binasher: we should probably support access as a backend for mediawiki. 
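For reference, Reedy's recovery steps from the worktree exchange above, collected into one sequence. The first three commands are the ones quoted in the log; the final command is an added verification step (not from the log), using the AbuseFilter module as the example, to confirm core.worktree no longer points at the old /tmp clone path:

    # run from the broken checkout, e.g. /home/wikipedia/common/php-1.21wmf7
    rm -r .git/modules/extensions*
    rm -rf extensions
    git submodule update --init

    # verification: should now show the real extensions path, not /tmp/new-mw-clone-*/...
    git config -f .git/modules/extensions/AbuseFilter/config core.worktree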
[19:47:15] <^demon|lunch> binasher: You're describing the first mediawiki install I managed ;-) [19:47:33] to be fair AD's group schemas are quite standard [19:47:41] and used in other ldap configs too [19:48:12] third party user == intellipedia [19:48:22] well, they are one of them :D [19:48:27] !log aaron synchronized php-1.21wmf7/extensions/ConfirmEdit 'deployed 8fddb4a805ee3e79ef07ef3ca78a7ae7df5f4bba' [19:48:28] ^demon|lunch: you went from that to being paid to work on this thing? bwahahah [19:48:31] Ryan_Lane: btw, I see lots of salt-minion dmesg messages again [19:48:33] intellipedia doesn't use the ldap extension, though [19:48:36] Logged the message, Master [19:48:36] exit code 42 or something [19:48:43] paravoid: since the upgrade? [19:49:00] hah, i knew i'd get Ryan_Lane to spill something classified. so they don't use ldap! [19:49:14] I'm sure they probably use ldap [19:49:15] :D [19:49:15] yes [19:49:24] notpeter: not Access..but .. http://www.mediawiki.org/wiki/Extension:MSSQLBackCompat :) [19:49:28] code 42? [19:49:31] ^demon|lunch, when you get back from lunch, was SVN marked as read-only per http://lists.wikimedia.org/pipermail/wikitech-l/2012-October/064024.html ? [19:49:37] [184561.062848] init: salt-minion main process (18253) terminated with status 42 [19:49:37] <^demon|lunch> No. [19:49:39] <^demon|lunch> I was asked not to. [19:49:39] [187336.251528] init: salt-minion main process (19138) terminated with status 42 [19:49:43] [190345.439739] init: salt-minion main process (20047) terminated with status 42 [19:49:46] * Ryan_Lane nods [19:49:53] <^demon|lunch> "People might want to still use svn." [19:50:02] <^demon|lunch> Which isn't really an excuse, but I was busy and didn't fight it. [19:50:28] ^demon|lunch: ok, that makes me NOT delete the MediaWiki-CodeReview list then [19:50:35] paravoid: ok. I'm going to open a new issue with salt for that [19:50:46] may just need to add it into the init script [19:50:48] <^demon|lunch> mutante: Go ahead and delete the list. I don't believe anyone's actually using SVN anymore. [19:50:58] ^demon|lunch: Ryan_Lane: schema change for user_group.ug_group is ok with me [19:51:01] <^demon|lunch> And the 1 or 2 who are, aren't using CodeReview. [19:51:02] Thehelpfulone: also, should it not just receive git mail instead or something? [19:51:07] binasher: cool [19:51:16] RECOVERY - Host analytics1007 is UP: PING WARNING - Packet loss = 44%, RTA = 56.34 ms [19:51:25] ^demon|lunch: ok, i guess its a good way to find out:) [19:51:26] <^demon|lunch> mutante: No, we have mediawiki-commits that gets all the gerrit spam. [19:51:27] I think mediawiki-commits has that mutante [19:51:39] <^demon|lunch> mediawiki-CodeReview can be retired 3 months ago. [19:51:40] yea, i remember the issues with that we had in the beginning [19:51:45] ok [19:52:20] <^demon|lunch> Every last tiny bit of svn that can be killed, KILL WITH FIRE. [19:53:13] !log reedy synchronized php-1.21wmf7/extensions/ [19:53:22] Logged the message, Master [19:53:25] Thehelpfulone: <-- "Pywikipedia-svn" [19:53:52] heh, does that one have archives? 
I think Pywikipedia is the only group still using SVN [19:54:22] yes, it does have archives [19:54:29] April 2009 'til now [19:54:29] from the pywikipedia mailing list I think they're considering moving to github and then in the future to gerrit, or moving directly to git/gerrit so that they're still close to the other WMF stuff [19:54:45] why github..sigh [19:54:52] directly,yay [19:55:16] see thread starting at http://lists.wikimedia.org/pipermail/pywikipedia-l/2012-December/007657.html [19:56:51] <^demon|lunch> Hmm, speaking of CodeReview, we should disable that for everyone and make it read-only. [19:57:18] oh actually it looks like maybe then do want to go direct [19:57:22] Reedy: other than modifying tables.sql, is there something else I need to do for this change [19:57:23] ? [19:57:32] ideally create patches [19:57:37] MaxSem: so, the check needs to be something more like [19:57:38] /usr/lib/nagios/plugins/check_http -H solr1001 -I 10.64.0.198 -p 8983 -u "solr/select/?q=*%3A*&start=0&rows=1&indent=on" [19:57:47] but that's returning a 404 [19:57:48] Or update the old ones that I made before, and possibly update the updater lines [19:57:53] so I think that something is being escaped wrong [19:57:55] Thehelpfulone: http://phenoelit.org/blog/archives/2012/12/21/let_me_github_that_for_you/index.html [19:57:58] still looking at it [19:59:05] Thehelpfulone: http://www.drupal4hu.com/node/242 [19:59:50] mutante, heh you don't need to convince me, but dropping those links on the mailing list would be a good idea to help steer them in the right direction ;-) [20:00:38] New patchset: Demon; "Revoke all permissions to CodeReview actions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41978 [20:02:51] Thehelpfulone: right.. if only it would not mean having to subscribe, post and unsubscribe.. dont want that whole list [20:03:09] lemme delete that codereview list now [20:03:12] bah okay I'll do it :P [20:03:15] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki, wikidatawiki and mediawikiwiki to 1.21wmf7 [20:03:23] Thehelpfulone: :) [20:03:23] Logged the message, Master [20:03:27] Reedy: can I just modify patch-ug_group-length-increase.sql? [20:04:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:12] Ryan_Lane: And the ufg_group too [20:05:19] yep. seeing that [20:05:27] Ryan_Lane: If I were you, I'd rename the files slightly too and update MySqlUpdater.php etc to match [20:05:36] -_- [20:05:42] I'll just let someone else do this :) [20:05:49] lol [20:05:54] Then you don't have to mess around with fixing the update row thing [20:06:07] I need to modify it for every damn database, too, eh? [20:06:25] Not the files.. [20:06:39] sqlite mostly uses the same files [20:06:43] I'll make the changes in a few minutes [20:06:46] shouldn't take long [20:06:49] thanks [20:07:22] why is our database stuff in maintenance/ ? :) [20:07:39] legacy reasons i guess [20:07:43] <^demon|lunch> maintenance/archives, at that. [20:07:44] * Ryan_Lane nods [20:07:46] yes [20:07:48] <^demon|lunch> What a wonderful name. [20:07:52] why is mediawiki called mediawiki but has no video support in core and only recently in extensions? [20:08:01] you'd think we'd split that into schema/mysql schema/sqlite, etc [20:08:15] and have patches/ directories rather than archives/ [20:08:16] <^demon|lunch> AaronSchulz: Because "TextWiki" was lame sounding. 
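Returning to the Solr check a few lines up: the log does not show what finally fixed the 404, but one plausible culprit is the -u argument, which check_http sends as the raw request path and which therefore normally needs a leading slash (and Solr's select handler is usually reached as /solr/select?... rather than .../select/?...). A guess at the intended invocation, to be verified against the actual URL layout on solr1001:

    /usr/lib/nagios/plugins/check_http -H solr1001 -I 10.64.0.198 -p 8983 \
      -u "/solr/select?q=*%3A*&start=0&rows=1&indent=on"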
[20:08:16] AaronSchulz: irony [20:08:58] ^demon|lunch, being a crat on MediaWiki.org, did you want to remove the user right from all of https://www.mediawiki.org/w/index.php?title=Special:ListUsers&offset=&limit=500&group=coder too? :P [20:10:06] <^demon|lunch> Lazzzzyyyyy [20:16:39] hmm I can't find a script to do it, if you give me +crat and bot I'll do it manually and let you know when it's done so you can remove it completely or I can poke some other hapless crat :-) [20:16:46] Thehelpfulone: hahaha https://lists.wikimedia.org/mailman/private/ops-related/2012-November/000031.html [20:17:00] Thehelpfulone: what right? [20:17:01] it got spammed ...a bit [20:17:08] the list is private, was that my post? [20:17:17] Nemo_bis, "coder" per https://gerrit.wikimedia.org/r/#/c/41978/1/wmf-config/CommonSettings.php [20:17:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.798 seconds [20:17:26] coders* [20:17:40] oops, no.. it was from "Administrator Julia" in the Ukraine.. "Here are new ladies profiles for you" :o [20:17:49] haha [20:18:26] our lists are set to not accept posts from non-members by default, so either andrew changed that or he let them through the moderation queue ;-) [20:18:39] Thehelpfulone: but those are not the only rights in that group [20:18:59] Nemo_bis, all the other ones are included in other groups, are they not? [20:19:13] Thehelpfulone: so what [20:19:35] <^demon|lunch> We don't need to delete the coder group. [20:19:48] <^demon|lunch> A) There's other rights on it, and B) It's not hurting anything. [20:19:57] <^demon|lunch> I'm just wanting to lock down the CR tool, which my change does. [20:20:23] +1 [20:21:05] Ryan_Lane: https://gerrit.wikimedia.org/r/41983 [20:21:10] oh oops, I thought that editors/reviews was assignable by users but I'm thinking testwiki [20:21:50] Reedy: most of those still show 32? [20:21:57] https://gerrit.wikimedia.org/r/#/c/41983/1/maintenance/archives/patch-ufg_group-length-increase-255.sql,unified [20:22:11] Yeah, it was recording it as a delete and add, so I reset it to make it do a move [20:22:17] and forgot to change the mysql ones again [20:22:18] Reedy: we need wikidata switched back to wmf6 [20:22:20] heh [20:22:34] some reason all wikidata items are labeled as "english" [20:22:42] A maintenance/archives/patch-ufg_group-length-increase-255.sql 2 lines Side-by-Side Unified [20:22:42] D maintenance/archives/patch-ufg_group-length-increase.sql Side-by-Side Unified [20:22:43] A maintenance/archives/patch-ug_group-length-increase-255.sql 2 lines Side-by-Side Unified [20:22:43] D maintenance/archives/patch-ug_group-length-increase.sql Side-by-Side Unified [20:22:46] Meh, it's done it again anyway [20:22:48] stupid thing [20:24:09] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikidatawiki back to 1.21wmf6 [20:24:17] Logged the message, Master [20:24:19] Reedy: thanks :) [20:24:40] New patchset: Reedy; "testwiki, test2wiki and mediawikiwiki to 1.21wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41984 [20:26:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41984 [20:29:25] !log Modified user_former_groups.ufg_group to varbinary(255) [20:29:34] Logged the message, Master [20:30:46] ottomata: ping? [20:31:52] pnoooooonnnng [20:31:55] yeah let's do it [20:32:11] shoudl we do one at a time? 
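The ug_group schema change agreed on above boils down to widening two columns. A sketch of what the new maintenance/archives patch files, and the manual change binasher logs just above and just below, amount to; the NOT NULL default '' attributes and the /*_*/ table-prefix placeholder are assumptions based on MediaWiki's usual table definitions rather than a quote of the actual patch:

    ALTER TABLE /*_*/user_groups
      MODIFY ug_group varbinary(255) NOT NULL default '';
    ALTER TABLE /*_*/user_former_groups
      MODIFY ufg_group varbinary(255) NOT NULL default '';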
[20:33:03] !log Modified user_groups.ug_group to varbinary(255) [20:33:12] Logged the message, Master [20:34:13] Reedy: wasn't it going to 64? [20:34:53] whatevs [20:35:13] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:35:13] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:37:31] ottomata: one of what? [20:37:50] binasher: lol. Ryan said 255 on the bug, but I think robla was going with 64 for trolling reasons [20:38:05] Ryan_Lane: I'd fix labsconsole for you, but I don't have access :p [20:39:11] the 3 services that are getting changes [20:39:15] nginx, varnish, squid [20:39:22] maybe in that order? [20:40:08] nginx and varnish are a single commit [20:41:03] hm, yea, but I guess I meant babysitting them [20:41:11] running puppet on an nginx, etc. [20:41:24] i haven't done any puppet deployments with large numbers of machines before though [20:41:35] well [20:41:36] is it a pain to stop puppet on all of them while we check? [20:41:40] yes :) [20:41:42] aye [20:42:07] well, i mean, i've tested this on nginx, squid, and varnish instances of my own [20:42:26] oh you have? [20:42:28] even better [20:43:10] yeah, log1.pmtpa.wmflabs has all these changes + a mediawiki instance running, and even some python unit tests to hammer the different frontends and check the resulting log output [20:43:11] I'd say just push it [20:43:19] and force run puppet on a few of them to check that all is well [20:43:27] seems safe enough [20:43:32] Reedy: robla just curious how much a hassle or how easy would it be to have a test wiki (like test2) for wikidata? [20:43:59] (famous last words) [20:44:00] hmm, not exactly sure how to check though, i mean, i guess the first check is that the services start back up ok :p [20:44:19] i guess i can grep udp2log output for a node name [20:44:25] and we can send some requests there to see if the log format changes [20:44:30] yeah, I was hoping that you could check on the receiving end [20:44:48] ok, i'll push, can you give me a squid, nginx, and varnish frontend node name that I can grep for? [20:44:49] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Wed Jan 2 20:44:24 UTC 2013 [20:44:54] I think we have enough real requests [20:45:03] hm, yeah true ok [20:45:14] pick one, force ran puppet there and see [20:45:25] yeah, but what are they (i've never logged into a front end server) [20:45:35] nginx is easy, it's ssl1-4, ssl1001-1004 and ssl3001-3004 [20:45:41] try eqiad [20:45:46] e.g. ssl1001 [20:46:14] varnish, well, look at site.pp :) [20:47:41] aude: technically fairly easily [20:48:00] Reedy: so, i can't reproduce [20:48:02] http://www.wikidata.org/wiki/Wikidata:Project_chat#Bug_Found [20:48:22] someone says it happened before, so i just wonder if it has to do with localisation cache or something [20:48:57] * aude thinks we'll have a tough time debugging this on our test instances and has somethign to do with deployment / configurations [20:49:40] paravoid, role::cache::upload [20:49:40] ? [20:49:42] for varnish? [20:49:45] the comment says: [20:49:46] # sq79-86 are upload squids [20:49:50] buuuut, maybe it means varnish? 
[20:50:12] seeing as it looks like role::cache::upload sets up varnish [20:51:07] role::cache::upload does both iirc [20:51:19] it's varnish for eqiad, squid for pmtpa [20:51:21] esams is both [20:51:32] squid in production, varnish in testing (again, iirc) [20:51:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:56] class upload { [20:51:56] # FIXME: remove this hack [20:51:56] if $::site == "eqiad" or ($::site == "esams" and $::hostname =~ /^cp30/) { [20:52:02] ah hm [20:52:02] right [20:52:04] ^^^^ binasher: yes, I was totally trolling back there [20:52:05] Change abandoned: Demon; "Will do this later." [operations/gerrit/plugins] (master) - https://gerrit.wikimedia.org/r/39580 [20:52:24] aude: isn't the wikidata client already running on test2? [20:52:29] ok, paravoid, so: [20:52:42] node /^cp10(2[1-9]|3[0-6])\.eqiad\.wmnet$/ { [20:52:50] eqiad varnish [20:52:55] nginx: ssl1001.wikimedia.org [20:52:55] varnisih: sq79.wikimedia.org [20:52:55] squid: cp1001.eqiad.wmnet [20:52:55] or you can check on the mobile varnish too [20:52:57] (we can move to #wikimedia-tech for this conversation, btw) [20:53:09] hm [20:53:26] hm ok, so cp1021 for varnish? [20:53:41] yeah [20:53:46] hokay [20:54:04] robla: yes but i'm talking about the repo [20:54:04] cp1041-4 are the mobile ones [20:54:11] I'm presuming X-Carrier matters more there [20:55:04] we can't really have both the client and repo extensions on the same wiki [20:56:18] yeah true [20:56:28] that's varnish? ok maybe i'll check 1041 for that then [20:56:37] yeah, mobile runs on varnish exclusively [20:56:44] separate cluster too [20:56:57] which only lives in eqiad [20:57:28] ok cool [20:57:33] going to merge that change in now [20:57:38] sure [20:58:04] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12188 [20:59:34] ok, running puppet on cp1041 [21:04:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.868 seconds [21:05:52] puppet takes a long time to ruunnnnn [21:09:16] ottomata: Is it quicker or slower than scap? [21:09:16] :p [21:10:08] hehe, hmm, paravoid, puppet didn't make my change on cp1041 [21:10:24] did you merge on sockpuppet? [21:10:48] yes [21:10:51] lemme double check [21:11:20] ja, its there [21:15:07] ottomata: it's there but it's not on stafford [21:15:12] you probably forgot to forward your agent [21:15:20] ahh, no i know [21:15:22] yeah i just saw that too [21:15:25] just fixed that [21:15:26] no [21:15:31] when I merged on sockpuppoet [21:15:53] i used ctrl-r to bring up the fetch command, and someone had previously done it before me with the —no-ff flag [21:16:12] i hit enter, but then saw that flag last second and hit ctrl-c (because I wasn't used to seeing that flag) [21:16:23] but I think the merge completed, but didn't execute the hook or whatever it does to update stafford [21:16:30] so, I just fetch + merged on stafford [21:16:32] looks ok now [21:16:40] yeahhhhh! [21:16:55] New review: Pgehres; "Fundraising still has a number of critical things that we maintain in the Wikimedia repo. Now that ..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/41978 [21:17:09] hmmm, i see the changes in the logs too! [21:17:10] cool! [21:17:16] yeah! [21:17:17] cool! [21:17:35] ok, running puppet on ssl1001 [21:18:05] New review: Chad Horohoe; "I totally forgot about the Wikimedia repo (and didn't realize you guys used CodeReview at all). 
No p..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/41978 [21:19:27] New review: Pgehres; "No worries. We do want to dump svn like a ton of bricks, so we will be making the effort to move ev..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/41978 [21:19:46] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:19:53] ha hm [21:19:54] ottomata: ^ [21:20:14] uh oh! [21:21:05] hmmm, my change isn't on cp1021 yet [21:21:12] yeah, maybe it's unrelated [21:21:25] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [21:21:47] hm, i'm not getting any logs from ssl1001, ideas on how I can send it a request it will answer? [21:21:50] (looking at configs now) [21:23:58] PROBLEM - Varnish HTCP daemon on cp1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:25:37] RECOVERY - Varnish HTCP daemon on cp1021 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [21:26:16] oh! [21:26:23] i know why im' not getting any requests [21:26:28] nginx logs aren't going to oxygen! [21:26:37] drdee, should they be? [21:27:51] huhhhh????/ [21:28:00] that's ssl traffic, right? [21:28:03] yes [21:28:15] I remember multiple issues with that, not sure if that got fixed in the meantime [21:28:20] i am pretty sure i have seen ssl traffic logs in udp2log [21:28:38] I'm pretty sure sequence numbers didn't work [21:28:46] not sure if someone fixed it in the meantime [21:28:55] yes [21:28:58] they are going to emery and locke [21:29:00] but not oxygen [21:29:07] ohhhhhh [21:29:08] and oxygen is the multicast relay [21:29:16] yeah, the seq numbers still don't work [21:29:22] why is that? [21:30:05] they never have [21:30:12] multithreaded worker stuff in nginx [21:30:17] but [21:30:21] that doesn't stop us from sending the logs [21:30:29] well, not really multithreaded related anymore [21:30:30] just stops us from using seq numbers to measure packet loss from nginx nodes [21:30:32] just buggy [21:30:34] oh [21:30:36] hm dunno then [21:30:37] but anyway [21:30:40] there's this too [21:30:49] doesn't nginx just proxy to squid anyway? [21:30:50] or varnish? [21:30:55] yes [21:31:04] so if we were logging nginx reqs then we'd have duplicates in the logs [21:31:11] if you don't detect that, yes [21:31:21] afaik no one is thinking bout it [21:31:24] drdee? 
[21:31:28] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:31:31] i know we've sorta talked about that before [21:31:55] PROBLEM - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:32:02] we definitely have talked about this :) [21:32:04] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:32:22] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:32:24] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:32:24] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:32:28] i know about the sequence number problem, but that's independent from sending traffic to oxygen [21:32:31] PROBLEM - HTTPS on ssl3002 is CRITICAL: Connection refused [21:32:35] right [21:32:37] totally [21:32:40] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:32:41] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:32:44] uh oh [21:32:45] revert [21:32:48] those me? [21:32:49] ok [21:32:49] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:32:49] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:32:49] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:32:50] yes [21:32:58] PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [21:33:03] just nginx or varnish too? [21:33:07] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: Connection refused [21:33:16] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64705 bytes in 0.704 seconds [21:33:29] nginx was dead on ssl3002 [21:33:34] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64708 bytes in 0.692 seconds [21:33:35] started it, seems to run now [21:33:37] does it start back up as is? [21:33:38] so hold off that revert [21:33:40] ok [21:33:45] it was fine on ssl1001, ok [21:33:52] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64708 bytes in 0.564 seconds [21:34:01] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 774 bytes in 0.514 seconds [21:34:01] the lb.esams thing is unrelated though, right? [21:34:02] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64705 bytes in 0.690 seconds [21:34:10] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64708 bytes in 0.797 seconds [21:34:10] RECOVERY - HTTPS on ssl3002 is OK: OK - Certificate will expire on 08/22/2015 22:23. 
[21:34:11] no it isn't [21:34:18] oh i see, it uses that nginx instance [21:34:18] hm [21:34:19] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3916 bytes in 0.454 seconds [21:34:25] hm [21:34:28] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64708 bytes in 0.709 seconds [21:34:28] or something [21:34:28] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64708 bytes in 0.696 seconds [21:34:28] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64712 bytes in 0.689 seconds [21:34:28] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 93373 bytes in 0.794 seconds [21:34:33] !log starting nginx on ssl3002 [21:34:37] RECOVERY - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64708 bytes in 0.687 seconds [21:34:42] Logged the message, Master [21:34:44] Ryan_Lane: we're still two SSL boxes down on esams? [21:34:55] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 77771 bytes in 0.796 seconds [21:35:15] thank you paravoid [21:35:18] paravoid: yes [21:35:25] let me see if I can get into their console now [21:35:43] ok, well, anyway, I see the logs coming in on emery from other ssl hosts [21:35:45] and they look good [21:35:56] so now, squid? [21:35:57] we lost one ssl box and the site went down? [21:36:10] that doesn't sound right [21:36:21] what do you mean? [21:36:31] sh had spence routed to ssl3002 [21:36:35] ahhhhh [21:36:36] right [21:36:44] sorry. that sounds perfectly right then :D [21:36:55] hm [21:36:55] always forget about that [21:36:57] thinking about it [21:37:03] I'm not even sure that pybal would depool ssl3002 [21:37:10] it should [21:37:10] 3/4 is probably over the threshold [21:37:12] oh [21:37:24] let's check [21:37:28] indeed [21:37:41] we need to be able to run on a single node if necessary [21:38:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:38:22] paravoid, the last bit is to change the squid frontend.conf.php template on fenari, and then some magic that I don't know about to deploy the changes [21:39:00] 2013-01-02 21:24:19.922583 [uploadlb6] Could not depool server ssl3002.esams.wikimedia.org because of too many down! [21:39:06] boooooo [21:39:13] thought so [21:39:52] !log powercycled ssl3001 [21:39:56] so, yeah, we lost 50% of the site because one box went down [21:40:01] Logged the message, Master [21:40:11] well, not really 50% [21:40:12] for https and ipv6 [21:40:14] ah [21:40:14] right [21:40:23] so 20% of 50%? [21:40:31] heh [21:40:32] probably much less [21:40:35] yeah [21:40:44] but, that's a situation that needs to be resolved [21:40:54] thankfully I can get into the mgmt consoles now [21:41:04] oh yay [21:41:15] so, I'm bringing 3001 up [21:41:19] and hopefully 3004 [21:41:32] i thought 3004 had a broken hw? [21:41:47] The disk drive for /var/log/nginx is not ready yet or not present. 
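On the "Could not depool ... too many down" message above: pybal keeps a failed realserver in rotation if pulling it would shrink the pool below a configured fraction of its servers. A rough sketch of that guard (illustrative only; the 0.5 threshold is an assumed value, not what the esams ssl pool is actually configured with):

    def can_depool(currently_pooled, total_servers, depool_threshold=0.5):
        # Refuse the depool if it would leave fewer than
        # depool_threshold * total_servers realservers in rotation.
        return (currently_pooled - 1) >= total_servers * depool_threshold

    # The esams https pool above: 4 servers configured, ssl3001 and ssl3004
    # already out, then nginx dies on ssl3002.  Pulling it as well would
    # leave only ssl3003 in rotation, so pybal refuses:
    print(can_depool(currently_pooled=2, total_servers=4))  # False -> "too many down!"

Which is why one dead nginx took a slice of the https/ipv6 traffic down with it: the broken box stayed pooled.
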
[21:41:53] that's on 3001 [21:41:55] weird [21:42:06] well, 3004 was down and we couldn't get into the mgmt console [21:42:08] it may be fine [21:42:22] I have a weird recollection of mounting a separate filesystem for /var/log/nginx [21:42:48] when we deployed ipv6 and had access logging and we lost all the whole cluster due to their disk being full [21:42:51] yeah [21:43:01] not sure why I'd mount an lv instead of just deleting logs though [21:43:09] it's really vague now :) [21:43:13] :D [21:43:16] could be my fault though [21:43:20] it was filling it up very quickly [21:43:26] like per day [21:43:31] until I turned the access logs off [21:43:42] so an lv does make some sense there [21:43:53] I'm removing it from the fstab [21:44:39] nod [21:45:52] ottomata: you can't complain, it'd be boring without some excitement [21:46:17] haha [21:46:27] yeehaw [21:46:44] do you know about how squid confs get deployed (since it isn't in puppet)? [21:46:45] ottomata: always "diff -u" btw [21:46:52] yes, I've done it multiple times [21:47:40] puppet is getting worse and worse [21:47:46] I can't even get it to run on most of the systems [21:47:46] yes :( [21:48:03] we need to debug it at some point [21:48:05] virt0 said it was administratively disabled today, too [21:48:07] which is weird [21:48:16] not sure who did that [21:48:57] so, should I make the change to frontend.conf.php and let you deploy, paravoid? [21:49:56] erm [21:49:58] what sets X-Carrier? [21:50:22] RECOVERY - Puppet freshness on ssl3001 is OK: puppet ran at Wed Jan 2 21:50:09 UTC 2013 [21:50:29] paravoid: i think templates/varnish/mobile-frontend.inc.vcl.erb [21:50:49] sounds right, most of the logs won't ahve it [21:50:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds [21:51:09] mobile doesn't ever go through squid [21:51:17] why add that to the squid conf? [21:51:52] to be consistent, the logs all need to ahve hte same number of fields [21:52:39] er [21:52:47] sounds a bit silly [21:52:51] but I guess okay [21:52:51] haha [21:53:20] can't say I disagree with you, but i'm sure erik z's scripts would not be happy if things weren't consistent [21:53:31] ok, the fields are now in frontend.conf.phhp [21:53:38] deploy at will, maybe just to one server first if you can [21:53:39] I know, I did that [21:53:49] ahh ok good, i had thought maybe i had and forgot that I had [21:53:53] heheh [21:54:00] it's under git [21:54:00] can you deploy to just one server? cp1001 maybe? [21:54:03] not gerrit though [21:54:05] oh it is? [21:54:13] yes, it was rcs [21:54:19] and I moved it to git :-) [21:54:23] ah nice [21:54:31] well that's good [21:55:26] oh crap [21:55:31] ha, uh oh [21:56:13] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [21:56:13] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [21:56:13] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [21:56:13] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [21:56:13] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [21:56:13] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [21:56:21] sigh [21:56:30] what's up? 
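The consistency requirement mentioned above (every log stream carrying the same number of fields so the downstream stats scripts keep parsing) is easy to spot-check mechanically. A minimal sketch; the expected count of 14 and plain whitespace splitting are assumptions for illustration, not the real frontend log format:

    import sys
    from collections import Counter

    EXPECTED_FIELDS = 14   # assumed; use whatever the agreed log format defines

    # Tally the field count of every non-empty line piped in on stdin.
    counts = Counter(len(line.split()) for line in sys.stdin if line.strip())
    print(counts)

    unexpected = {n: c for n, c in counts.items() if n != EXPECTED_FIELDS}
    if unexpected:
        sys.exit("inconsistent field counts: %r" % unexpected)

Feeding it a few thousand sampled lines from the squid, varnish and nginx streams before and after a change like this should produce the same single count everywhere.
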
[21:56:56] the deploy script runs /usr/sbin/squid -k parse to verify that the config is sane [21:57:04] fenari was upgraded to precise [21:57:16] there's no /usr/sbin/squid anymore, since there's no squid 2.x anymore [21:57:26] and squid3's config is of course different [21:57:43] ah yay! [21:57:54] so deploy script doesn't work on fenari anymore? [21:59:34] New patchset: Pyoungmeister; "fixup for solr monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42030 [22:01:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42030 [22:03:19] !log reprepro copysrc precise-wikimedia lucid-wikimedia squid [22:03:29] Logged the message, Master [22:03:38] !log downgrading squid from 3.1 to 2.7 on fenari [22:03:46] Logged the message, Master [22:05:26] ok. ssl3001 is back up [22:05:40] our bastion hosts are weird [22:05:40] RECOVERY - HTTPS on ssl3001 is OK: OK - Certificate will expire on 07/19/2016 16:14. [22:05:55] what the hell do we need libdirectfb-dev for in a bastion host? [22:05:59] alas, I still can't ssh into 3004's mgmt [22:06:18] binasher: libmysqlclient-dev : Depends: libmysqlclient18 (= 5.5.28-0ubuntu0.12.04.3) but 5.5.28-mariadb-wmf201212041~precise is to be installed [22:06:52] binasher: I don't think fenari should need libmysqlclient-dev for anything, so I'm just going to remove that, but it might be a more general problem [22:09:39] texlive on fenari, oh dear [22:09:44] okay I'm going to stop digging [22:10:06] ottomata: deployed on cp1001 [22:10:29] yeah cool! [22:10:30] i see it too! [22:10:31] yay! [22:10:32] it works [22:12:23] New patchset: Cmjohnson; "Removing pappas and te from production dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42035 [22:13:11] New review: Cmjohnson; "Looks good to me" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42035 [22:13:29] !log deploying squid config for Accept-Language/X-Carrier, RT 2745 & 3158 [22:13:38] Logged the message, Master [22:13:40] ottomata: deployed everywhere [22:13:44] and commited [22:13:54] New review: Bouron; "Obvious minor edits. Just new array items." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/41967 [22:14:57] woohoo [22:16:16] I still don't see the point of logging what random people on the internet would set as X-Carrier [22:16:46] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [22:16:46] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [22:16:46] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:16:46] PROBLEM - Puppet freshness on analytics1005 is CRITICAL: Puppet has not run in the last 10 hours [22:16:46] PROBLEM - Puppet freshness on analytics1004 is CRITICAL: Puppet has not run in the last 10 hours [22:16:46] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [22:16:53] i.e. why it's part of frontend config [22:16:55] but oh well [22:17:21] because erik zachte has tons of scripts that parse these logs [22:17:28] and if they aren't consistent, his scripts don't work [22:17:28] ottomata: are you going to resolve those tickets? [22:17:38] oh yeah, I can do that [22:17:49] great [22:17:52] anything else that I can do? [22:18:19] thanks paravoid for helping us out! 
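The breakage is in the sanity check rather than in the config itself: the deploy script shells out to squid to parse the candidate file, and a precise fenari no longer has a squid 2.x binary to do that with. A sketch of that style of pre-deploy check (not the actual deploy script; the paths and file name are illustrative):

    import os
    import subprocess
    import sys

    SQUID = "/usr/sbin/squid"   # squid 2.7; the squid3 package ships /usr/sbin/squid3
                                # instead, and would reject a 2.7-style config anyway

    def parse_check(conf_file):
        # Fail loudly if the parser binary is missing, then let squid itself
        # validate the candidate config with `-k parse` before anything is pushed.
        if not os.path.exists(SQUID):
            sys.exit("%s not found; install squid 2.x before deploying" % SQUID)
        subprocess.check_call([SQUID, "-f", conf_file, "-k", "parse"])

    parse_check("frontend.conf")   # raises CalledProcessError on a bad config
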
[22:18:30] no problem [22:18:34] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [22:18:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [22:18:34] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:18:35] PROBLEM - Puppet freshness on analytics1004 is CRITICAL: Puppet has not run in the last 10 hours [22:18:35] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: Puppet has not run in the last 10 hours [22:19:02] binasher: shall I resolve https://rt.wikimedia.org/Ticket/Display.html?id=2635 ? [22:19:46] PROBLEM - HTTPS on ssl3003 is CRITICAL: Connection refused [22:19:47] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: Connection refused [22:20:04] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: Connection refused [22:20:04] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: Connection refused [22:20:07] oh crap [22:20:13] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: Connection refused [22:20:13] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [22:20:13] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [22:20:14] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:20:14] PROBLEM - Puppet freshness on analytics1004 is CRITICAL: Puppet has not run in the last 10 hours [22:20:19] !log starting nginx on ssl3003 [22:20:28] Logged the message, Master [22:20:49] paravoid - thank you [22:21:00] haha [22:21:24] love how leslie thanks me every time :) [22:21:29] paravoid: what's happening on those boxes? [22:21:34] RECOVERY - HTTPS on ssl3003 is OK: OK - Certificate will expire on 07/19/2016 16:13. [22:21:35] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.450 second response time [22:21:36] a new config is being pushed [22:21:40] ahhhhhh [22:21:41] right [22:21:46] puppet is trying to refresh [22:21:52] and somehow ends up killing nginx [22:21:52] would be good to know why they hang when reload is used [22:21:53] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 51007 bytes in 0.671 seconds [22:21:53] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 47134 bytes in 0.667 seconds [22:22:01] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [22:22:02] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [22:22:02] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:22:02] PROBLEM - Puppet freshness on analytics1004 is CRITICAL: Puppet has not run in the last 10 hours [22:22:02] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: Puppet has not run in the last 10 hours [22:22:02] restart works, reload hangs [22:22:20] dude why does nagios keep telling us about puppet freshness [22:22:24] maybe we should force puppet to do a restart for config changes on those [22:22:36] LeslieCarr: because it's still stale :) [22:22:59] excess flood for just 5 messages? [22:23:06] did the thresholds get lowered or something? 
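On "restart works, reload hangs": the workaround floated above is to have the config-change handler on these ssl hosts do a syntax check plus a full restart instead of sending nginx a reload. A sketch of that handler (illustrative only; the real fix would live in the puppet service definition for nginx):

    import subprocess

    def apply_nginx_config():
        # Syntax-check the new config first, then restart rather than reload,
        # since reload is what hangs on these ssl termination hosts.
        subprocess.check_call(["/usr/sbin/nginx", "-t"])
        subprocess.check_call(["service", "nginx", "restart"])

    apply_nginx_config()
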
[22:23:49] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [22:23:50] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [22:23:50] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:23:50] PROBLEM - Puppet freshness on analytics1004 is CRITICAL: Puppet has not run in the last 10 hours [22:23:50] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: Puppet has not run in the last 10 hours [22:24:43] no [22:24:46] maybe it's tryingto do more [22:24:59] yeah, but it's sending the notification constantly instead of every 10 hours [22:25:15] maybe it notices that it's failed to do so because it was kicked out? [22:25:18] and trying to repeat it? [22:26:10] it's getting late over here and I've been working straight for hours, I don't think I have the courage to debug irc bots :) [22:26:12] any takers? [22:26:37] you are totally free to go to sleep :) [22:27:37] ottomata: hey - is the analytics puppet expected ? [22:27:41] ottomata: do you expect puppet to be stopped/broken on all analytics hosts or just one? [22:27:50] naw, not expected, lemme check on it [22:29:02] ok [22:29:18] well, to be fair puppet is currently slightly fucked [22:29:20] that's the technical term [22:29:22] ;) [22:29:29] stafford is cpu pegged like crazy [22:30:06] ugh [22:30:14] yeah, [22:30:22] according to an03, puppet agent run is currently happening [22:30:27] i guess its just ahnging [22:30:35] will restart puppet agent on those machines [22:30:49] oh it's not really the agent i think [22:30:59] just check out htop on stafford [22:31:00] ;) [22:31:46] it's the ruby flamethrower! [22:32:36] haha, yeahhhhhh [22:32:40] it is overloaded fo sho [22:33:14] how many machines is that one puppetmaster serving? [22:33:57] all of them? [22:34:12] (i have no idea how many all is :p ) [22:34:19] oh [22:34:49] nor do i, really. a few hundred i think? [22:35:00] * Jeff_Green hangs head in shame [22:39:09] New patchset: Reedy; "Remove old pmtpa-dumpX.dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42038 [22:39:33] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42038 [22:42:24] paravoid, btw, thanks for your help today! [22:42:26] i'm out for the eve [22:42:28] laataas [22:44:16] New patchset: Reedy; "Remove some more old symlinks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42039 [22:47:09] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42039 [23:01:39] Reedy, are you still deploying? 
[23:01:54] No [23:02:17] Not for 2 and a half hours or so [23:03:11] :) [23:04:02] MaxSem: I hopefully have monitoring working for solr [23:04:08] whee [23:04:11] I'm waiting for puppet to run on spence [23:04:13] which takes a while [23:04:27] here's my patch set: https://gerrit.wikimedia.org/r/#/c/42030/ [23:04:34] in case you want to read it over [23:05:20] New patchset: Ryan Lane; "Switch to using the deploy_redis returner" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42042 [23:05:28] notpeter, awesome, thank you so much [23:05:52] New patchset: Pyoungmeister; "temp assigning db61 to es1 shard for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42043 [23:07:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42042 [23:13:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42043 [23:20:27] !log maxsem synchronized php-1.21wmf7/extensions/MobileFrontend 'Revert MobileFrontend to the same revision as 1.21wmf6' [23:20:36] Logged the message, Master