[00:05:44] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused
[00:07:05] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.160 seconds
[00:13:13] New review: Aaron Schulz; "Can this be merged and improvements done later?" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7823
[01:12:38] PROBLEM - Puppet freshness on db1042 is CRITICAL: Puppet has not run in the last 10 hours
[01:32:44] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours
[01:32:44] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[01:32:44] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[01:41:35] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 262 seconds
[01:45:56] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds
[02:17:26] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused
[02:37:41] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[02:41:35] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.162 seconds
[02:46:41] RECOVERY - Puppet freshness on searchidx2 is OK: puppet ran at Wed Jun 6 02:46:12 UTC 2012
[02:47:16] what's going on?
[02:49:14] RECOVERY - Puppet freshness on db1042 is OK: puppet ran at Wed Jun 6 02:48:49 UTC 2012
[02:54:45] Jasper_Deng: sleep?
[02:54:48] i assume
[02:55:01] sleep of servers?
[02:55:03] maybe leslie excepted
[02:55:25] sleep of europeans and aliens visiting europe
[02:55:26] some particular squid and db servers have been complaining this whole afternoon
[02:55:37] yeah
[02:56:00] that's probably fine
[02:56:24] cp100[12] was mentioned earlier. idk if it was investigated but it's known and not new this evening
[02:56:33] the db's don't look important
[02:56:54] (certainly someone would have complained if they were masters)
[02:57:11] yeah, I was just wondering.
[02:57:47] k. my mind reading skills are out of practice
[03:08:53] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused
[03:26:17] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.141 seconds
[04:23:08] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused
[04:42:38] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours
[04:44:44] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused
[04:45:56] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.160 seconds
[04:58:59] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.113 seconds
[05:07:41] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:41] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[06:10:05] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused
[06:14:26] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused
[06:18:47] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.107 seconds
[06:59:59] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27398 bytes in 0.136 seconds
[07:04:08] good morning!
[07:04:53] morning
[07:12:13] morning
[07:17:58] New patchset: Mark Bergsma; "Add all remaining IPv6 LVS service IPs and services to the LVS balancers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10399
[07:18:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10399
[07:21:37] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10399
[07:21:39] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10399
[07:37:51] Logged the message, Master
[07:37:55] Logged the message, Master
[07:37:59] Logged the message, Master
[07:48:26] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused
[07:53:05] New patchset: Raimond Spekking; "Bug 37365: Install Narayam in Gujarati Wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10401
[07:53:12] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10401
[07:53:30] !log Converted geoiplookup.wikimedia.org into a separate, IPv4-only geodns record
[07:53:34] Logged the message, Master
[07:59:50] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[08:01:29] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.176 seconds
[08:02:22] ugh
[08:12:40] New patchset: Mark Bergsma; "Add first IPv6 LVS service monitoring to Nagios for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10402
[08:13:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10402
[08:13:14] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10402
[08:13:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10402
[08:16:47] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused
[08:21:08] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.114 seconds
[08:22:56] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[08:26:50] RECOVERY - mysqld processes on db1042 is OK: PROCS OK: 1 process with command name mysqld
[08:30:17] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 41120 seconds
[08:32:50] PROBLEM - MySQL Slave Delay on db1042 is CRITICAL: CRIT replication delay 40925 seconds
[08:41:15] !log Converted bits.wikimedia.org into a direct geodns record, removed the old bits -> bits-geo CNAME
[08:41:19] Logged the message, Master
[08:45:26] hey
[08:49:42] mark: how can I find a spare server in eqiad for v6relay?
[08:49:47] or what else can I do
[08:50:29] hi
[08:50:36] lemme find you one in a bit
[08:50:43] normally you should ask rob, who coordinates that
[08:52:48] New patchset: ArielGlenn; "one job queue across all wikis" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/10403
[08:53:47] so, what's the status?
[08:53:56] what are you working on?
[08:56:49] ok, take server 'nitrogen'
[08:56:58] i'll put it in the public vlan now
[08:56:59] RECOVERY - Puppet freshness on spence is OK: puppet ran at Wed Jun 6 08:56:46 UTC 2012
[08:58:31] ok, done
[08:58:39] you may need to put it in dns (rev/forward)
[08:58:54] i'm waiting on spence to finish its puppet run for nagios
[08:59:00] i've prepared the upcoming dns changes
[08:59:05] and I'm pretty much ready to go
[09:00:17] which IP nitrogen should take?
[09:00:27] do we manage allocations somehow?
[09:00:58] any free one in the respective subnet, 208.80.154.0/26
[09:01:01] we just use rev dns
[09:01:18] add v6 while you're at it ;)
[09:01:35] and hurry... I'm gonna do some important dns changes soon ;)
[09:01:51] gonna get a coffee, and then i'll start
[09:06:25] RECOVERY - Host capella is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[09:06:52] RECOVERY - SSH on capella is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[09:10:10] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused
[09:11:28] mark: :wq
[09:11:29] :P
[09:11:36] (you have wikimedia.org open)
[09:11:53] done
[09:13:35] * Damianz sits back and watches mark break wikipedia
[09:14:29] ~.
[09:15:23] oh? :)
[09:15:35] my connection broke
[09:16:53] http://i2.kym-cdn.com/photos/images/original/000/035/232/Internet-Don_t_worry_Tron.jpg
[09:16:56] * Damianz hides
[09:17:46] New patchset: Mark Bergsma; "Handle IPv6 LVS monitoring a bit differently, add bits/upload.pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10404
[09:18:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10404
[09:19:19] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10403
[09:19:21] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/10403
[09:20:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10404
[09:20:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10404
[09:20:41] New patchset: Faidon; "Add nitrogen to linux-host-entries" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10405
[09:21:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10405
[09:22:46] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.107 seconds
[09:23:38] paravoid: so are you done with dns?
[09:23:49] yep
[09:24:01] well, commited, not authdns-update yet
[09:24:10] then do that now
[09:24:19] or I can
[09:24:43] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10405
[09:24:45] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10405
[09:25:04] done.
[09:25:08] thanks
[09:25:19] lvs monitoring in nagios is teh suck
[09:25:29] we really need to fix that some day
[09:28:26] New patchset: Mark Bergsma; "Fix LVS checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10406
[09:28:40] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[09:28:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10406
[09:29:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10406
[09:29:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10406
[09:29:15] eh, what's this alert?
[09:29:17] did I break it?
[09:29:38] ns1 does not respond
[09:30:06] that's "normal"
[09:30:06] a bug
[09:30:08] just restart it
[09:30:18] happens often during dns updates, it's a deadlock bug
[09:30:23] fixed in newer pdns which we'll roll soon
[09:30:55] I restarted it (on linne)
[09:31:00] I did it as well :)
[09:31:02] heh
[09:31:28] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.030 seconds response time. www.wikipedia.org returns 208.80.154.225
[09:34:20] New patchset: Mark Bergsma; "Monitor the same IPv6 LVS services for eqiad and esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10407
[09:34:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10407
[09:35:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10407
[09:35:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10407
[09:37:49] New patchset: Faidon; "Make nitrogen a role::ipv6relay" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10408
[09:38:02] yikes
[09:38:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10408
[09:38:28] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10408
[09:38:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10408
[09:38:44] New patchset: Mark Bergsma; "Fix LVS checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10409
[09:39:05] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10409
[09:39:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10409
[09:39:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10409
[09:40:48] so, is it live?
[09:40:52] no
[09:41:01] how about now?
[09:41:03] ;)
[09:41:15] * mark shoots ryan
[09:41:18] no, it's not (a)live
[09:41:20] heh
[09:41:22] * apergos gets popcorn
[09:48:59] esams upload https is not working
[09:49:21] for ipv6, or at all?
[09:49:25] ipv6
[09:49:44] nginx doesn't listen on it
[09:49:56] sec
[09:50:01] on ssl3001...
[09:50:04] on 3002 it works
[09:50:35] only ssl3001
[09:50:48] perhaps we should just depool it
[09:50:55] it's a one-off host, with the ipv6.labs stuff on it
[09:50:56] should for now
[09:51:02] will you?
[09:51:05] sure
[09:51:10] !log depooling ssl3001
[09:51:14] hopefully 2 hosts is enough
[09:51:15] Logged the message, Master
[09:51:16] because 3004 is also down ;)
[09:51:20] yeah
[09:51:51] should I just remove the ipv6.labs stuff?
[09:52:03] I think that may still be used in enwiki's common.js or whatever it's called now
[09:52:08] ah
[09:52:09] we should probably get that removed soon
[09:53:03] I don't see why the ipv6 stuff isn't added for 3001
[09:54:00] Instead of an ipv6day can we have a fixeverything day, like you take those 200 bugs that are not important enough to dedicate time to but are annoying as fuck and just smash them in a day?
[09:54:04] Would be more productive :D
[09:54:33] which ops bugs would those be?
[09:54:57] Anything that was written betwean the hours of 11pm and 9am
[09:55:14] * Ryan_Lane shrugs
[09:55:18] Or we could have an 'implimentoauthday' ;)
[09:55:41] most of what you are talking about is dev and not ops
[09:56:22] * Damianz gives Ryan_Lane some od that devops thing
[09:56:42] Ops have commit access too :D
[09:57:26] alright
[09:57:34] it's time to give upload some ipv6 traffic ;)
[09:57:48] * Damianz hides for when you break his bot
[09:58:28] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused
[09:59:55] New patchset: Ryan Lane; "Fix ordering of includes so that ipv6 will work on ssl3001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10410
[10:00:00] !log Added AAAA record to upload.wikimedia.org
[10:00:06] Logged the message, Master
[10:00:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10410
[10:00:25] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10410
[10:00:28] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10410
[10:00:36] \o/
[10:00:48] I see traffic
[10:02:19] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[10:02:37] PROBLEM - mysqld processes on bellin is CRITICAL: Connection refused by host
[10:02:37] PROBLEM - MySQL Idle Transactions on bellin is CRITICAL: Connection refused by host
[10:02:37] PROBLEM - MySQL Slave Running on bellin is CRITICAL: Connection refused by host
[10:02:51] !log repooling ssl3001
[10:02:55] PROBLEM - MySQL Recent Restart on bellin is CRITICAL: Connection refused by host
[10:02:55] PROBLEM - MySQL disk space on bellin is CRITICAL: Connection refused by host
[10:02:55] Logged the message, Master
[10:03:22] PROBLEM - MySQL Replication Heartbeat on bellin is CRITICAL: Connection refused by host
[10:03:22] PROBLEM - NTP on bellin is CRITICAL: NTP CRITICAL: No response from NTP server
[10:03:38] Weirdly I can't connect to http but I wouldn't be surprised if my tunnel is broken.
[10:03:56] http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?servicegroup=lvs&style=detail
[10:03:58] * Damianz drop kicks the office firewall into the carpark
[10:04:30] !
[10:04:38] Oooh - you're using pybal magic for v6too? Gotta check that code out later.
[10:04:49] * Damianz waits for you to be boring and point out bgp doesn't care
[10:06:03] mark: not the time, but nitrogen's ready
[10:06:12] cool
[10:06:19] so you want me to add statics?
[10:06:22] PROBLEM - MySQL Slave Delay on bellin is CRITICAL: Connection refused by host
[10:06:22] PROBLEM - Full LVS Snapshot on bellin is CRITICAL: Connection refused by host
[10:06:40] PROBLEM - SSH on bellin is CRITICAL: Connection refused
[10:07:10] sure
[10:07:44] let me know which
[10:07:47] to which nexthop
[10:08:03] 2620:0:861:1:208:80:154:17
[10:08:08] nitrogen's AAAA :-)
[10:08:24] remind me, 2002::/16 and 2001::/32, right?
[10:09:10] yes
[10:10:28] Ohai
[10:10:58] paravoid: done
[10:11:01] thanks :-)
[10:11:08] please check if they do what they need to do ;)
[10:12:59] gah, I can't reassign tickets to myself
[10:13:04] that aren't unowned
[10:13:14] can I just upgrade myself to admin? :)
[10:13:29] you can
[10:13:33] noone can, in fact
[10:13:38] you first need to make them unowned
[10:13:40] then to yourself
[10:14:54] i hear no screams yet
[10:14:56] next: bits? :)
[10:16:49] you'd almost think we're not complete lunatics
[10:16:50] * Damianz imagines mark's desk with one of those big red usb buttons and an arrow pointing to it that reads 'abort'
[10:16:52] ah, no, you just hit the button called "steal"
[10:19:44] !log Added AAAA record to bits.wikimedia.org
[10:19:48] Logged the message, Master
[10:20:49] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[10:23:03] RECOVERY - SSH on bellin is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[10:23:03] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.160 seconds
[10:23:12] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[10:28:00] RECOVERY - MySQL disk space on bellin is OK: DISK OK
[10:29:12] RECOVERY - MySQL Slave Running on bellin is OK: OK replication
[10:29:21] RECOVERY - MySQL Idle Transactions on bellin is OK: OK longest blocking idle transaction sleeps for seconds
[10:29:21] RECOVERY - MySQL Recent Restart on bellin is OK: OK seconds since restart
[10:29:48] RECOVERY - MySQL Replication Heartbeat on bellin is OK: OK replication delay seconds
[10:30:06] RECOVERY - Full LVS Snapshot on bellin is OK: OK no full LVM snapshot volumes
[10:30:06] RECOVERY - MySQL Slave Delay on bellin is OK: OK replication delay seconds
[10:31:27] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused
[10:32:41] everything seems to work
[10:34:16] yeah
[10:34:18] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.186 seconds
[10:34:23] shall we do the rest? :)
[10:34:26] except mobile
[10:35:36] yes!
[10:35:41] alright then
[10:37:00] RECOVERY - NTP on bellin is OK: NTP OK: Offset 0.03532397747 secs
[10:37:02] !log Added AAAA records to all non-mobile wiki projects
[10:37:07] Logged the message, Master
[10:37:40] \o/
[10:37:49] awesome!
[10:37:53] oh god
[10:38:09] !log Wikipedia is IPv6-enabled.
[10:38:13] Logged the message, Master
[10:38:19] yay
[10:39:30] that gets a retweet from me!
[10:39:48] is twitter ipv6 enabled?
[10:39:51] Wth's the sal twitter again?
[10:40:13] no
[10:40:18] Damianz: @wikimediatech
[10:40:26] :)
[10:40:52] Aww ryan went with his own thing
[10:41:01] Hell I'll just spam my followers because I don't spam them enough
[10:41:27] wikimediatech gets spammed enough that I'd prefer it not have a billion followes
[10:41:30] *followers
[10:41:37] lol
[10:41:44] I don't have twitter
[10:41:47] That's why I don't follow it - I'd only ever see sal
[10:42:03] Might follow it on identica assuming you're OS friendly as I don't use identica that much.
[10:42:30] mark: If twitter get ipv6 will you have twitter?
[10:42:31] :D
[10:42:43] I do have twitter, so I'll gladly steal all of your credit :D
[10:43:27] RECOVERY - mysqld processes on bellin is OK: PROCS OK: 1 process with command name mysqld
[10:43:41] that's fine with me
[10:43:46] i'll gladly steal your salary instead
[10:43:50] :D
[10:43:51] you can have the credit
[10:44:00] You mean they pay him !?
[10:46:54] yup, some people are being paid, you know...
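(Editor's note.) The two static routes discussed above point 2002::/16 (6to4) and 2001::/32 (Teredo) at the relay's address on nitrogen. As background on why 2002::/16 covers every 6to4 client: RFC 3056 gives each public IPv4 endpoint the /48 formed by embedding its 32-bit address directly after the 2002 prefix. The sketch below illustrates that mapping with the standard-library `ipaddress` module; `sixto4_prefix` is an illustrative helper, not part of any script mentioned in this log.

```python
import ipaddress

def sixto4_prefix(ipv4: str) -> ipaddress.IPv6Network:
    """Return the 2002::/48 6to4 prefix derived from a public IPv4 address."""
    v4 = ipaddress.IPv4Address(ipv4)
    # RFC 3056: the top 16 bits are 0x2002, the next 32 bits are the
    # IPv4 address, and the remaining 80 bits belong to the site (/48).
    v6_int = (0x2002 << 112) | (int(v4) << 80)
    return ipaddress.IPv6Network((v6_int, 48))

# nitrogen's underlying IPv4 maps to a predictable 6to4 site prefix:
print(sixto4_prefix("208.80.154.17"))  # 2002:d050:9a11::/48
```

Any packet destined to a 2002::/... address can therefore be routed to a relay via the single 2002::/16 static, which is exactly what the conversation above sets up.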
[10:47:37] definitely not danny_b... ;-)
[10:48:01] meh. I should have waited to tweet. no one is awake in the US
[10:48:02] the smiley should have been :-(
[10:48:13] See if we didn't pay you we'd only have to run donations once a year (totally ripping off his if everyone donated 5$ or w/e)
[10:51:17] http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&limit=500&hideliu=1
[10:51:54] I'm monitoring the relays
[10:51:59] not too much of a traffic
[10:52:03] good
[10:52:10] people should get native ;)
[10:52:33] :)
[10:52:46] hmm, we put the v6 on different servers
[10:52:58] just lvs
[10:52:59] so I guess we can see the v6 traffic
[10:53:03] yes
[10:53:03] yes
[10:53:40] however there's more pybal monitoring traffic than actual ipv6 it seems ;-)
[10:54:28] heh, yeah
[11:01:17] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused
[11:04:47] so, I'd like to move LVS servers to separate ganglia groups
[11:04:52] any tips before I start digging?
[11:05:03] Heyo. Quick questions Do you guys know whether rows in the externallinks table are ever deleted?
[11:05:11] you might want a JCB
[11:05:22] declerambaul: in theory... they should be when they are removed from a page
[11:05:23] In theory
[11:05:29] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.195 seconds
[11:06:13] * Damianz offers paravoid tnt
[11:07:27] Reedy: Thanks. Does in theory mean that the data could be pretty noisy?
[11:08:09] Yeah
[11:08:21] I think linksupdate does it... But I'm not in a position to go digging in the code
[11:11:17] Nono that's fine thanks. Somebody did work on external link stats and it seemed that there is a lot of spam in there was likely removed from the actual wiki pages.
[11:12:42] !log Added AAAA record to mobile
[11:12:46] Logged the message, Master
[11:12:54] now i'm really done
[11:13:02] Awww I can't get on facebook with my ipv6 interface up
[11:15:51] yaay
[11:17:03] neat, I get 15ms less with ipv6
[11:17:21] to esams
[11:17:47] New patchset: Mark Bergsma; "Add mobile IPv6 LVS service IP to lvs1004" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10415
[11:17:48] :P
[11:18:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10415
[11:18:24] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10415
[11:18:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10415
[11:22:43] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused
[11:22:52] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 26 seconds
[11:24:04] RECOVERY - MySQL Slave Delay on db1042 is OK: OK replication delay 0 seconds
[11:25:25] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.177 seconds
[11:29:19] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused
[11:29:34] paravoid: for selective-answer.py...
[11:29:42] it would be good to be able to support prefixes instead of just /32 ips
[11:29:56] so we can blacklist an entire prefix instead of having to find out where their resolvers live
[11:30:10] ok, will fix
[11:30:13] i have a feeling we're gonna need that script in not too long ;)
[11:30:15] makes sense
[11:30:30] and perhaps make it v6 aware too
[11:30:55] we're not gonna make our auth servers answer on v6 just yet, but some day we probably will
[11:31:17] do we have geoip data for ipv6?
[11:31:36] not yet
[11:32:47] Nameservers are kinda a weird one, lots of people have broken ipv6 that just happens to get an ra then suddenly they can't resolve anything and kittens get worried... still the isps fault though
[11:33:47] mark: do you mind if I merge https://gerrit.wikimedia.org/r/#/c/9798/ before working on prefixes?
[11:33:58] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours
[11:33:58] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[11:33:58] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[11:34:04] go ahead, the script is effectively inactive now
[11:34:16] upload.esams.wikimedia.org is no longer used
[11:34:31] New patchset: Pyoungmeister; "adding in db1042 as s1 slave" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10416
[11:34:37] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10416
[11:35:10] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.185 seconds
[11:35:32] Change abandoned: Pyoungmeister; "ben is going to regen." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10227
[11:35:33] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9798
[11:35:36] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9798
[11:41:48] mark: gah, dobson is hardy... I wanted to use python-ipaddr
[11:41:59] yeah you should
[11:42:04] ah that was why I didn't do that then ;)
[11:42:06] not in hardy
[11:42:09] use it anyway
[11:42:10] well, I'll backport it
[11:42:19] we'll upgrade that box soon anyway
[11:42:21] mchenry is hardy too btw
[11:42:22] I can always use IPy, but I prefer IPAddr
[11:42:24] both are auth
[11:45:14] New patchset: Pyoungmeister; "changing searchidx partman conf to give larger root partition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10417
[11:45:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10417
[11:46:01] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10417
[11:46:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10417
[11:46:56] New patchset: Mark Bergsma; "Add mobile site v6 monitoring in eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10418
[11:46:58] paravoid: can I do a merge on sockpuppet?
[11:47:06] your stuff is in the queue
[11:47:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10418
[11:47:23] yes
[11:47:28] ok, cool
[11:47:59] (mine too ;)
[11:49:43] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused
[11:53:06] New patchset: Mark Bergsma; "Add monitoring for the mobile v4 site as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10419
[11:53:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10418
[11:53:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10419
[11:53:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10418
[11:55:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10419
[11:55:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10419
[11:55:16] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.174 seconds
[11:57:14] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused
[12:05:50] New patchset: Mark Bergsma; "Add remaining non-SSL LVS service monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10420
[12:06:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10420
[12:06:52] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.184 seconds
[12:10:37] New patchset: Mark Bergsma; "Add remaining non-SSL LVS service monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10420
[12:11:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10420
[12:11:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10420
[12:11:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10420
[12:12:55] capella's network just had a big bump
[12:13:05] who knows
[12:13:14] how big
[12:14:18] 4mbit
[12:14:22] from 0.8 or so
[12:14:41] nah, dropped again
[12:18:07] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused
[12:19:15] mark: I'm worried a bit of the efficiency of the subnet matching
[12:20:00] do we only have the relay deployed in one datacenter?
[12:22:01] New patchset: Mark Bergsma; "Rename mobile service IP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10422 [12:22:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10422 [12:22:53] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10422 [12:22:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10422 [12:23:55] Ryan_Lane: no, faidon added an eqiad one this morning [12:24:05] ah, so none in esams [12:24:11] nope [12:24:33] * Damianz wonders how long it will take for ipv6 to push a gb/s of not asshat traffic [12:25:06] esams is using a Telia 6to4 relay which is 10ms away [12:25:14] in london [12:25:23] * Ryan_Lane nods [12:25:46] then it's unlikely we'll see much of a bump in traffic on the relay until the US peak [12:26:32] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.168 seconds [12:26:49] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [12:28:53] mark: not surfnet's??? [12:28:56] that's crazy [12:29:13] we were hitting an Amsterdam relay from Florida but we're hitting a London relay from Amsterdam?! [12:29:32] bgp is nice eh [12:29:40] Bgp is magic [12:29:49] Also us peak?! 
My us team is already up lol [12:36:52] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.176 seconds [12:38:40] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [12:39:03] mark: http://www.worldipv6launch.org/form/?q=1 [12:39:12] who can fill that out to add us to http://www.worldipv6launch.org/participants/?q=1 [12:39:32] i don't know if we'll enable it permanently [12:41:00] ah ok [12:42:02] Wait until people don't yell about it being broken first lol [12:43:32] Thehelpfulone: also, it's way past the deadline (May 30th) [12:44:27] yeah I noticed that once I had posted it [12:46:28] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [12:54:49] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [12:56:10] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.161 seconds [13:07:34] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.257 seconds [13:14:09] New patchset: Mark Bergsma; "Combine IPv6 HTTP/HTTPs monitoring, monitor HTTPS as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10427 [13:14:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10427 [13:15:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10427 [13:15:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10427 [13:25:07] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [13:25:19] New patchset: Faidon; "selective-answer.py: support prefix matches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10428 [13:25:33] mark: ^^^ [13:25:42] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10428 [13:25:44] I've tested it and it works btw [13:27:03] cool ;) [13:27:12] I'll +1 [13:27:19] if you're really careful with the dns auth servers, you can deploy it imho [13:27:39] I have to backport python-ipaddr first [13:27:46] or upgrade to precise :P [13:28:05] I have a feeling you might not want me to "just upgrade" our auth NS [13:28:09] :P [13:28:12] hehe [13:28:14] no [13:28:23] hmm it's a linear search? [13:28:34] that'll do for a while [13:28:37] how else would you do it though? [13:28:41] although if the list grows too large a trie would be better [13:28:58] yeah, that's why I said earlier that I'm worried about performance [13:29:15] trie implementations for python do exist [13:29:19] hm, there's python-radix [13:29:21] I used one in combination with bgp.py I think [13:29:22] yeah that one [13:29:53] interesting [13:29:59] let's have a look [13:33:02] now I want v6 support in labs. [13:33:06] how hard can it be [13:33:12] can't we just edit some db table? :P [13:35:58] mark: We'll let you upgrade openstack, Ryan deserves a break from being yelled at for causing labs outages :D [13:38:01] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.217 seconds [13:42:44] New patchset: Faidon; "selective-answer.py: support prefix matches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10428 [13:42:50] mark: ^^^ :-) [13:43:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10428 [13:43:51] better :-) [13:43:56] how's radix in hardy? [13:44:55] it's there, an older version, not sure it works [13:44:56] I'll try [13:45:14] oh hm, this is a puppet repo, I should add the dependency! [13:45:22] New review: Mark Bergsma; "Nice. 
:)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/10428 [13:46:58] New patchset: Faidon; "selective-answer.py: support prefix matches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10428 [13:47:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10428 [13:49:39] funny, you already had prefixes in participants [13:49:42] they just never matched! [13:49:58] I did? [13:50:02] I don't think I added those [13:52:14] New patchset: Mark Bergsma; "Fix check_command name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10431 [13:52:33] mark: just tested on dobson [13:52:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10431 [13:52:50] installed python-radix by hand and have /root/selective-answer.py, /root/selective-test and /root/participants [13:52:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10431 [13:52:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10431 [13:53:01] ./selective-answer.py < selective-test [13:53:02] seems to work! [13:53:08] deploy it then [13:53:12] just be really careful [13:53:38] you did review it for me, don't you want to give the +1? :) [13:53:48] I did [13:53:56] I'll deploy it in a sec, I'll take a break [13:54:03] and won't attempt changing it before the break :) [13:54:05] ttyl [13:54:05] mark nagios is borking. message is: Error: Service check command 'check_https_lvs' specified in service 'LVS HTTPS IPv6' for host 'wiktionary-lb.pmtpa.wikimedia.org' not defined anywhere! [13:54:12] times 32 [13:54:13] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/10428 [13:54:24] notpeter: just fixed that [13:54:30] mark: cool! 
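[Editor's note: selective-answer.py itself isn't quoted in this log, so the following is only an illustrative sketch of the linear prefix scan being discussed: each query address is checked against every participant prefix, which is O(n) per lookup; as mark notes, a radix trie (e.g. the python-radix module mentioned above) scales better once the list grows. The PARTICIPANTS entries and function name are made-up examples, not the real file's contents; stdlib `ipaddress` stands in for the python-ipaddr backport:]

```python
import ipaddress

# Made-up example prefixes; the real participants file's contents and
# format are not shown in this log.
PARTICIPANTS = [
    ipaddress.ip_network("2001:db8::/32"),
    ipaddress.ip_network("2001:db8:1234::/48"),
]

def longest_match(addr):
    """Return the most specific participant prefix containing addr, or None.

    Linear scan: O(n) per query. Fine while the list is short; a radix
    trie (e.g. py-radix) would give faster lookups if it grows large.
    """
    ip = ipaddress.ip_address(addr)
    best = None
    for net in PARTICIPANTS:
        # Mixed v4/v6 comparisons are simply False, never a match
        if ip in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best

print(longest_match("2001:db8:1234::1"))  # 2001:db8:1234::/48
print(longest_match("2001:db9::1"))       # None
```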
[14:17:28] http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?servicegroup=lvs&style=detail is now complete [14:18:19] woow [14:19:13] what about search? [14:19:44] search? [14:19:53] is it ipv6 enabled? [14:20:01] search is internal [14:20:03] or does the outside world never hit it [14:20:05] ah. right [14:20:06] no [14:20:14] some eqiad ipv4 lvs is not listed yet [14:20:15] it goes through the api [14:20:18] yeah [14:21:17] Totally should use ipv4 everywhere and just have a few public ipv4 addresses on the active lb instances :D [14:21:21] s/ipv4/ipv6/ [14:21:28] wikibooks in eqiad isn't showing ipv4 https [14:21:44] neither is wikinews [14:22:13] that's what I said [14:22:57] heh. I misread some [14:22:58] I need a nap [14:23:02] me too [14:24:18] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [14:24:36] have a look at lvs.pp to see why ;) [14:24:40] I did it slightly differently now [14:24:48] but still need to migrate the old v4 stuff [14:37:09] ah. ok [14:37:53] New patchset: Mark Bergsma; "Make IPv6 LVS service monitoring critical" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10435 [14:38:12] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/10435 [14:39:54] New patchset: Mark Bergsma; "Make IPv6 LVS service monitoring critical" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10435 [14:40:03] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.280 seconds [14:40:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10435 [14:40:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10435 [14:40:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10435 [14:43:21] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [14:53:33] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [14:56:33] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [14:59:58] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.161 seconds [15:08:49] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [15:09:39] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10428 [15:09:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10428 [15:11:40] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.207 seconds [15:14:04] ugh oh [15:14:43] I think i just broke dns [15:14:46] mark: around? [15:14:55] how so? [15:15:03] what's broken about it? [15:15:14] i pushed a change to selective answer and now I don't get replies from the NS [15:15:19] even though I don't understand why [15:16:14] oh no [15:16:17] pebcak [15:16:18] phew. [15:16:22] hehe [15:16:39] I had like 10 terminals open [15:16:44] worrying that this might happen [15:16:51] that I actually misread and thought it happened :) [15:16:55] sorry for the noise. 
[15:17:16] I keep hearing mark in my head saying how I should break the site to become a proper member of the team :P [15:17:21] New patchset: Jgreen; "puppet voodoo to rename user mmullie-->mlitn per RT 3080" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10441 [15:17:37] I'm getting replies from dns [15:17:43] yes [15:17:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10441 [15:17:45] is it only ipv6 that's broken? [15:17:47] nevermind [15:17:49] nothing's broken [15:17:52] except my head [15:17:52] :D [15:18:57] the unrelated puppet errors didn't help me not panic :) [15:19:02] can someone take a look at the puppet user rename foo I'm about to merge? [15:19:46] if I knew it was a false alarm, I would have totally made the bot screw with you [15:20:26] Ryan_Lane: yes because we need more trauma around here :-] [15:20:41] we at least need more pranking :) [15:20:45] mark: so, the changes are in; whenever you want we can switch into a blacklist [15:20:52] and put prefixes in [15:20:58] yes [15:21:01] i'm just writing a mail [15:21:05] I propose to keep it as is now [15:21:11] that record in it is now unused, so can be used for testing [15:21:15] domas replied that there is no shared blacklist after all [15:21:18] then if we need to, we just change it into an active record [15:21:24] ok [15:23:49] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [15:23:57] should I install a v6relay in esams too? [15:24:14] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10441 [15:24:16] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10441 [15:24:26] meh [15:24:28] there are so many around [15:24:34] let's deal with that later :) [15:24:44] okay [15:24:48] wanna do bgp? [15:24:51] for relays? 
[15:24:52] not today [15:24:56] okay :) [15:25:01] I can attempt to do it myself [15:25:03] and thus not for the rest of the week :) [15:25:18] if you add me to the junipers [15:25:26] even though I'd prefer if you'd be around [15:25:32] I wouldn't be surprised if the statics actually disappear when the nexthops don't arp [15:25:32] so, not for the rest of the week :-) [15:25:38] yeah let's do so next week [15:25:42] sure [15:25:47] are you taking Thu/Fri off? [15:25:47] i'm sure it'll be fine ;) [15:25:50] mostly [15:25:56] i'll pay attention but won't do work [15:25:59] i'm tired :) [15:26:03] yeah, same here [15:26:10] I stayed up until the actual launch yesterday too [15:26:14] which was 3am :) [15:26:18] hehe yeah [15:26:28] plus another half an hour for the traffic to appear [15:26:35] see ops list [15:28:41] * domas points at https://www.facebook.com/notes/facebook-engineering/under-the-hood-network-implementation-for-world-ipv6-launch/10150873176303920 [15:29:14] oh, cool, thanks! [15:29:34] nice [15:31:06] surprisingly detailed [15:32:28] domas: btw, does mysql 5.6 finally support ipv6? [15:32:45] probably not [15:35:49] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:39:20] mark, I will likely be around only half days thurs and fri (I guess you are CT's proxy) because I will have out of town guests. You know them... 
Berlin hackathon refugees :-P [15:39:31] hehehe [15:41:13] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.277 seconds [15:52:55] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [15:55:52] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:40] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.272 seconds [16:23:01] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [16:27:31] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [16:31:52] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.163 seconds [16:43:25] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.204 seconds [16:53:37] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [16:55:42] hi [16:55:46] hi! [16:55:57] mark - just read your email [16:56:19] woot! ipv6 went out [16:56:40] hi jeremyb [16:57:55] haha, you missed the boat on that one [16:58:14] * jeremyb is still catching up [16:58:53] ya … [16:58:58] excited though [16:59:05] that was the first email I read [17:00:44] is "Increased uselessly low $wgBlockCIDRLimit default for IPv6" going out? [17:00:48] !g 10387 [17:00:48] https://gerrit.wikimedia.org/r/10387 [17:02:42] !log mailman 'site' password changed per RT 3039 [17:02:47] Logged the message, Master [17:03:30] jeremyb: I just approved it [17:05:03] Jeff_Green: and casey's (kibble) password is unchanged? [17:05:29] that's a list admin password? yeah I didn't touch that level [17:05:44] Jeff_Green: mmsitepass has a flag for list creator [17:06:03] I did not do list creator [17:06:07] k [17:06:41] is casey using that password, or is there a per-list admin password? [17:07:15] every list has a list admin and moderator pass. 
there's also global admin and creator passes [17:07:23] i think creator can only be used to create [17:07:25] ok that's what I thought [17:07:36] site admin can be used just about anywhere a pass is needed [17:07:42] right [17:07:56] list admin can be used anywhere moderator is needed. moderator is the least privileged [17:08:14] k [17:08:48] RoanKattouw: erm, that should be reported in #mediawiki no? [17:08:49] afaik the request was only for site admin, but it was a little unclear [17:09:21] Grr did someone quiet the bot again? [17:09:41] which bot? [17:10:56] gerrit-wm [17:11:06] I just unquieted it [17:11:10] so many bots [17:11:32] but it wasn't quieted since it last spoke? afaict [17:11:46] at least not with the mask you removed [17:13:54] Ryan_Lane: get the SIM all worked out? [17:14:01] no [17:14:01] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.187 seconds [17:14:43] ;( [17:24:48] Jeff_Green: who requested for it to be changed (the site password)? [17:25:06] philippe [17:25:24] mark: ping [17:25:27] ah that's fine, else I was going to tell him :) [17:25:39] Thehelpfulone: :-) [17:25:39] notpeter: ping [17:25:54] Ryan_Lane: ping [17:30:04] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [17:30:55] preilly: yes? [17:32:43] New patchset: awjrichards; "Config change for rt 3073, enables wiki session cookies to be passed from varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10459 [17:33:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10459 [17:33:12] notpeter: can you push a varnish change once awjr gets it done? [17:33:34] sure [17:33:38] notpeter: e.g., this one — https://gerrit.wikimedia.org/r/#/c/10459/ [17:33:41] preilly, notpeter: it is done: https://gerrit.wikimedia.org/r/#/c/10459/ [17:33:43] o [17:33:50] awjr: ha ha [17:34:22] does that mean purge cache too? 
[17:34:34] jeremyb: no [17:35:50] so, is it done? [17:35:53] good to go out? [17:36:02] preilly why is the varnish config so selective about cookies it passes? [17:36:23] notpeter, it's done, just needs review and push [17:36:36] ok [17:37:08] New review: Pyoungmeister; "both patrick and arthur said this was good to go out, so I'm going to push it live!" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10459 [17:37:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10459 [17:37:38] ok, it has been merged on sockpuppet [17:41:06] thanks notpeter [17:41:32] no prob [17:44:37] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [17:44:46] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.197 seconds [18:00:43] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [18:02:49] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.160 seconds [18:08:40] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [18:14:40] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.190 seconds [18:17:23] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/9109 [19:04:43] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [19:05:41] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.165 seconds [19:18:21] Logged the message, Master [19:22:02] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [19:40:20] RECOVERY - SSH on storage3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:40:29] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:40:47] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay seconds [19:40:56] RECOVERY - MySQL disk space 
on storage3 is OK: DISK OK [19:42:08] RECOVERY - Puppet freshness on storage3 is OK: puppet ran at Wed Jun 6 19:41:54 UTC 2012 [19:46:07] HOLY MIRACLE OF MIRACLES! [19:46:12] cmjohnson1: works [19:46:34] woah [19:46:54] * Jeff_Green dies, then comes back from the dead and demands brains. [19:55:31] Logged the message, Master [19:55:45] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [19:59:57] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [20:05:57] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.167 seconds [20:14:39] RECOVERY - Host search32 is UP: PING WARNING - Packet loss = 80%, RTA = 0.25 ms [20:16:18] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.202 seconds [20:30:24] PROBLEM - Puppet freshness on bellin is CRITICAL: Puppet has not run in the last 10 hours [20:35:08] cmjohnson1: about to go to bed, but what's up? [20:36:59] no prob! [20:47:03] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [21:06:44] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.163 seconds [21:08:37] As requested, I have created https://www.mediawiki.org/wiki/GerritShouldDieInAFire [21:15:53] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [21:17:03] dschoon: <3 name [21:18:44] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.315 seconds [21:21:47] New patchset: Jgreen; "change aluminium/grosley default mysql client charset from latin1 to binary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10534 [21:22:10] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10534 [21:22:35] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10534 [21:22:38] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10534 [21:24:59] New patchset: Jgreen; "stupid permissions fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10535 [21:25:22] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10535 [21:25:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10535 [21:25:22] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10535 [21:31:16] New patchset: Jgreen; "disabling cron scripts on storage3 for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10536 [21:31:38] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10536 [21:31:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10536 [21:31:39] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10536 [21:35:14] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [21:35:14] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [21:35:14] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [21:35:23] RECOVERY - mysqld processes on storage3 is OK: PROCS OK: 1 process with command name mysqld [21:39:44] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 2683934 seconds [21:46:29] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [22:09:07] New patchset: Platonides; "Convert http:// links in protocol-relative ones at MobileFrontend variables. $wgMFFeedbackFallbackURL seems unused, though." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10537 [22:09:13] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10537 [22:19:52] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.171 seconds [22:27:50] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9597 [22:27:52] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9597 [22:39:40] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [22:42:25] New review: preilly; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10537 [22:42:27] Change merged: preilly; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10537 [22:50:55] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/python /usr/bin/swift-container-auditor [23:08:28] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [23:09:49] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.160 seconds [23:10:25] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:20:46] dschoon: do you ever go to these events: http://www.meetup.com/San-Francisco-Metrics-Meetup/events/64435452/ ? [23:36:23] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [23:40:43] RECOVERY - Backend Squid HTTP on cp1002 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.162 seconds