[00:16:56] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3036
[00:19:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:25:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.896 seconds
[00:59:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:01:26] PROBLEM - Host ssl3001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:26] PROBLEM - Host ssl3003 is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:35] PROBLEM - Host wikibooks-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:36] PROBLEM - Host wikinews-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:37] PROBLEM - Host wikipedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:37] PROBLEM - Host wikisource-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:02:02] RECOVERY - Host ssl3003 is UP: PING OK - Packet loss = 0%, RTA = 120.02 ms
[01:02:02] RECOVERY - Host ssl3001 is UP: PING OK - Packet loss = 0%, RTA = 118.80 ms
[01:02:11] RECOVERY - Host wikibooks-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 119.89 ms
[01:02:20] RECOVERY - Host wikipedia-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 118.96 ms
[01:02:38] RECOVERY - Host wikisource-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 118.06 ms
[01:02:56] RECOVERY - Host wikinews-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 119.19 ms
[01:05:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.889 seconds
[01:07:17] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:17:20] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[01:18:14] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours
[01:18:14] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours
[01:19:17] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[01:19:17] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours
[01:19:17] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours
[01:22:17] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours
[01:22:17] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours
[01:22:17] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours
[01:22:17] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours
[01:23:11] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 17.3815615179 (gt 8.0)
[01:23:20] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours
[01:26:20] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[01:26:20] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[01:31:26] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.371558125
[01:32:02] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.00294721739 (gt 8.0)
[01:36:14] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours
[01:38:02] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.287232807018
[01:40:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:44:20] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours
[01:45:14] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[01:46:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.177 seconds
[01:55:17] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours
[01:55:17] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:18] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:18] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:19] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:19] PROBLEM - Puppet freshness on knsq25 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:20] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:17] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:17] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:17] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:17] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:17] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours
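The "Puppet freshness" floods above come from a check that goes CRITICAL when a host's last Puppet run is older than a threshold (10 hours here). A minimal sketch of that logic in Python; the function name and signature are illustrative, not the actual Nagios plugin:

```python
import time

FRESHNESS_THRESHOLD = 10 * 3600  # seconds; the "10 hours" quoted in the alerts


def puppet_freshness(last_run_epoch, now=None, threshold=FRESHNESS_THRESHOLD):
    """Return a (state, message) pair in the spirit of a Nagios check.

    last_run_epoch: Unix timestamp of the host's last completed Puppet run.
    """
    now = time.time() if now is None else now
    age = now - last_run_epoch
    if age > threshold:
        return ("CRITICAL", "Puppet has not run in the last 10 hours")
    return ("OK", "puppet ran %d seconds ago" % age)
```

Because the check measures staleness rather than a live failure, a broken puppetmaster (like the flapping `Puppetmaster HTTPS on stafford` above) shows up roughly ten hours later as a wave of per-host freshness alerts.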
[02:10:17] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:18] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:18] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:19] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:19] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:20] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:20] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[02:10:21] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:21] PROBLEM - Puppet freshness on ssl3004 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:21] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:21] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:22] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[02:15:14] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours
[02:16:17] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours
[02:16:17] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours
[02:17:20] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours
[02:17:20] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours
[02:17:20] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours
[02:19:17] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:25:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.789 seconds
[06:33:22] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours
[06:37:16] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours
[06:41:19] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[06:46:48] PROBLEM - Puppet freshness on virt1 is CRITICAL: Puppet has not run in the last 10 hours
[06:56:51] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[07:40:36] PROBLEM - Puppet freshness on mw53 is CRITICAL: Puppet has not run in the last 10 hours
[08:34:14] New patchset: ArielGlenn; "add 10.64.16 to hosts for common/httpdconf sync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3106
[08:34:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3106
[08:35:47] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3106
[08:35:50] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3106
[09:26:25] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:28:22] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[09:36:37] PROBLEM - Host db1040 is DOWN: PING CRITICAL - Packet loss = 100%
[09:38:43] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[09:38:43] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours
[09:51:37] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:34] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[10:16:59] !log Rebooting manutius with newer 2.6.36 kernel to attempt avoiding i/o kernel bug with torrus
[10:17:04] Logged the message, Master
[10:34:11] New patchset: Mark Bergsma; "Do HTCP loss monitoring on the upload eqiad servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3107
[10:34:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3107
[10:34:48] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3107
[10:34:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3107
[10:36:08] apergos: why not 10.64.16.0/22?
[10:36:30] how would I know which it is?
[10:36:39] sorry, I didn't know where to look for that
[10:36:42] from the router config for example
[10:36:51] or from any host in that subnet
[10:37:03] lemme go look at one
[10:37:29] I didn't deploy it yet, there's undeployed stuff of leslie's, I didn't know if it could go so I sent an email
[10:37:35] i am deploying it right now
[10:37:38] ok
[10:37:52] if it can't go in she shouldn't have merged it
[10:38:06] what am I supposed to do, just wait for 8 hours until she gets back and not do anything myself? ;)
[10:38:21] well cherry pick yours I guess :-P
[10:38:28] hell no
[10:38:38] we shouldn't ever cherry pick on there
[10:38:43] that causes conflicts and shit
[10:38:48] ok I see it's 22 by looking at one of the hosts
[10:38:52] thank you for that
[10:39:06] you might as well add .32.0/22
[10:39:09] well that is why I didn't cherry pick, I wasn't sure about the consequences for later
[10:39:11] that's gonna be row C soon
[10:39:13] ok
[10:39:23] I'll do that right now
[10:39:31] or, even better
[10:39:43] you can make a puppet list of private production subnets in network.pp
[10:39:47] and reference that from the template
[10:40:11] we already have the overarching prefixes in there
[10:40:15] just not individual subnets
[10:41:30] in the network constants you mean?
[10:41:35] yes
[10:42:09] so what's the easy way to generate that list?
[10:42:19] I don't like the easy way, I want the proper way ;p
[10:42:26] ok. so what's the proper way?
[10:42:41] I think it should be a hash containing realm, site, public/private
[10:43:03] um, you're answering a different question
[10:43:18] so subnets => { 'production' => { 'pmtpa' => { 'public' => [ "10.0.0.0/16" ] } } }
[10:43:29] it's on the routers
[10:43:33] ok
[10:43:40] good practice ;)
[10:43:45] I'll take it
[10:43:45] also v4/v6 perhaps
[10:43:49] as not all tools will support both
[10:43:55] i'm happy to review what you have
[10:44:02] you'll be reviewing it all right
[10:44:09] at least someone sure will
[10:44:31] I review everything
[10:44:35] just not always before merge ;)
[10:44:42] this one's gonna be before merge
[10:44:49] ok so I'm sorry to keep asking dumb questions but
[10:46:15] thanks
[10:46:19] distinguish between public, private, and labs I guess
[10:46:19] although labs is also a different realm
[10:46:31] yeah, it's a realm, so nevermind
[10:46:38] so public and private?
[10:46:47] well as a realm it can have public and private
[10:46:58] realms are: production, labs, fundraising
[10:47:05] within those we have public/private subnets
[10:47:15] ok
[10:47:15] and datacenters (sites)
[10:47:24] that should be reflected in the hash structure
[10:47:52] so any time someone adds a subnet they need to remember to go to puppet and change it there too
[10:47:57] (in the future)
[10:48:06] yes
[10:48:13] in one place, instead of 50 config files
[10:48:18] heh
[10:49:20] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 283 MB (3% inode=61%): /var/lib/ureadahead/debugfs 283 MB (3% inode=61%):
[10:49:25] New patchset: Mark Bergsma; "include nagios::configuration so $master_hosts can be referenced" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3108
[10:49:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3108
[10:49:45] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3108
[10:49:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3108
[10:50:35] btw if you're not sure someone's undeployed change is gonna break something, you can of course always revert it
[10:50:39] they can revert that again later
[10:50:47] I guess that's the best way of going at it
[10:50:57] leslie's change just broke something, i'm fixing it now, but I could have reverted it instead
[10:51:26] RECOVERY - Disk space on srv220 is OK: DISK OK
[10:54:47] ok, I didn't think of that but it makes sense
[10:55:22] New patchset: Mark Bergsma; "Fix varnishhtcpd path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3109
[10:55:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3109
[10:55:48] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3109
[10:55:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3109
[11:02:13] !log Rebooting lvs1002 with kernel updates
[11:02:15] brb
[11:02:16] Logged the message, Master
[11:08:50] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:18:53] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[11:19:47] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours
[11:19:47] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours
[11:20:50] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[11:20:50] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours
[11:20:50] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours
[11:23:50] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours
[11:23:50] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours
[11:23:50] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours
[11:23:50] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours
[11:24:53] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours
[11:27:53] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[11:27:53] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[11:32:16] New patchset: Mark Bergsma; "Try a dynamic lookup, global is not working" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3110
[11:32:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3110
[11:32:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3110
[11:32:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3110
[11:37:47] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours
[11:40:19] New patchset: Mark Bergsma; "Install socat for unicast->multicast relaying" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3111
[11:40:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3111
[11:42:21] what is the "sandbox subnet"?
[11:42:23] 208.80.152.228/27
[11:43:36] and do the virt hosts (virt1-4) count as labs or something else?
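The socat package installed above is for the unicast-to-multicast relaying mentioned in the `!log` entries that day (oxygen relaying squid log lines to the GLOP group 233.58.59.1). A rough Python sketch of what such a relay does; the actual socat invocation is not shown in this log, and the port number here is a hypothetical stand-in:

```python
import socket

MCAST_GROUP = "233.58.59.1"  # GLOP group from the !log entry
MCAST_PORT = 5678            # hypothetical; the real port is not in the log


def make_relay_sockets(listen_port, ttl=8):
    """Create the receive (unicast) and send (multicast) UDP sockets."""
    recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv_sock.bind(("", listen_port))  # unicast log datagrams arrive here
    send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL > 1 so the datagrams can cross routers to the other datacenter
    send_sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return recv_sock, send_sock


def relay_forever(recv_sock, send_sock):
    """Forward every datagram received on recv_sock to the multicast group."""
    while True:
        data, _addr = recv_sock.recvfrom(65535)
        send_sock.sendto(data, (MCAST_GROUP, MCAST_PORT))


if __name__ == "__main__":
    relay_forever(*make_relay_sockets(MCAST_PORT))
```

The design point is that the squids only need to send plain unicast UDP to one relay host (oxygen), which then fans the stream out to any number of multicast subscribers.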
[11:45:53] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours
[11:46:47] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[11:47:23] New patchset: Mark Bergsma; "Migrate CDN logging to our GLOP multicast address range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3112
[11:47:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3112
[11:48:05] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3111
[11:48:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3111
[11:48:28] mark?
[11:48:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3112
[11:48:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3112
[11:48:41] yes?
[11:48:50] ah
[11:48:57] sandbox subnet is a separate public subnet
[11:49:03] and the virt hosts are in production
[11:49:06] the virt guests are not
[11:49:24] ok
[11:53:14] if you put them in a hash
[11:53:21] give the subnet names as well, I think
[11:53:24] that would be most flexible
[11:53:32] the eqiad subnets are all named public1-a-eqiad or private1-c-eqiad etc
[11:53:35] the tampa ones are a bit messy
[11:53:48] but they're called internal, pub-services, pub-services2, sandbox, squid-lvs
[11:53:50] etc
[11:54:07] uh huh, I think I have the names
[11:54:19] I don't have anything for fundraiser so someone else will have to add that stuff
[11:54:30] there is no separate subnet for that yet
[11:54:48] ok then I will not make a stanza for it
[11:56:41] New patchset: Mark Bergsma; "Subscribe to upstart job changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3113
[11:56:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3113
[11:57:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3113
[11:57:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3113
[11:59:56] !log Setup squid logging to oxygen, with oxygen relaying to multicast 233.58.59.1
[11:59:59] Logged the message, Master
[12:58:25] !log Seeding the eqiad upload caches from live upload requests
[12:58:28] Logged the message, Master
[13:23:15] mark, how does this look? http://p.defau.lt/?t221vaRygUNaRL1SvO1zZQ
[13:24:57] pretty awesome on first sight
[13:25:05] i'll have to look at it a bit better though
[13:25:59] sure
[13:26:36] and now you'll need to find a way to "flatten" that hash down into a list in the config files
[13:26:43] yeah
[13:26:45] it would be awesome if you could specify it at any level
[13:26:47] I was thinking that
[13:26:56] say $all_network_subnets['production']
[13:27:12] I don't know how we would do that
[13:27:17] i'm just wondering about how to filter out ipv4 or ipv6
[13:27:24] well it can be done with ruby i'm sure
[13:27:49] oh the ipv6 subnets are wrong
[13:27:54] heh
[13:27:56] that should just be 2620:0:861:1::/64
[13:28:05] ok, I didn't know about that
[13:28:09] I tried looking up the syntax
[13:28:14] guess that was a fail
[13:28:38] but overall this looks pretty good
[13:28:45] lemme fix those
[13:28:57] why not commit this and then try to get it into a config such as rsync
[13:29:11] oh, I cleaned up some but not others. I see
[13:30:31] ok well I will commit this today and then I think I will try rsync tomorrow, so I can conceivably get other things done today
[13:31:30] New patchset: Mark Bergsma; "Swift response times are problematic, request only from Squid for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3114
[13:31:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3114
[13:31:45] I "normalized" the vlan names so they don't have caps and spaces in them as some do on the routers, hope that's ok
[13:31:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3114
[13:31:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3114
[13:32:59] that's great
[13:50:05] New patchset: ArielGlenn; "hash of all subnets in network constants" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3115
[13:50:15] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3115
[13:58:45] !log Sending traffic from Argentina to upload-lb.eqiad
[13:58:48] Logged the message, Master
[14:32:58] !log Sending traffic from Brazil to upload-lb.eqiad
[14:33:02] Logged the message, Master
[14:42:26] New patchset: Hashar; "hash of all subnets in network constants" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3115
[14:42:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3115
[14:42:52] apergos: I have fixed your change :)
[14:43:04] thanks
[14:43:16] I did not look at it after the push
[14:43:32] that is what I thought
[14:43:34] so I didn't even check if lint liked it
[14:43:45] it is in mark's queue for review
[14:44:07] that was a good exercise to play with git-review / git
[14:44:26] are you feeling well? :-P
[14:44:42] I like gerrit :-))))))))
[14:46:03] it is not growing on me
[14:46:12] unless it's growing on me like mold :-P :-P
[14:51:18] !log Sending traffic from Canada to upload-lb.eqiad
[14:51:21] Logged the message, Master
[15:12:55] !log manually deleted cp1025 info from nagios config file - nagios restored for now
[15:12:58] Logged the message, Mistress of the network gear.
[15:13:28] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours
[15:14:40] PROBLEM - Host db1020 is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:29] New patchset: Mark Bergsma; "Fix LVS setup of payments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3116
[15:15:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3116
[15:17:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3116
[15:17:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3116
[15:22:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 7, down: 1, shutdown: 0; Peering with AS64600 not established
[15:24:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.710 seconds
[15:27:01] !log Rebooting lvs1005 with upgraded kernel/packages
[15:27:04] Logged the message, Master
[15:28:32] !log Sending traffic from the USA to upload-lb.eqiad
[15:28:35] Logged the message, Master
[15:29:04] PROBLEM - Host lvs1005 is DOWN: PING CRITICAL - Packet loss = 100%
[15:30:34] RECOVERY - Host lvs1005 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms
[15:31:01] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 8, down: 0, shutdown: 0
[15:31:46] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[15:31:58] fixing
[15:32:40] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.148 seconds response time. www.wikipedia.org returns 208.80.154.225
[15:36:25] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.92723975 (gt 8.0)
[15:47:38] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 235 MB (3% inode=61%): /var/lib/ureadahead/debugfs 235 MB (3% inode=61%):
[15:50:47] RECOVERY - Disk space on srv219 is OK: DISK OK
[15:52:17] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.7209579167 (gt 8.0)
[15:54:25] New patchset: Lcarr; "Cleaning up icinga config Moved files from nagios3 directory, notify proper service, etc" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3117
[15:54:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3117
[15:58:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:01:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.398 seconds
[16:02:41] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3117
[16:02:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3117
[16:05:29] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 280 MB (3% inode=61%): /var/lib/ureadahead/debugfs 280 MB (3% inode=61%):
[16:16:23] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: NRPE: Unable to read output
[16:16:23] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 0 MB (0% inode=0%):
[16:16:23] PROBLEM - Disk space on ms1002 is CRITICAL: DISK CRITICAL - free space: /export/upload 62299 MB (0% inode=87%):
[16:16:44] PROBLEM - Memcached on marmontel is CRITICAL: Connection refused
[16:16:51] PROBLEM - MySQL Replication Heartbeat on db49 is CRITICAL: NRPE: Unable to read output
[16:16:59] PROBLEM - Memcached on srv254 is CRITICAL: Connection refused
[16:16:59] PROBLEM - mysqld processes on db1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[16:17:08] PROBLEM - RAID on virt1 is CRITICAL: CRITICAL: Degraded
[16:17:26] PROBLEM - mysqld processes on db56 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[16:17:26] PROBLEM - MySQL replication status on es1002 is CRITICAL: (Return code of 255 is out of bounds)
[16:17:35] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[16:17:35] PROBLEM - MySQL master status on es1001 is CRITICAL: CRITICAL: Read only: expected OFF, got ON
[16:17:44] PROBLEM - mysqld processes on db1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[16:17:44] PROBLEM - Disk space on db1047 is CRITICAL: DISK CRITICAL - free space: /a 6895 MB (0% inode=99%):
[16:17:44] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 403 Forbidden
[16:17:44] PROBLEM - MySQL slave status on es1002 is CRITICAL: CRITICAL: Lost connection to MySQL server at reading initial communication packet, system error: 111
[16:17:44] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:17:53] PROBLEM - LDAP on nfs1 is CRITICAL: Connection refused
[16:17:53] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:18:02] PROBLEM - MySQL disk space on db1047 is CRITICAL: DISK CRITICAL - free space: /a 6894 MB (0% inode=99%):
[16:18:02] PROBLEM - MySQL Replication Heartbeat on db48 is CRITICAL: NRPE: Unable to read output
[16:18:02] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 0 MB (0% inode=0%):
[16:18:11] PROBLEM - LDAPS on nfs1 is CRITICAL: Connection refused
[16:18:20] PROBLEM - Backend Squid HTTP on knsq25 is CRITICAL: Connection refused
[16:19:23] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 80 MB (1% inode=61%): /var/lib/ureadahead/debugfs 80 MB (1% inode=61%):
[16:21:29] RECOVERY - Disk space on srv220 is OK: DISK OK
[16:22:32] RECOVERY - Disk space on ms1002 is OK: DISK OK
[16:24:20] PROBLEM - Lucene on searchidx1001 is CRITICAL: Connection refused
[16:36:20] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.6114966667 (gt 8.0)
[16:36:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:37:39] !log reinstalling neon
[16:37:42] Logged the message, Mistress of the network gear.
[16:39:11] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours
[16:40:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.863 seconds
[16:43:14] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[16:44:41] !log Sending traffic from Japan, India, Mexico to upload-lb.eqiad
[16:44:45] Logged the message, Master
[16:52:41] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.79082958333 (gt 8.0)
[16:59:36] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[17:00:52] New patchset: Jgreen; "pgehres storage3 shell access per RT 2610" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3118
[17:01:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3118
[17:01:30] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3118
[17:01:33] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3118
[17:12:18] !log Sending all normally-pmtpa upload traffic to upload-lb.eqiad
[17:12:21] Logged the message, Master
[17:13:46] mark, there was a spike earlier .. what caused it?
[17:13:57] no idea [17:14:03] on upload i mean [17:14:23] oh perhaps my testing? [17:14:36] my testing of live traffic from the squids [17:15:05] otherwise, the load barely exercising those servers [17:15:31] hmm starting to climb a little now [17:15:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.772 seconds [17:16:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:15] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.049 seconds [17:23:45] yeah [17:23:53] New patchset: Bhartshorne; "changing lvs and nagios to check for a file in swift directly rather than going through the swift rewrite stuff for thumbnails to protect against the thumbnail getting deleted (second try)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3119 [17:24:02] so 8 caches with perhaps more powerful CPUs, a bit more memory and larger SSDs can handle it [17:24:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3119 [17:24:13] Change abandoned: Bhartshorne; "retried in change 3119" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3036 [17:24:16] that means that we can do a caching center with 24 servers I think [17:26:26] maplebed: what I said yesterday was wrong: varnish doesn't track service times [17:26:32] I think because that would be fairly expensive to do [17:27:09] mark: bummer. [17:27:43] oh well. at least we have it from swift's perspective. 
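Since varnish doesn't track service times, one client-side stand-in (purely an illustrative sketch, not anything deployed here) is curl's `--write-out` timing variables; the `file://` URL just makes the example runnable anywhere, and in practice you would point it at the service URL:

```shell
# Time a single request from the client side; curl fills in the timing
# variables after the transfer completes (values for file:// are near zero,
# but the same command works against an http:// service URL).
curl -s -o /dev/null \
  -w 'total=%{time_total}s ttfb=%{time_starttransfer}s\n' \
  file:///dev/null
```

Run close to the server, `%{time_starttransfer}` roughly approximates the backend's service time.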
[17:28:06] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.87275554622 (gt 8.0) [17:29:32] yeah [17:30:05] i'm a bit concerned that we can't really front swift with a non-persistent varnish cache either [17:30:11] it's not a whole lot more performant than ms5 [17:30:21] mark - i see you got oxygen to start multicasting - yea! [17:30:26] yep [17:30:33] but not everything is received by it yet [17:30:41] dederik is going to be very happy [17:30:46] however, eqiad handles the multicast traffic fine [17:30:49] yeah you can tell him ;) [17:31:59] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3119 [17:32:02] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3119 [17:32:11] drdee - gelukkig om u te vertellen dat 'multicasting' in 'eqiad' werkt ("happy to tell you that 'multicasting' works in 'eqiad'") - sounds correct ? ;-) [17:32:28] haha [17:32:30] not really [17:32:36] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.429 seconds [17:32:36] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.614 seconds [17:32:43] "ik ben blij u te mogen vertellen dat..." ("I am glad to be able to tell you that...") [17:32:50] sounds very formal though ;) [17:33:40] mark: just fyi I'm merging and testing and pushing https://gerrit.wikimedia.org/r/#change,3119 right now. [17:33:52] in case upload shit starts breaking it might not be your change. [17:33:53] :P [17:34:24] hehe looking [17:34:55] it's the same change as https://gerrit.wikimedia.org/r/#change,3036 which ryan and daniel both reviewed [17:34:56] I see [17:35:27] lvs4 is currently active so I'm deploying it to lvs3 first to test.
[17:39:03] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:03] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:34] mark: would you object to installing curl on the lvs servers to make it easier to test a change? [17:39:47] no [17:40:04] I can't test it from off-host because I can't force the IP addr for the service to go to the inactive host. [17:40:15] right [17:40:35] but indeed testing on lvs3 works too [17:40:52] how do you usually test a change? [17:41:03] on the inactive host [17:41:09] and/or with telnet to port 80 for such a thing [17:41:11] curl -o /tmp/foo -vvv -H "Host: upload.wikimedia.org" http://10.2.1.27/wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/46px-Little_kitten_.jpg is my test, but curl isn't installed. [17:41:15] oh, you just telnet? [17:41:17] hrmph. [17:41:18] yeah [17:41:20] yeah, I can do that. [17:41:23] but feel free to apt-get install curl ;) [17:41:25] it doesn't hurt at all [17:41:28] puppet! [17:41:37] or even puppet [17:41:41] :D [17:41:43] I do such things manually when I need them [17:41:54] PROBLEM - Puppet freshness on mw53 is CRITICAL: Puppet has not run in the last 10 hours [17:42:01] like e.g. tshark too [17:43:11] hm. [17:43:36] I can't connect to 10.2.1.27 on port 80 while on lvs3 or 4. [17:43:41] though I can from off-host. [17:43:49] ::sigh:: [17:44:18] maplebed: want me to check out the tubes part ? [17:44:41] LeslieCarr: no thanks - lvs4 is certainly the active one. [17:46:43] ok, I'm going to rely on the pybal log to verify the change works rather than a telnet test. [17:46:59] mark: maybe you can walk me through how to test stuff on the lvs servers when this is done and you have a few minutes?
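The "puppet!" aside above refers to managing the package in the manifests instead of a one-off `apt-get install curl`. A minimal sketch of such a resource (where it would live in operations/puppet is not specified here, so this is only illustrative):

```puppet
# Hypothetical manifest fragment: keep curl installed on the LVS hosts
# so ad-hoc tests like the curl one quoted above work out of the box.
package { 'curl':
    ensure => installed,
}
```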
[17:47:25] !log power cycling db1040, crashed again [17:47:27] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.710 seconds [17:47:27] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.719 seconds [17:47:28] Logged the message, Master [17:47:56] !log pybal restarted on lvs3 [17:47:59] Logged the message, Master [17:50:24] maplebed: that's because the lvs host itself has that ip [17:50:26] RECOVERY - Host db1040 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [17:50:35] you need to connect to the real server ip [17:50:51] there is no (easy) way to connect to the service IP on the real servers [17:50:53] PROBLEM - NTP on db1040 is CRITICAL: NTP CRITICAL: Offset unknown [17:51:02] it would involve crafting your own tcp packets and bypassing the linux routing table [17:51:38] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: CRIT replication delay 29891 seconds [17:51:53] mark: but doesn't the lvs server choose which backend to use based on the service IP? [17:52:02] New patchset: Lcarr; "Reenabling icinga install on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3120 [17:52:07] maplebed: no, it connects using the normal service ips (so cp1021, etc) [17:52:14] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3120 [17:52:42] when you telnet to 10.2.1.27 you telnet to localhost [17:52:45] mark: no, what I mean is that if I connect to 10.2.1.27, it knows it's supposed to send traffic to swift and not, say, search. [17:53:04] yes [17:53:12] but it doesn't work locally [17:53:15] only for incoming traffic [17:53:20] so if I connect to localhost on lvs3, how does it know which service I'm trying to test? 
[17:53:32] it doesn't [17:53:35] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:42] so then how do I test the change? [17:53:43] you can't do anything with the service ip on the LVS server itself [17:53:44] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:58] you test by connecting to ms-fe1/ms-fe2 directly, like pybal itself does too [17:54:07] well, I did that part already... [17:54:14] if that works it'll be fine [17:54:15] but that's not really testing my change. [17:54:22] so just apply that change in pybal.conf, and see if it works [17:54:27] since lvs3 is not really active, it's fine [17:54:40] right, how do I see if it works? (without making lvs3 active) [17:54:49] just the pybal log? [17:54:49] by checking /var/log/pybal.log [17:54:51] and also ipvsadm -l [17:54:56] RECOVERY - NTP on db1040 is OK: NTP OK: Offset 0.003578186035 secs [17:54:57] indeed [17:55:00] hmm... [17:55:05] if it doesn't work it'll mark the hosts as down [17:55:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.108 seconds [17:56:00] well, it says the hosts are up but I don't see the requests for my health check file on ms-fe1's access log. [17:56:17] PROBLEM - MySQL Slave Running on db1040 is CRITICAL: CRIT replication Slave_IO_Running: No Slave_SQL_Running: No Last_Error: Rollback done for prepared transaction because its XID was not in the [17:56:21] weird [17:56:33] it may be that it's not getting logged on ms-fe1 [17:56:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:41] tcpdump perhaps [17:56:47] doing it now. 
[17:56:47] tcpdump -i eth0 host lvs3.pmtpa.wmnet [17:56:56] then you only see traffic from lvs3, not client traffic [17:57:12] it's doing requests every 30s or so [17:57:46] New patchset: Lcarr; "Reenabling icinga install on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3120 [17:57:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3120 [17:58:18] ah, my puppet change didn't take that time. [17:58:19] New patchset: RobH; "updated ipmi script to work a bit better, added iron into site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3121 [17:58:26] so it's still testing the old URL. [17:58:32] New patchset: RobH; " left out one tiny change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3122 [17:58:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3121 [17:58:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3122 [17:58:46] aww man, it pushed them as two changes, damn it. [17:59:10] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3120 [17:59:13] my food is ready [17:59:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3120 [17:59:23] maplebed: call me if you have issues/downtime [17:59:26] i guess it works, just annoying cuz i didnt expect it to. [17:59:30] mark: ok. [17:59:31] thanks. [17:59:38] bbl [17:59:57] RobH: that's happened to me a bunch too. I agree it's annoying, but it does work. 
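The verification recipe mark gives above — apply the pybal.conf change on the inactive host, then watch /var/log/pybal.log and `ipvsadm -l`, with tcpdump on the backend — comes down to confirming that every backend stays up. The log lines below are a made-up pybal-style illustration of that check, not real pybal output:

```shell
# Real commands on the inactive LVS host would be:
#   tail -f /var/log/pybal.log   # pybal's view of each backend's health
#   ipvsadm -l                   # the kernel's IPVS table of real servers
# Hypothetical pybal-style excerpt, to show what to look for:
cat > /tmp/pybal.sample <<'EOF'
[swift] Server ms-fe1.pmtpa.wmnet (enabled/up/pooled)
[swift] Server ms-fe2.pmtpa.wmnet (enabled/down/not pooled)
EOF
# A backend whose health check fails gets marked down and depooled;
# count the ones pybal still considers healthy:
grep -c 'enabled/up/pooled' /tmp/pybal.sample   # → 1
```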
[18:00:11] yea just gotta make sure i push both and do so in order i guess [18:00:44] New review: RobH; "easy changes to a server no one is using yet and a script i wrote anyhow" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3121 [18:00:47] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3121 [18:01:33] New review: RobH; "updated in script help prompts" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3122 [18:01:35] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3122 [18:01:57] heh, every gerrit commit makes it sound like the world is ending on my computer (my name highlight in irc) [18:02:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:15] perhaps having it post to both tech and ops is a bit overkill. [18:02:30] the origin production could just echo here [18:02:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.064 seconds [18:04:23] RECOVERY - MySQL Slave Running on db1040 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [18:08:32] !log stopping pybal on lvs4 - should fail over to lvs3 [18:08:35] Logged the message, Master [18:09:56] PROBLEM - MySQL Slave Delay on db1040 is CRITICAL: CRIT replication delay 27764 seconds [18:09:58] !log power cycling db1020, which also froze this morning [18:10:00] !log failover successful, restarted pybal on lvs4, failback successful. [18:10:01] Logged the message, Master [18:10:05] Logged the message, Master [18:13:25] mark: any idea why traffic spiked so much when I triggered the failover? 
http://screencast.com/t/Rf9za9clmJw [18:23:08] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.152 seconds [18:23:08] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.149 seconds [18:28:48] Jeff_Green: yo [18:29:18] tfinc: yo. i'm running out the door to fetch my kids from school, sitter locked her keys in the car [18:29:23] this is day of madness [18:29:35] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:35] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:35] can I ping you in about 45min? [18:29:40] New patchset: Lcarr; "inserting icinga config file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3123 [18:29:45] Jeff_Green: when would be a good time to sync up about pediapress stuff ? [18:29:46] sure [18:29:48] seeya then [18:29:53] New patchset: Lcarr; "adding config files into git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3124 [18:30:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3123 [18:30:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3124 [18:30:06] i'm not going to get to it until tomorrow--got a bunch of fundraising/civicrm stuff in my queue [18:30:16] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3123 [18:30:19] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3123 [18:30:33] later today could work assuming all the hell that's broken loose gets reined in again.
let's check in in ~45 [18:30:41] k [18:31:57] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3124 [18:32:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3124 [18:37:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:32] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay 0 seconds [18:40:32] RECOVERY - MySQL Slave Delay on db1040 is OK: OK replication delay 0 seconds [18:43:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.905 seconds [18:50:13] New patchset: Pyoungmeister; "using these would be smart!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3125 [18:50:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3125 [18:51:50] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3125 [18:51:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3125 [18:54:47] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.765 seconds [18:55:06] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3115 [19:02:24] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:21] !log streaming hotbackup of db1041 to db56 (new s7 slave replacing db18) [19:05:24] Logged the message, Master [19:12:45] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.300 seconds [19:15:17] New patchset: Lcarr; "Putting service definition after files installed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3126 [19:15:29] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3126 [19:15:42] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3126 [19:15:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3126 [19:17:15] !log iron updated to use ipmi_mgmt script [19:17:18] Logged the message, RobH [19:18:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:03] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.376 seconds [19:19:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:06] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.2599349167 (gt 8.0) [19:24:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.746 seconds [19:25:21] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:30] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.031 seconds [19:31:48] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.239 seconds [19:37:30] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.98531258333 (gt 8.0) [19:38:06] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:06] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:03] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [19:40:48] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 132 MB (1% inode=61%): /var/lib/ureadahead/debugfs 132 MB (1% inode=61%): [19:44:24] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.706 seconds [19:50:42] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [19:56:32] RECOVERY - Disk space on srv219 is OK: DISK OK [20:01:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.677 seconds [20:07:02] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.60560366667 [20:08:59] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.894 seconds [20:09:26] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.022 seconds [20:14:11] New patchset: Lcarr; "Trying to ignore this as a requirement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3127 [20:14:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3127 [20:15:44] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:17:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:26] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.10203858333 (gt 8.0) [20:23:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.177 seconds [20:25:10] woosters: you there? [20:27:35] back [20:27:43] tfinc [20:28:06] what else do you guys need for this hardware request? http://rt.wikimedia.org/Ticket/Display.html?id=2582 [20:28:15] its for sms/ussd [20:29:59] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:36] i think we have the info .. patrick has replied back to Mark's question [20:32:55] yeah, just one high performance misc server in each data center then [20:34:02] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.934 seconds [20:34:38] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.327 seconds [20:35:40] mark: woosters : great. 
if we have the hardware whats next to get it provisioned ? [20:36:01] patience [20:37:47] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.557444 [20:38:26] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3127 [20:38:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3127 [20:39:23] patience is easy once i have a timeline/set of expectaions [20:39:26] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 131 MB (1% inode=61%): /var/lib/ureadahead/debugfs 131 MB (1% inode=61%): [20:39:30] tfinc - the ticket has been updated. i'll followup with robhalsell tomorrow [20:39:35] expectations* [20:39:38] woosters: thanks [20:39:47] expectation is that you'll get two hosts in the next couple of days [20:40:20] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:51] we try to keep those boxes spare [20:40:56] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:42:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:43:30] tfinc: want to talk pediapress? [20:43:57] did you get that project jeff? [20:44:04] you get all the cool stuff [20:44:48] yes, I certainly do! [20:45:40] better him than u, mark ;-P [20:46:24] if I did it, the service would probably disappear [20:47:18] woosters: you know, at CL I became famous for my skills at suppressing madness. I completed the pass-the-torch training to the guy who inherited my job yesterday. It was a one-sentence training: "Repeat after me: 'No.'" [20:47:54] did u write 2 sealed letters to him as well? [20:48:08] heard that joke before? 
[20:48:08] hahahahahh [20:48:18] i think so yeah [20:48:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.373 seconds [20:49:56] RECOVERY - Puppet freshness on cp1036 is OK: puppet ran at Tue Mar 13 20:49:34 UTC 2012 [20:50:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.428 seconds [20:51:53] RECOVERY - Disk space on srv219 is OK: DISK OK [20:55:31] Jeff_Green: sure [20:57:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:55] woosters: i have not, pls share. [21:01:47] New patchset: Lcarr; "trying another commenting out" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3128 [21:02:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3128 [21:02:04] tfinc: ok--so I reviewed the email thread and I have a very general idea of the situation [21:02:17] robh - http://toperjokes.blogspot.com/2007/05/two-envelopes.html [21:02:38] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3128 [21:02:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3128 [21:02:53] heh [21:03:02] ahh bitter joke. 
i like [21:05:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.370 seconds [21:10:05] RECOVERY - Host cp1036 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [21:11:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:17:44] RECOVERY - mysqld processes on db56 is OK: PROCS OK: 1 process with command name mysqld [21:20:35] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.972 seconds [21:20:53] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [21:21:47] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.857 seconds [21:22:50] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [21:23:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:23:35] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 2591 seconds [21:24:20] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 2293 seconds [21:25:50] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [21:25:50] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours [21:25:50] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours [21:27:02] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:05] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:29:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.035 seconds [21:29:44] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 0 seconds [21:29:53] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [21:29:53] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 
hours [21:30:20] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 0 seconds [21:36:25] New patchset: Asher; "making sync_binlog=1 the default for prod dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3130 [21:36:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3130 [21:39:38] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.892 seconds [21:39:47] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [21:40:32] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.567 seconds [21:46:59] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:48:02] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:48:47] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [21:48:56] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.665 seconds [21:55:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:57:47] binasher: jfyi i'm going to be updating MobileFrontend on the cluster probably in the 15 minutes or so - will you be around to flush the varnish cache? [21:58:11] s/the 15/the next 15 [21:58:46] why will varnish need flushing? [21:59:17] binasher: word on the street is varnish needs flushing post MobileFrontend deployments [21:59:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.953 seconds [22:01:48] awjr: well, we don't always need to flush the cache [22:02:32] preilly: ok.. so when do i need to make sure the cache is flushed after a MobileFrontend deployment? [22:02:36] awjr: well, we only do it if the page structurally changes drastically [22:03:12] preilly: so i take it modest CSS changes and a couple of one-line bug fixes don't count?
[22:03:32] awjr: well, basically when the page and the resources that it loads would conflict with the cached assets [22:03:53] awjr: well, the CSS and JS should have different version query strings and be okay [22:04:09] awjr: did those get updated in the ApplicationTemplate ? [22:04:10] ah ok [22:04:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:04:33] preilly: no they did not thanks for reminding me. this is reminding me what life was like before RL [22:04:41] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.566 seconds [22:05:53] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.035 seconds [22:13:14] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:26] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.788 seconds [22:23:14] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:29] New patchset: Reedy; "Switch foreachwikiindblist to use MWScript.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3131 [22:23:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3131 [22:25:38] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.401 seconds [22:27:17] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 4.377 seconds [22:29:47] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3131 [22:35:59] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:40:20] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:33] New patchset: Bhartshorne; "removed extra slash from squid purge URLs. purge was generating http://upload...//wikipe... rather than http://upload.../wikipe..., causing the purge to fail (silently)." [operations/software] (master) - https://gerrit.wikimedia.org/r/3132 [22:42:24] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3132 [22:42:26] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/3132 [22:43:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:32] New patchset: Bhartshorne; "swiftcleaner calls htcp.php. may as well install it along side swiftcleaner." 
[operations/software] (master) - https://gerrit.wikimedia.org/r/3133 [22:46:19] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3133 [22:46:21] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/3133 [22:49:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.817 seconds [22:56:59] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.907 seconds [22:56:59] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.253 seconds [23:03:17] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:03:26] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:09:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.806 seconds [23:17:50] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.088 seconds [23:24:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:20] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.441 seconds [23:34:35] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 4.862 seconds [23:47:38] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:27] New patchset: Bhartshorne; "first draft of the swift cleaner stuff. I know this doesn't work but I want to check it in for reviews." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134 [23:49:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3134 [23:51:19] !log upgrading bugzilla to 4.0.5 [23:51:22] Logged the message, Master [23:53:02] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:56] New patchset: Bhartshorne; "first draft of the swift cleaner stuff. I know this doesn't work but I want to check it in for reviews." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134 [23:54:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3134
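Change 3132 above removed an extra slash from the squid purge URLs: the purge was generating `http://upload...//wikipe...` instead of `http://upload.../wikipe...`, so the purge never matched the cached object and failed silently. A sketch of that kind of URL normalization (the sed pattern is illustrative, not the deployed fix; the URL is the test thumbnail quoted earlier):

```shell
# Collapse repeated slashes in the path while leaving the '//' of the
# scheme ('http://') untouched.
url='http://upload.wikimedia.org//wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/46px-Little_kitten_.jpg'
printf '%s\n' "$url" | sed -E 's|([^:/])/{2,}|\1/|g'
# → http://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/46px-Little_kitten_.jpg
```

Because the cache keys on the exact URL, a purge for the doubled-slash form is simply a miss, which is why nothing errored out.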