[00:16:56] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3036
[00:19:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:25:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.896 seconds
[00:59:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:01:26] PROBLEM - Host ssl3001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:26] PROBLEM - Host ssl3003 is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:35] PROBLEM - Host wikibooks-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:36] PROBLEM - Host wikinews-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:37] PROBLEM - Host wikipedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:37] PROBLEM - Host wikisource-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[01:02:02] RECOVERY - Host ssl3003 is UP: PING OK - Packet loss = 0%, RTA = 120.02 ms
[01:02:02] RECOVERY - Host ssl3001 is UP: PING OK - Packet loss = 0%, RTA = 118.80 ms
[01:02:11] RECOVERY - Host wikibooks-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 119.89 ms
[01:02:20] RECOVERY - Host wikipedia-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 118.96 ms
[01:02:38] RECOVERY - Host wikisource-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 118.06 ms
[01:02:56] RECOVERY - Host wikinews-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 119.19 ms
[01:05:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.889 seconds
[01:07:17] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:17:20] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[01:18:14] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours
[01:18:14] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours
[01:19:17] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[01:19:17] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours
[01:19:17] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours
[01:22:17] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours
[01:22:17] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours
[01:22:17] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours
[01:22:17] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours
[01:23:11] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 17.3815615179 (gt 8.0)
[01:23:20] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours
[01:26:20] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[01:26:20] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[01:31:26] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.371558125
[01:32:02] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.00294721739 (gt 8.0)
[01:36:14] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours
[01:38:02] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.287232807018
[01:40:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:44:20] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours
[01:45:14] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[01:46:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.177 seconds
[01:55:17] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours
[01:55:17] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:17] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:18] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:18] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:19] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:19] PROBLEM - Puppet freshness on knsq25 is CRITICAL: Puppet has not run in the last 10 hours
[02:01:20] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:17] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:17] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:17] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:17] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:17] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours
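The "Puppet freshness" floods above come from a check that goes CRITICAL when a host's last Puppet run is older than a threshold (10 hours here). A minimal sketch of that logic in Python; the function name and signature are illustrative, not the actual Nagios plugin:

```python
import time

FRESHNESS_THRESHOLD = 10 * 3600  # seconds; the "10 hours" quoted in the alerts


def puppet_freshness(last_run_epoch, now=None, threshold=FRESHNESS_THRESHOLD):
    """Return a (state, message) pair in the spirit of a Nagios check.

    last_run_epoch: Unix timestamp of the host's last completed Puppet run.
    """
    now = time.time() if now is None else now
    age = now - last_run_epoch
    if age > threshold:
        return ("CRITICAL", "Puppet has not run in the last 10 hours")
    return ("OK", "puppet ran %d seconds ago" % age)
```

Because the check measures staleness rather than a live failure, a broken puppetmaster (like the flapping `Puppetmaster HTTPS on stafford` above) shows up roughly ten hours later as a wave of per-host freshness alerts.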
[02:10:17] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:18] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:18] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:19] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:19] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:20] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:20] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[02:10:21] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:21] PROBLEM - Puppet freshness on ssl3004 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours
[02:11:20] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:20] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:21] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:21] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours
[02:14:22] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[02:15:14] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours
[02:16:17] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours
[02:16:17] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours
[02:17:20] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours
[02:17:20] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours
[02:17:20] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours
[02:19:17] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:25:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.789 seconds
[06:33:22] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours
[06:37:16] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours
[06:41:19] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[06:46:48] PROBLEM - Puppet freshness on virt1 is CRITICAL: Puppet has not run in the last 10 hours
[06:56:51] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[07:40:36] PROBLEM - Puppet freshness on mw53 is CRITICAL: Puppet has not run in the last 10 hours
[08:34:14] New patchset: ArielGlenn; "add 10.64.16 to hosts for common/httpdconf sync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3106
[08:34:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3106
[08:35:47] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3106
[08:35:50] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3106
[09:26:25] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:28:22] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[09:36:37] PROBLEM - Host db1040 is DOWN: PING CRITICAL - Packet loss = 100%
[09:38:43] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[09:38:43] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours
[09:51:37] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:53:34] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[10:16:59] !log Rebooting manutius with newer 2.6.36 kernel to attempt avoiding i/o kernel bug with torrus
[10:17:04] Logged the message, Master
[10:34:11] New patchset: Mark Bergsma; "Do HTCP loss monitoring on the upload eqiad servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3107
[10:34:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3107
[10:34:48] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3107
[10:34:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3107
[10:36:08] apergos: why not 10.64.16.0/22?
[10:36:30] how would I know which it is?
[10:36:39] sorry, I didn't know where to look for that
[10:36:42] from the router config for example
[10:36:51] or from any host in that subnet
[10:37:03] lemme go look at one
[10:37:29] I didn't deploy it yet, there's undeployed stuff of leslie's, I didn't know if it could go so I sent an email
[10:37:35] i am deploying it right now
[10:37:38] ok
[10:37:52] if it can't go in she shouldn't have merged it
[10:38:06] what am I supposed to do, just wait for 8 hours until she gets back and not do anything myself? ;)
[10:38:21] well cherry pick yours I guess :-P
[10:38:28] hell no
[10:38:38] we shouldn't ever cherry pick on there
[10:38:43] that causes conflicts and shit
[10:38:48] ok I see it's 22 by looking at one of the hosts
[10:38:52] thank you for that
[10:39:06] you might as well add .32.0/22
[10:39:09] well that is why I didn't cherry pick, I wasn't sure about the consequences for later
[10:39:11] that's gonna be row C soon
[10:39:13] ok
[10:39:23] I'll do that right now
[10:39:31] or, even better
[10:39:43] you can make a puppet list of private production subnets in network.pp
[10:39:47] and reference that from the template
[10:40:11] we already have the overarching prefixes in there
[10:40:15] just not individual subnets
[10:41:30] in the network constants you mean?
[10:41:35] yes
[10:42:09] so what's the easy way to generate that list?
[10:42:19] I don't like the easy way, I want the proper way ;p
[10:42:26] ok. so what's the proper way?
[10:42:41] I think it should be a hash containing realm, site, public/private
[10:43:03] um, you're answering a different question
[10:43:18] so subnets => { 'production' => { 'pmtpa' => { 'public' => [ "10.0.0.0/16" ] } } }
[10:43:29] it's on the routers
[10:43:33] ok
[10:43:40] good practice ;)
[10:43:45] I'll take it
[10:43:45] also v4/v6 perhaps
[10:43:49] as not all tools will support both
[10:43:55] i'm happy to review what you have
[10:44:02] you'll be reviewing it all right
[10:44:09] at least someone sure will
[10:44:31] I review everything
[10:44:35] just not always before merge ;)
[10:44:42] this one's gonna be before merge
[10:44:49] ok so I'm sorry to keep asking dumb questions but
[10:46:15] thanks
[10:46:19] distinguish between public, private, and labs I guess
[10:46:19] although labs is also a different realm
[10:46:31] yeah, it's a realm, so nevermind
[10:46:38] so public and private?
[10:46:47] well as a realm it can have public and private
[10:46:58] realms are: production, labs, fundraising
[10:47:05] within those we have public/private subnets
[10:47:15] ok
[10:47:15] and datacenters (sites)
[10:47:24] that should be reflected in the hash structure
[10:47:52] so any time someone adds a subnet they need to remember to go to puppet and change it there too
[10:47:57] (in the future)
[10:48:06] yes
[10:48:13] in one place, instead of 50 config files
[10:48:18] heh
[10:49:20] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 283 MB (3% inode=61%): /var/lib/ureadahead/debugfs 283 MB (3% inode=61%):
[10:49:25] New patchset: Mark Bergsma; "include nagios::configuration so $master_hosts can be referenced" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3108
[10:49:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3108
[10:49:45] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3108
[10:49:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3108
[10:50:35] btw if you're not sure someone's undeployed change is gonna break something, you can of course always revert it
[10:50:39] they can revert that again later
[10:50:47] I guess that's the best way of going at it
[10:50:57] leslie's change just broke something, i'm fixing it now, but I could have reverted it instead
[10:51:26] RECOVERY - Disk space on srv220 is OK: DISK OK
[10:54:47] ok, I didn't think of that but it makes sense
[10:55:22] New patchset: Mark Bergsma; "Fix varnishhtcpd path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3109
[10:55:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3109
[10:55:48] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3109
[10:55:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3109
[11:02:13] !log Rebooting lvs1002 with kernel updates
[11:02:15] brb
[11:02:16] Logged the message, Master
[11:08:50] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:18:53] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[11:19:47] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours
[11:19:47] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours
[11:20:50] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[11:20:50] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours
[11:20:50] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours
[11:23:50] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours
[11:23:50] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours
[11:23:50] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours
[11:23:50] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours
[11:24:53] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours
[11:27:53] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[11:27:53] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[11:32:16] New patchset: Mark Bergsma; "Try a dynamic lookup, global is not working" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3110
[11:32:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3110
[11:32:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3110
[11:32:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3110
[11:37:47] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours
[11:40:19] New patchset: Mark Bergsma; "Install socat for unicast->multicast relaying" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3111
[11:40:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3111
[11:42:21] what is the "sandbox subnet"?
[11:42:23] 208.80.152.228/27
[11:43:36] and do the virt hosts (virt1-4) count as labs or something else?
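The socat package installed above is for the unicast-to-multicast relaying mentioned in the `!log` entries that day (oxygen relaying squid log lines to the GLOP group 233.58.59.1). A rough Python sketch of what such a relay does; the actual socat invocation is not shown in this log, and the port number here is a hypothetical stand-in:

```python
import socket

MCAST_GROUP = "233.58.59.1"  # GLOP group from the !log entry
MCAST_PORT = 5678            # hypothetical; the real port is not in the log


def make_relay_sockets(listen_port, ttl=8):
    """Create the receive (unicast) and send (multicast) UDP sockets."""
    recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv_sock.bind(("", listen_port))  # unicast log datagrams arrive here
    send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL > 1 so the datagrams can cross routers to the other datacenter
    send_sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return recv_sock, send_sock


def relay_forever(recv_sock, send_sock):
    """Forward every datagram received on recv_sock to the multicast group."""
    while True:
        data, _addr = recv_sock.recvfrom(65535)
        send_sock.sendto(data, (MCAST_GROUP, MCAST_PORT))


if __name__ == "__main__":
    relay_forever(*make_relay_sockets(MCAST_PORT))
```

The design point is that the squids only need to send plain unicast UDP to one relay host (oxygen), which then fans the stream out to any number of multicast subscribers.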
[11:45:53] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours
[11:46:47] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[11:47:23] New patchset: Mark Bergsma; "Migrate CDN logging to our GLOP multicast address range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3112
[11:47:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3112
[11:48:05] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3111
[11:48:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3111
[11:48:28] mark?
[11:48:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3112
[11:48:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3112
[11:48:41] yes?
[11:48:50] ah
[11:48:57] sandbox subnet is a separate public subnet
[11:49:03] and the virt hosts are in production
[11:49:06] the virt guests are not
[11:49:24] ok
[11:53:14] if you put them in a hash
[11:53:21] give the subnet names as well, I think
[11:53:24] that would be most flexible
[11:53:32] the eqiad subnets are all named public1-a-eqiad or private1-c-eqiad etc
[11:53:35] the tampa ones are a bit messy
[11:53:48] but they're called internal, pub-services, pub-services2, sandbox, squid-lvs
[11:53:50] etc
[11:54:07] uh huh, I think I have the names
[11:54:19] I don't have anything for fundraiser so someone else will have to add that stuff
[11:54:30] there is no separate subnet for that yet
[11:54:48] ok then I will not make a stanza for it
[11:56:41] New patchset: Mark Bergsma; "Subscribe to upstart job changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3113
[11:56:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3113
[11:57:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3113
[11:57:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3113
[11:59:56] !log Setup squid logging to oxygen, with oxygen relaying to multicast 233.58.59.1
[11:59:59] Logged the message, Master
[12:58:25] !log Seeding the eqiad upload caches from live upload requests
[12:58:28] Logged the message, Master
[13:23:15] mark, how does this look? http://p.defau.lt/?t221vaRygUNaRL1SvO1zZQ
[13:24:57] pretty awesome on first sight
[13:25:05] i'll have to look at it a bit better though
[13:25:59] sure
[13:26:36] and now you'll need to find a way to "flatten" that hash down into a list in the config files
[13:26:43] yeah
[13:26:45] it would be awesome if you could specify it at any level
[13:26:47] I was thinking that
[13:26:56] say $all_network_subnets['production']
[13:27:12] I don't know how we would do that
[13:27:17] i'm just wondering about how to filter out ipv4 or ipv6
[13:27:24] well it can be done with ruby i'm sure
[13:27:49] oh the ipv6 subnets are wrong
[13:27:54] heh
[13:27:56] that should just be 2620:0:861:1::/64
[13:28:05] ok, I didn't know about that
[13:28:09] I tried looking up the syntax
[13:28:14] guess that was a fail
[13:28:38] but overall this looks pretty good
[13:28:45] lemme fix those
[13:28:57] why not commit this and then try to get it into a config such as rsync
[13:29:11] oh, I cleaned up some but not others. I see
[13:30:31] ok well I will commit this today and then I think I will try rsync tomorrow, so I can conceivably get other things done today
[13:31:30] New patchset: Mark Bergsma; "Swift response times are problematic, request only from Squid for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3114
[13:31:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3114
[13:31:45] I "normalized" the vlan names so they don't have caps and spaces in them as some do on the routers, hope that's ok
[13:31:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3114
[13:31:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3114
[13:32:59] that's great
[13:50:05] New patchset: ArielGlenn; "hash of all subnets in network constants" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3115
[13:50:15] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3115
[13:58:45] !log Sending traffic from Argentina to upload-lb.eqiad
[13:58:48] Logged the message, Master
[14:32:58] !log Sending traffic from Brazil to upload-lb.eqiad
[14:33:02] Logged the message, Master
[14:42:26] New patchset: Hashar; "hash of all subnets in network constants" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3115
[14:42:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3115
[14:42:52] apergos: I have fixed your change :)
[14:43:04] thanks
[14:43:16] I did not look at it after the push
[14:43:32] that is what I thought
[14:43:34] so I didn't even check if lint liked it
[14:43:45] it is in mark's queue for review
[14:44:07] that was a good exercise to play with git-review / git
[14:44:26] are you feeling well? :-P
[14:44:42] I like gerrit :-))))))))
[14:46:03] it is not growing on me
[14:46:12] unless it's growing on me like mold :-P :-P
[14:51:18] !log Sending traffic from Canada to upload-lb.eqiad
[14:51:21] Logged the message, Master
[15:12:55] !log manually deleted cp1025 info from nagios config file - nagios restored for now
[15:12:58] Logged the message, Mistress of the network gear.
[15:13:28] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:28] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours
[15:14:40] PROBLEM - Host db1020 is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:29] New patchset: Mark Bergsma; "Fix LVS setup of payments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3116
[15:15:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3116
[15:17:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3116
[15:17:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3116
[15:22:37] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 7, down: 1, shutdown: 0; Peering with AS64600 not established
[15:24:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.710 seconds
[15:27:01] !log Rebooting lvs1005 with upgraded kernel/packages
[15:27:04] Logged the message, Master
[15:28:32] !log Sending traffic from the USA to upload-lb.eqiad
[15:28:35] Logged the message, Master
[15:29:04] PROBLEM - Host lvs1005 is DOWN: PING CRITICAL - Packet loss = 100%
[15:30:34] RECOVERY - Host lvs1005 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms
[15:31:01] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 8, down: 0, shutdown: 0
[15:31:46] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[15:31:58] fixing
[15:32:40] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.148 seconds response time. www.wikipedia.org returns 208.80.154.225
[15:36:25] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.92723975 (gt 8.0)
[15:47:38] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 235 MB (3% inode=61%): /var/lib/ureadahead/debugfs 235 MB (3% inode=61%):
[15:50:47] RECOVERY - Disk space on srv219 is OK: DISK OK
[15:52:17] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.7209579167 (gt 8.0)
[15:54:25] New patchset: Lcarr; "Cleaning up icinga config Moved files from nagios3 directory, notify proper service, etc" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3117
[15:54:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3117
[15:58:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:01:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.398 seconds
[16:02:41] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3117
[16:02:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3117
[16:05:29] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 280 MB (3% inode=61%): /var/lib/ureadahead/debugfs 280 MB (3% inode=61%):
[16:16:23] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: NRPE: Unable to read output
[16:16:23] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 0 MB (0% inode=0%):
[16:16:23] PROBLEM - Disk space on ms1002 is CRITICAL: DISK CRITICAL - free space: /export/upload 62299 MB (0% inode=87%):
[16:16:44] PROBLEM - Memcached on marmontel is CRITICAL: Connection refused
[16:16:51] PROBLEM - MySQL Replication Heartbeat on db49 is CRITICAL: NRPE: Unable to read output
[16:16:59] PROBLEM - Memcached on srv254 is CRITICAL: Connection refused
[16:16:59] PROBLEM - mysqld processes on db1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[16:17:08] PROBLEM - RAID on virt1 is CRITICAL: CRITICAL: Degraded
[16:17:26] PROBLEM - mysqld processes on db56 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[16:17:26] PROBLEM - MySQL replication status on es1002 is CRITICAL: (Return code of 255 is out of bounds)
[16:17:35] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[16:17:35] PROBLEM - MySQL master status on es1001 is CRITICAL: CRITICAL: Read only: expected OFF, got ON
[16:17:44] PROBLEM - mysqld processes on db1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[16:17:44] PROBLEM - Disk space on db1047 is CRITICAL: DISK CRITICAL - free space: /a 6895 MB (0% inode=99%):
[16:17:44] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 403 Forbidden
[16:17:44] PROBLEM - MySQL slave status on es1002 is CRITICAL: CRITICAL: Lost connection to MySQL server at reading initial communication packet, system error: 111
[16:17:44] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:17:53] PROBLEM - LDAP on nfs1 is CRITICAL: Connection refused
[16:17:53] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:18:02] PROBLEM - MySQL disk space on db1047 is CRITICAL: DISK CRITICAL - free space: /a 6894 MB (0% inode=99%):
[16:18:02] PROBLEM - MySQL Replication Heartbeat on db48 is CRITICAL: NRPE: Unable to read output
[16:18:02] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 0 MB (0% inode=0%):
[16:18:11] PROBLEM - LDAPS on nfs1 is CRITICAL: Connection refused
[16:18:20] PROBLEM - Backend Squid HTTP on knsq25 is CRITICAL: Connection refused
[16:19:23] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 80 MB (1% inode=61%): /var/lib/ureadahead/debugfs 80 MB (1% inode=61%):
[16:21:29] RECOVERY - Disk space on srv220 is OK: DISK OK
[16:22:32] RECOVERY - Disk space on ms1002 is OK: DISK OK
[16:24:20] PROBLEM - Lucene on searchidx1001 is CRITICAL: Connection refused
[16:36:20] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.6114966667 (gt 8.0)
[16:36:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:37:39] !log reinstalling neon
[16:37:42] Logged the message, Mistress of the network gear.
[16:39:11] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours
[16:40:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.863 seconds
[16:43:14] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[16:44:41] !log Sending traffic from Japan, India, Mexico to upload-lb.eqiad
[16:44:45] Logged the message, Master
[16:52:41] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.79082958333 (gt 8.0)
[16:59:36] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[17:00:52] New patchset: Jgreen; "pgehres storage3 shell access per RT 2610" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3118
[17:01:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3118
[17:01:30] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3118
[17:01:33] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3118
[17:12:18] !log Sending all normally-pmtpa upload traffic to upload-lb.eqiad
[17:12:21] Logged the message, Master
[17:13:46] mark, there was a spike earlier .. what caused it?
[17:13:57] no idea [17:14:03] on upload i mean [17:14:23] oh perhaps my testing? [17:14:36] my testing of live traffic from the squids [17:15:05] otherwise, the load barely exercising those servers [17:15:31] hmm starting to climb a little now [17:15:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.772 seconds [17:16:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:15] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.049 seconds [17:23:45] yeah [17:23:53] New patchset: Bhartshorne; "changing lvs and nagios to check for a file in swift directly rather than going through the swift rewrite stuff for thumbnails to protect against the thumbnail getting deleted (second try)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3119 [17:24:02] so 8 caches with perhaps more powerful CPUs, a bit more memory and larger SSDs can handle it [17:24:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3119 [17:24:13] Change abandoned: Bhartshorne; "retried in change 3119" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3036 [17:24:16] that means that we can do a caching center with 24 servers I think [17:26:26] maplebed: what I said yesterday was wrong: varnish doesn't track service times [17:26:32] I think because that would be fairly expensive to do [17:27:09] mark: bummer. [17:27:43] oh well. at least we have it from swift's perspective. 
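Since varnish doesn't track service times, one client-side stand-in (purely an illustrative sketch, not anything deployed here) is curl's `--write-out` timing variables; the `file://` URL just makes the example runnable anywhere, and in practice you would point it at the service URL:

```shell
# Time a single request from the client side; curl fills in the timing
# variables after the transfer completes (values for file:// are near zero,
# but the same command works against an http:// service URL).
curl -s -o /dev/null \
  -w 'total=%{time_total}s ttfb=%{time_starttransfer}s\n' \
  file:///dev/null
```

Run close to the server, `%{time_starttransfer}` roughly approximates the backend's service time.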
[17:28:06] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.87275554622 (gt 8.0) [17:29:32] yeah [17:30:05] i'm a bit concerned that we can't really front swift with a non-persistent varnish cache either [17:30:11] it's not a whole lot more performant than ms5 [17:30:21] mark - i see you got oxygen to start multicasting - yea! [17:30:26] yep [17:30:33] but not everything is received by it yet [17:30:41] dederik is going to be very happy [17:30:46] however, eqiad handles the multicast traffic fine [17:30:49] yeah you can tell him ;) [17:31:59] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3119 [17:32:02] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3119 [17:32:11] drdee - gelukkig om u te vertellen dat 'multicasting' in 'eqiad' werkt ("happy to tell you that 'multicasting' works in 'eqiad'") - sounds correct ? ;-) [17:32:28] haha [17:32:30] not really [17:32:36] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.429 seconds [17:32:36] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.614 seconds [17:32:43] "ik ben blij u te mogen vertellen dat..." ("I am glad to be able to tell you that...") [17:32:50] sounds very formal though ;) [17:33:40] mark: just fyi I'm merging and testing and pushing https://gerrit.wikimedia.org/r/#change,3119 right now. [17:33:52] in case upload shit starts breaking it might not be your change. [17:33:53] :P [17:34:24] hehe looking [17:34:55] it's the same change as https://gerrit.wikimedia.org/r/#change,3036 which ryan and daniel both reviewed [17:34:56] I see [17:35:27] lvs4 is currently active so I'm deploying it to lvs3 first to test.
[17:39:03] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:03] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:34] mark: would you object to installing curl on the lvs servers to make it easier to test a change? [17:39:47] no [17:40:04] I can't test it from off-host because I can't force the IP addr for the service to go to the inactive host. [17:40:15] right [17:40:35] but indeed testing on lvs3 works too [17:40:52] how do you usually test a change? [17:41:03] on the inactive host [17:41:09] and/or with telnet to port 80 for such a thing [17:41:11] curl -o /tmp/foo -vvv -H "Host: upload.wikimedia.org" http://10.2.1.27/wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/46px-Little_kitten_.jpg is my test, but curl isn't installed. [17:41:15] oh, you just telnet? [17:41:17] hrmph. [17:41:18] yeah [17:41:20] yeah, I can do that. [17:41:23] but feel free to apt-get install curl ;) [17:41:25] it doesn't hurt at all [17:41:28] puppet! [17:41:37] or even puppet [17:41:41] :D [17:41:43] I do such things manually when I need them [17:41:54] PROBLEM - Puppet freshness on mw53 is CRITICAL: Puppet has not run in the last 10 hours [17:42:01] like e.g. tshark too [17:43:11] hm. [17:43:36] I can't connect to 10.2.1.27 on port 80 while on lvs3 or 4. [17:43:41] though I can from off-host. [17:43:49] ::sigh:: [17:44:18] maplebed: want me to check out the tubes part ? [17:44:41] LeslieCarr: no thanks - lvs4 is certainly the active one. [17:46:43] ok, I'm going to rely on the pybal log to verify the change works rather than a telnet test. [17:46:59] mark: maybe you can walk me through how to test stuff on the lvs servers when this is done and you have a few minutes?
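The "puppet!" aside above refers to managing the package in the manifests instead of a one-off `apt-get install curl`. A minimal sketch of such a resource (where it would live in operations/puppet is not specified here, so this is only illustrative):

```puppet
# Hypothetical manifest fragment: keep curl installed on the LVS hosts
# so ad-hoc tests like the curl one quoted above work out of the box.
package { 'curl':
    ensure => installed,
}
```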
[17:47:25] !log power cycling db1040, crashed again [17:47:27] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.710 seconds [17:47:27] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.719 seconds [17:47:28] Logged the message, Master [17:47:56] !log pybal restarted on lvs3 [17:47:59] Logged the message, Master [17:50:24] maplebed: that's because the lvs host itself has that ip [17:50:26] RECOVERY - Host db1040 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [17:50:35] you need to connect to the real server ip [17:50:51] there is no (easy) way to connect to the service IP on the real servers [17:50:53] PROBLEM - NTP on db1040 is CRITICAL: NTP CRITICAL: Offset unknown [17:51:02] it would involve crafting your own tcp packets and bypassing the linux routing table [17:51:38] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: CRIT replication delay 29891 seconds [17:51:53] mark: but doesn't the lvs server choose which backend to use based on the service IP? [17:52:02] New patchset: Lcarr; "Reenabling icinga install on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3120 [17:52:07] maplebed: no, it connects using the normal service ips (so cp1021, etc) [17:52:14] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3120 [17:52:42] when you telnet to 10.2.1.27 you telnet to localhost [17:52:45] mark: no, what I mean is that if I connect to 10.2.1.27, it knows it's supposed to send traffic to swift and not, say, search. [17:53:04] yes [17:53:12] but it doesn't work locally [17:53:15] only for incoming traffic [17:53:20] so if I connect to localhost on lvs3, how does it know which service I'm trying to test? 
[17:53:32] it doesn't [17:53:35] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:42] so then how do I test the change? [17:53:43] you can't do anything with the service ip on the LVS server itself [17:53:44] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:58] you test by connecting to ms-fe1/ms-fe2 directly, like pybal itself does too [17:54:07] well, I did that part already... [17:54:14] if that works it'll be fine [17:54:15] but that's not really testing my change. [17:54:22] so just apply that change in pybal.conf, and see if it works [17:54:27] since lvs3 is not really active, it's fine [17:54:40] right, how do I see if it works? (without making lvs3 active) [17:54:49] just the pybal log? [17:54:49] by checking /var/log/pybal.log [17:54:51] and also ipvsadm -l [17:54:56] RECOVERY - NTP on db1040 is OK: NTP OK: Offset 0.003578186035 secs [17:54:57] indeed [17:55:00] hmm... [17:55:05] if it doesn't work it'll mark the hosts as down [17:55:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.108 seconds [17:56:00] well, it says the hosts are up but I don't see the requests for my health check file on ms-fe1's access log. [17:56:17] PROBLEM - MySQL Slave Running on db1040 is CRITICAL: CRIT replication Slave_IO_Running: No Slave_SQL_Running: No Last_Error: Rollback done for prepared transaction because its XID was not in the [17:56:21] weird [17:56:33] it may be that it's not getting logged on ms-fe1 [17:56:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:41] tcpdump perhaps [17:56:47] doing it now. 
[17:56:47] tcpdump -i eth0 host lvs3.pmtpa.wmnet [17:56:56] then you only see traffic from lvs3, not client traffic [17:57:12] it's doing requests every 30s or so [17:57:46] New patchset: Lcarr; "Reenabling icinga install on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3120 [17:57:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3120 [17:58:18] ah, my puppet change didn't take that time. [17:58:19] New patchset: RobH; "updated ipmi script to work a bit better, added iron into site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3121 [17:58:26] so it's still testing the old URL. [17:58:32] New patchset: RobH; " left out one tiny change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3122 [17:58:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3121 [17:58:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3122 [17:58:46] aww man, it pushed them as two changes, damn it. [17:59:10] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3120 [17:59:13] my food is ready [17:59:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3120 [17:59:23] maplebed: call me if you have issues/downtime [17:59:26] i guess it works, just annoying cuz i didnt expect it to. [17:59:30] mark: ok. [17:59:31] thanks. [17:59:38] bbl [17:59:57] RobH: that's happened to me a bunch too. I agree it's annoying, but it does work. 
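The verification recipe mark gives above — apply the pybal.conf change on the inactive host, then watch /var/log/pybal.log and `ipvsadm -l`, with tcpdump on the backend — comes down to confirming that every backend stays up. The log lines below are a made-up pybal-style illustration of that check, not real pybal output:

```shell
# Real commands on the inactive LVS host would be:
#   tail -f /var/log/pybal.log   # pybal's view of each backend's health
#   ipvsadm -l                   # the kernel's IPVS table of real servers
# Hypothetical pybal-style excerpt, to show what to look for:
cat > /tmp/pybal.sample <<'EOF'
[swift] Server ms-fe1.pmtpa.wmnet (enabled/up/pooled)
[swift] Server ms-fe2.pmtpa.wmnet (enabled/down/not pooled)
EOF
# A backend whose health check fails gets marked down and depooled;
# count the ones pybal still considers healthy:
grep -c 'enabled/up/pooled' /tmp/pybal.sample   # → 1
```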
[18:00:11] yea just gotta make sure i push both and do so in order i guess [18:00:44] New review: RobH; "easy changes to a server no one is using yet and a script i wrote anyhow" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3121 [18:00:47] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3121 [18:01:33] New review: RobH; "updated in script help prompts" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3122 [18:01:35] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3122 [18:01:57] heh, every gerrit commit makes it sound like the world is ending on my computer (my name highlight in irc) [18:02:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:15] perhaps having it post to both tech and ops is a bit overkill. [18:02:30] the origin production could just echo here [18:02:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.064 seconds [18:04:23] RECOVERY - MySQL Slave Running on db1040 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [18:08:32] !log stopping pybal on lvs4 - should fail over to lvs3 [18:08:35] Logged the message, Master [18:09:56] PROBLEM - MySQL Slave Delay on db1040 is CRITICAL: CRIT replication delay 27764 seconds [18:09:58] !log power cycling db1020, which also froze this morning [18:10:00] !log failover successful, restarted pybal on lvs4, failback successful. [18:10:01] Logged the message, Master [18:10:05] Logged the message, Master [18:13:25] mark: any idea why traffic spiked so much when I triggered the failover? 
http://screencast.com/t/Rf9za9clmJw [18:23:08] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.152 seconds [18:23:08] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.149 seconds [18:28:48] Jeff_Green: yo [18:29:18] tfinc: yo. i'm running out the door to fetch my kids from school, sitter locked her keys in the car [18:29:23] this is day of madness [18:29:35] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:35] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:35] can I ping you in about 45min? [18:29:40] New patchset: Lcarr; "inserting icinga config file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3123 [18:29:45] Jeff_Green: when would be a good time to sync up about pediapress stuff ? [18:29:46] sure [18:29:48] seeya then [18:29:53] New patchset: Lcarr; "adding config files into git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3124 [18:30:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3123 [18:30:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3124 [18:30:06] i'm not going to get to it until tomorrow--got a bunch of fundraising/civicrm stuff in my queue [18:30:16] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3123 [18:30:19] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3123 [18:30:33] later today could work assuming all the hell that's broken loose gets reined in again.
let's check in in ~45 [18:30:41] k [18:31:57] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3124 [18:32:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3124 [18:37:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:32] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay 0 seconds [18:40:32] RECOVERY - MySQL Slave Delay on db1040 is OK: OK replication delay 0 seconds [18:43:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.905 seconds [18:50:13] New patchset: Pyoungmeister; "using these would be smart!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3125 [18:50:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3125 [18:51:50] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3125 [18:51:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3125 [18:54:47] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.765 seconds [18:55:06] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3115 [19:02:24] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:21] !log streaming hotbackup of db1041 to db56 (new s7 slave replacing db18) [19:05:24] Logged the message, Master [19:12:45] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.300 seconds [19:15:17] New patchset: Lcarr; "Putting service definition after files installed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3126 [19:15:29] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3126 [19:15:42] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3126 [19:15:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3126 [19:17:15] !log iron updated to use ipmi_mgmt script [19:17:18] Logged the message, RobH [19:18:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:03] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.376 seconds [19:19:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:06] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.2599349167 (gt 8.0) [19:24:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.746 seconds [19:25:21] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:30] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.031 seconds [19:31:48] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.239 seconds [19:37:30] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.98531258333 (gt 8.0) [19:38:06] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:06] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:03] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [19:40:48] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 132 MB (1% inode=61%): /var/lib/ureadahead/debugfs 132 MB (1% inode=61%): [19:44:24] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.706 seconds [19:50:42] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [19:56:32] RECOVERY - Disk space on srv219 is OK: DISK OK [20:01:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.677 seconds [20:07:02] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.60560366667 [20:08:59] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.894 seconds [20:09:26] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.022 seconds [20:14:11] New patchset: Lcarr; "Trying to ignore this as a requirement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3127 [20:14:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3127 [20:15:44] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:17:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:26] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.10203858333 (gt 8.0) [20:23:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.177 seconds [20:25:10] woosters: you there? [20:27:35] back [20:27:43] tfinc [20:28:06] what else do you guys need for this hardware request? http://rt.wikimedia.org/Ticket/Display.html?id=2582 [20:28:15] its for sms/ussd [20:29:59] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:36] i think we have the info .. patrick has replied back to Mark's question [20:32:55] yeah, just one high performance misc server in each data center then [20:34:02] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.934 seconds [20:34:38] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.327 seconds [20:35:40] mark: woosters : great. 
if we have the hardware whats next to get it provisioned ? [20:36:01] patience [20:37:47] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.557444 [20:38:26] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3127 [20:38:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3127 [20:39:23] patience is easy once i have a timeline/set of expectaions [20:39:26] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 131 MB (1% inode=61%): /var/lib/ureadahead/debugfs 131 MB (1% inode=61%): [20:39:30] tfinc - the ticket has been updated. i'll followup with robhalsell tomorrow [20:39:35] expectations* [20:39:38] woosters: thanks [20:39:47] expectation is that you'll get two hosts in the next couple of days [20:40:20] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:51] we try to keep those boxes spare [20:40:56] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:42:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:43:30] tfinc: want to talk pediapress? [20:43:57] did you get that project jeff? [20:44:04] you get all the cool stuff [20:44:48] yes, I certainly do! [20:45:40] better him than u, mark ;-P [20:46:24] if I did it, the service would probably disappear [20:47:18] woosters: you know, at CL I became famous for my skills at suppressing madness. I completed the pass-the-torch training to the guy who inherited my job yesterday. It was a one-sentence training: "Repeat after me: 'No.'" [20:47:54] did u write 2 sealed letters to him as well? [20:48:08] heard that joke before? 
[20:48:08] hahahahahh [20:48:18] i think so yeah [20:48:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.373 seconds [20:49:56] RECOVERY - Puppet freshness on cp1036 is OK: puppet ran at Tue Mar 13 20:49:34 UTC 2012 [20:50:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.428 seconds [20:51:53] RECOVERY - Disk space on srv219 is OK: DISK OK [20:55:31] Jeff_Green: sure [20:57:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:55] woosters: i have not, pls share. [21:01:47] New patchset: Lcarr; "trying another commenting out" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3128 [21:02:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3128 [21:02:04] tfinc: ok--so I reviewed the email thread and I have a very general idea of the situation [21:02:17] robh - http://toperjokes.blogspot.com/2007/05/two-envelopes.html [21:02:38] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3128 [21:02:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3128 [21:02:53] heh [21:03:02] ahh bitter joke. 
i like [21:05:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.370 seconds [21:10:05] RECOVERY - Host cp1036 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [21:11:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:17:44] RECOVERY - mysqld processes on db56 is OK: PROCS OK: 1 process with command name mysqld [21:20:35] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.972 seconds [21:20:53] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [21:21:47] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.857 seconds [21:22:50] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [21:23:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:23:35] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 2591 seconds [21:24:20] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 2293 seconds [21:25:50] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [21:25:50] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours [21:25:50] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours [21:27:02] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:05] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:29:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.035 seconds [21:29:44] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 0 seconds [21:29:53] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [21:29:53] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 
hours [21:30:20] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 0 seconds [21:36:25] New patchset: Asher; "making sync_binlog=1 the default for prod dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3130 [21:36:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3130 [21:39:38] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.892 seconds [21:39:47] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [21:40:32] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.567 seconds [21:46:59] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:48:02] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:48:47] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [21:48:56] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.665 seconds [21:55:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:57:47] binasher: jfyi i'm going to be updating MobileFrontend on the cluster probably in the 15 minutes or so - will you be around to flush the varnish cache? [21:58:11] s/the 15/the next 15 [21:58:46] why will varnish need flushing? [21:59:17] binasher: word on the street is varnish needs flushing post MobileFrontend deployments [21:59:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.953 seconds [22:01:48] awjr: well, we don't always need to flush the cache [22:02:32] preilly: ok.. so when do i need to make sure the cache is flushed after a MobileFrontend deployment? [22:02:36] awjr: well, we only do it if the page structurally changes drastically [22:03:12] preilly: so i take it modest CSS changes and a couple of one-line bug fixes don't count?
[22:03:32] awjr: well, basically when the page and the resources that it loads would conflict with the cached assets [22:03:53] awjr: well, the CSS and JS should have different version query strings and be okay [22:04:09] awjr: did those get updated in the ApplicationTemplate ? [22:04:10] ah ok [22:04:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:04:33] preilly: no they did not thanks for reminding me. this is reminding me what life was like before RL [22:04:41] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.566 seconds [22:05:53] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:08:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.035 seconds [22:13:14] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:26] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.788 seconds [22:23:14] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:29] New patchset: Reedy; "Switch foreachwikiindblist to use MWScript.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3131 [22:23:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3131 [22:25:38] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.401 seconds [22:27:17] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 4.377 seconds [22:29:47] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3131 [22:35:59] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:40:20] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:33] New patchset: Bhartshorne; "removed extra slash from squid purge URLs. purge was generating http://upload...//wikipe... rather than http://upload.../wikipe..., causing the purge to fail (silently)." [operations/software] (master) - https://gerrit.wikimedia.org/r/3132 [22:42:24] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3132 [22:42:26] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/3132 [22:43:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:32] New patchset: Bhartshorne; "swiftcleaner calls htcp.php. may as well install it along side swiftcleaner." 
[operations/software] (master) - https://gerrit.wikimedia.org/r/3133 [22:46:19] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3133 [22:46:21] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/3133 [22:49:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.817 seconds [22:56:59] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.907 seconds [22:56:59] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.253 seconds [23:03:17] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:03:26] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:09:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.806 seconds [23:17:50] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.088 seconds [23:24:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:20] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.441 seconds [23:34:35] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 4.862 seconds [23:47:38] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:27] New patchset: Bhartshorne; "first draft of the swift cleaner stuff. I know this doesn't work but I want to check it in for reviews." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134 [23:49:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3134 [23:51:19] !log upgrading bugzilla to 4.0.5 [23:51:22] Logged the message, Master [23:53:02] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:56] New patchset: Bhartshorne; "first draft of the swift cleaner stuff. I know this doesn't work but I want to check it in for reviews." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134 [23:54:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3134
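Change 3132 above removed an extra slash from the squid purge URLs: the purge was generating `http://upload...//wikipe...` instead of `http://upload.../wikipe...`, so the purge never matched the cached object and failed silently. A sketch of that kind of URL normalization (the sed pattern is illustrative, not the deployed fix; the URL is the test thumbnail quoted earlier):

```shell
# Collapse repeated slashes in the path while leaving the '//' of the
# scheme ('http://') untouched.
url='http://upload.wikimedia.org//wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/46px-Little_kitten_.jpg'
printf '%s\n' "$url" | sed -E 's|([^:/])/{2,}|\1/|g'
# → http://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Little_kitten_.jpg/46px-Little_kitten_.jpg
```

Because the cache keys on the exact URL, a purge for the doubled-slash form is simply a miss, which is why nothing errored out.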