[00:00:01] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:00:44] yurik: well i wouldn't know the difference. for all i know you just filed it yourself [00:00:55] exactly :) [00:01:13] yurik: (or i did for that matter. there are tickets i've modififed that i can't see) [00:01:31] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [00:02:31] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused [00:03:11] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [00:03:29] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.329 second response time [00:03:41] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57210 [00:05:07] !log removed sda3/varnish.persist on cp1021, restarted varnish [00:05:14] Logged the message, Master [00:05:24] andrewbogott, is stat1 accessible directly or i should go through some bastion [00:05:44] stat1 is accessible directly [00:05:47] I believe it has a public IP but probably won't forever. [00:05:49] PROBLEM - DPKG on db1057 is CRITICAL: NRPE: Command check_dpkg not defined [00:05:51] yurik: you should bastion through bast1001 however to get practice [00:05:59] PROBLEM - Disk space on db1057 is CRITICAL: NRPE: Command check_disk_space not defined [00:06:03] stat1001 is the forever public one? [00:06:16] LeslieCarr: no, stat1 is tampa? 
[00:06:19] PROBLEM - RAID on db1057 is CRITICAL: NRPE: Command check_raid not defined [00:06:32] stat1 is tampa [00:06:38] so fenari [00:06:44] stat1001 is for hosting web apps, stat1 is for number crunching and will lose public ip soonish [00:06:47] well you can ssh through whatever [00:07:01] right, but why bounce around :) [00:07:14] well since most traffic is going via eqiad anyways ;) [00:08:47] ori-l, PHP Fatal error: require() [function.require]: Failed opening required '/usr/local/apache/common-local/php-1.21wmf12/extensions/PostEdit/PostEdit.hooks.php' [00:08:57] and A LOT of it in error log [00:09:19] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:09:28] yurik: anyway, may be 30ish mins before you can get everywhere unless andrew did manual puppet runs [00:09:37] andrewbogott: you merged on sockpuppet? [00:09:44] I did. [00:09:50] ori-l: Looks like 1.21wmf12 has an older version with no hooks file.. [00:10:04] danke :) [00:10:29] New patchset: Ryan Lane; "Don't require a specific version of opendj" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57226 [00:10:31] thanks andrewbogott , jeremyb_ [00:10:37] Reedy: looking, hang on [00:10:46] Neither does 1.22wmf1 [00:11:30] Reedy: but the version I just sync'ed doesn't reference that file [00:11:52] APC cache? [00:11:59] s/ cache// [00:12:39] Maybe. How do I check (and fix, if that's the problem)? [00:13:09] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [00:13:10] They stopped 12 minutes ago [00:13:12] Nothing to do [00:13:23] Just haven't been pushed out of the last 1000 lines due to a lack of other errors ;) [00:13:38] oh, I can always help with that [00:14:58] so, how do i avoid this in the future, if i need to remove a file? remove references to it, sync, and then actually remove it on a subsequent deployment? 
[00:15:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57226 [00:15:53] ^ Reedy [00:15:59] Seems a bit OTTYeah [00:16:02] blah [00:16:21] I wonder if rsync is removing the hooks file before the loader file has been loaded.. [00:17:19] In which case... Force the loader file first? Then sync-dir... Have an empty file to be deleted? [00:17:42] Or the easiest, don't care [00:17:43] :D [00:18:09] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:18:48] Reedy: from rsync man-page: --delete-after receiver deletes after transfer, not during [00:19:19] Are we using that? [00:19:23] * ori-l checks [00:20:23] nope [00:21:01] Sounds like it might be a good enhancement then [00:21:17] Reedy: there's also: --delete-delay find deletions during, delete after [00:21:29] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:25:24] anomie|away, did you deploy your fix? i was curious to see the process [00:25:57] yurik- Yeah, over an hour ago. What did you want to see? [00:26:14] just curious what steps are needed to do a depl like that [00:26:24] i'm sure i will have plenty of OMG bugs [00:28:09] yurik- Step 0 is talking to people (such as greg-g (sorry for the ping)). Then basically follow https://wikitech.wikimedia.org/wiki/How_to_deploy_code [00:29:14] right, so its similar to what max was showing today for the regular mobile frontend deployment. Will need to get a fenari account at some point [00:29:24] thx [00:30:00] TimStarling: ^ quick sanity check on that idea? (that is, using '--delay-updates --delete-delay' in sync-common-file to prevent brief but potentially harmful inconsistencies) [00:30:35] i'm worried that the flags are already set in some config file that i didn't know to look up [00:31:20] some useful stuff about delete-delay in this message and follow-ups: http://lists.samba.org/archive/rsync/2008-June/021107.html [00:32:29] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [00:37:46] ori-l: I think there might be a couple of places you might need to do it.. But shouldn't be hidden [00:38:22] file/scap [00:39:29] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:43:09] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [00:44:25] * jeremyb_ repeats: what to do with 4785/4685? [00:44:35] andrewbogott: whatchya think? [00:45:59] * jeremyb_ wonders if RT assumes that no one will ever make a mistake... [00:46:33] jeremyb, am I looking at what you're looking at? email for echo? redirect for wikimaps?
[00:46:50] andrewbogott: look at the last 3 msgs on wikimaps [00:47:33] * jeremyb_ RT fu is too weak for this situation :P [00:48:26] Hm… probably best to bug mark about those in the morning… I have neither an opinion nor relevant skills :) [00:49:10] i was thinking merge and make a new wikimaps [00:53:09] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:54:59] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [00:56:47] i just got a 503 from upload [00:58:02] yeah, there's network problems, machines keep maxing out [00:58:17] i'm working on making all the upload varnish machines into 2gig instead of 1gig [00:58:29] takes more steps than i thought [00:58:38] but at least there's for loops [00:58:38] oh, i didn't realize it was upload in particular [00:58:44] haha [01:00:48] yep [01:00:50] sigh [01:01:13] * jeremyb_ doesn't suppose there's anything he can do [01:01:59] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:02:29] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [01:04:55] !log applying a big change to varnish interface groups - risk higher than normal [01:05:03] Logged the message, Mistress of the network gear. [01:05:14] morebots doesn't care about risk [01:05:14] I am a logbot running on wikitech-static. [01:05:14] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [01:05:14] To log a message, type !log . [01:05:19] damnit [01:05:23] hehehe [01:05:28] hmm, has the puppet been synced? i still can't login to stat1 - no supported authenication methods, publickey [01:05:44] yurik: ssh -vv -> pastebin [01:07:55] cp1032 is standing out from the rest on ganglia. doesn't seem to be any different in site.pp though [01:08:05] i wonder what the difference is [01:08:13] yurik: you're in NYC or SF? 
[01:08:19] jeremyb_, NYC [01:08:28] yurik: pastebin? [01:08:30] i'm converting puttykey to ssh [01:08:35] sec :) [01:08:43] ohhh, putty [01:08:49] then you can't do ssh -vv :) [01:09:00] you're moving it to a different machine? [01:09:00] once i convert, i should be, right? :)( [01:09:06] putty has logs too [01:09:13] can you give more of those logs? [01:09:28] let me try with openssh [01:09:36] k [01:09:39] also, i will double check that i have the right pubkey submitted [01:09:53] i did it kinda by hand - taking the pub key and removing \n [01:09:55] you could also put that key on labs and try connecting to labs [01:10:06] true that [01:10:14] errr [01:10:26] actually you're not supposed to do that on second though [01:10:28] thought* [01:10:35] labs should be it's own key [01:10:51] (as a policy, not actually enforced) [01:11:04] doesn't matter as much for me because i never forward my agent [01:11:10] but some people do [01:11:22] much easier in labs with constant forwarding :) [01:12:38] hrmmm, yurik's not in wmf yet [01:12:47] he can be my other greg-g guinea pig [01:13:00] oh boy [01:13:05] ok, i got the log [01:13:14] does it have any security stuff? [01:13:22] i can't remember [01:13:39] (for puppet) [01:13:43] err [01:13:45] putty* [01:13:52] doesn't look like its a secret, pasting... [01:14:36] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [01:16:20] jeremyb_, that's a ssh -vv log [01:16:26] PROBLEM - Host cp1036 is DOWN: PING CRITICAL - Packet loss = 100% [01:16:28] yurik: ssh-keygen -lf /c/Users/User/.ssh/id_rsa.pub [01:16:56] RECOVERY - Host cp1036 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [01:16:58] cp1036 should be back up soon [01:16:59] hehe [01:17:28] is your login name really "User"? [01:17:39] and is your password really '... nevermind [01:18:28] ori-l: i mean in the path i gave above. that's literally what it was [01:18:45] jeremyb_, my key file is not in that dir [01:18:54] how do i specify an alternative key? [01:18:57] well where is it? :) [01:18:58] -i [01:19:06] ssh -i path/to/key -vv [01:19:17] jeremyb_, yes, -i is what i used for ssh [01:19:22] what abotu ssh-keygen [01:19:29] need to do conversion [01:19:30] sec [01:20:36] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:22:32] jeremyb_, same key as in puppet [01:22:39] i could paste it here if you want :) [01:22:46] (puppet is public anyway) [01:22:46] yes please :) [01:22:50] AAAAB3NzaC1yc2EAAAABJQAAAgEAks66YFTBrrC9Wv/rPwIf9cTJO1RxsXHMEcWJjosn9fxvUS57KAw2UrCwinu1T1Hng59V+grHxp2wY7Bke3NmYng2OQacH2HKekPFP3fG82OQlj0YRE52deNwlrfBIx7Yg915zpXjXSQi9D5DIncYN/8jE7Q3Shlw0yRfFLmP02zpiX0Vm1d+g8FM0aMaIPR80KlIFSADEYoo2LD9b9gKsIJQ3643geAlzjye7VTr+ojGaPrW7w+tB5ikPgtx8jQnve5UpfKaQHJcdS1of3GNy3/08i+gScog3oxkneBPIW0Wkb3sNwPZ2Y+vxYSIKzO6z/V/HGSNOYQJy7QJRApBav6sKZxdBSPGi3+6vgHxf4IgUVtikJGz [01:22:52] TZ2jtWoqNv/j4h4gfehPkr5hQBJIkJQwTM/JPPbWPGOiWmFQkZeDTsoZGgi5B9hmM3UlelN7egyDZXCEvCirR9moviYI9Dr8VQsT/koyRX3kYdEQV19bHiou+ze6mmKO3OI4EmHkdtR55J1cR3/+7Q8GCAfTiD2KKj7yUEjZMewdOcbZzn29AXkc+90wiuWUWxqan7T5iePRvNPfjHg6ntJDs3tG/WdgF8HluXcWGZHa1Fk2kobK+/WFkGz4CuW9asbUgg+2TOLjvYFzEKgKqS8194nf0WZvRnjy3oFeuj0wwdALmuZDnus= [01:23:06] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:23:20] it starts with ssh-rsa [01:23:25] errmmm, but that broke across lines. and really i want to see what ssh-keygen -lf says too [01:23:39] a yes, sec [01:24:21] jeremyb_, 4096 fe:4e:90:20:3f:45:3d:33:85:56:5b:bf:62:b8:11:c6 pub1 [01:25:08] ok, that matches what i have [01:26:04] New patchset: Lcarr; "making cp1021 to cp1036 into 2gig aggregated machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57237 [01:26:22] icinga says there's been a recent puppet run [01:26:49] jeremyb_, icinga has been lying to its masta [01:26:59] although... [01:27:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57237 [01:27:06] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [01:27:42] that puppet run is uncomfortably close to when it was merged [01:27:48] > puppet ran at Wed Apr 3 00:07:42 UTC 2013 [01:28:00] so, 80 mins ago was last run finish [01:28:19] yeah, and merge takes forever... what do you mean 80? isn't it suppose to run every 30 min? [01:29:06] that's my point [01:29:11] let's see if cp1022 survived [01:29:56] LeslieCarr: running puppet on 1022 first? [01:30:06] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:30:36] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [01:30:47] yeah i did [01:30:48] wtf [01:32:03] New patchset: Lcarr; "Revert "making cp1021 to cp1036 into 2gig aggregated machines"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57238 [01:32:19] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57238 [01:32:36] PROBLEM - RAID on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
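[Editor's note] The debugging above hinges on comparing `ssh-keygen -lf` fingerprints of the key on both ends (the log shows the older MD5-hex format; modern OpenSSH prints SHA256 by default and needs `-E md5` for the hex form). A sketch of the check, generating a throwaway key rather than using a real one; the PuTTY export command in the comment assumes the command-line `puttygen` tool is installed:

```shell
# Editor's sketch: a PuTTY .ppk key must first be exported to OpenSSH format,
# e.g.  puttygen key.ppk -O private-openssh -o id_rsa   (assumes CLI puttygen).
# Here we generate a throwaway RSA key just to demonstrate the fingerprint check.
set -e
dir=$(mktemp -d)
ssh-keygen -q -t rsa -b 2048 -N '' -f "$dir/id_rsa"
ssh-keygen -lf "$dir/id_rsa.pub"   # bit length, fingerprint, comment, key type
# To then test the key against a host with full debugging output:
#   ssh -i "$dir/id_rsa" -vv user@host
```

If the fingerprint printed locally matches the one computed from the public key checked into puppet, the key material itself is correct and the failure lies elsewhere (puppet not yet run, wrong username, etc.), which is where the conversation above ends up.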
[01:32:39] unsure why it failed [01:32:48] :-/ [01:33:06] PROBLEM - Host cp1029 is DOWN: PING CRITICAL - Packet loss = 100% [01:33:16] PROBLEM - Host cp1033 is DOWN: PING CRITICAL - Packet loss = 100% [01:33:39] that is all me [01:33:51] wtf cp1029 is alive [01:34:06] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [01:34:06] PROBLEM - Host cp1032 is DOWN: PING CRITICAL - Packet loss = 100% [01:34:08] ipv6 is happy, ipv4 is not [01:34:39] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [01:34:40] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [01:34:40] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [01:34:49] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [01:34:49] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [01:34:52] PROBLEM - RAID on wtp1004 is CRITICAL: Timeout while attempting connection [01:34:52] PROBLEM - RAID on wtp1002 is CRITICAL: Timeout while attempting connection [01:34:53] PROBLEM - RAID on wtp1003 is CRITICAL: Timeout while attempting connection [01:35:39] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [01:35:39] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [01:35:49] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [01:35:49] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:09] PROBLEM - Host cp1031 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:09] PROBLEM - RAID on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:39:19] PROBLEM - RAID on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:40:39] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [01:40:49] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.722 second response time [01:41:39] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.753 second response time [01:41:39] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.201 second response time [01:41:39] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.960 second response time [01:41:49] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63158 bytes in 5.856 second response time [01:41:53] jeremyb_, is it totally dead? or just somewhat? :) [01:42:09] PROBLEM - RAID on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:42:26] LeslieCarr: both nics are up [01:42:29] but the bond is down [01:42:33] idk. ops are tied up with upload issues. unless you can bribe Coren to look at the logs [01:42:39] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.987 second response time [01:42:51] nah, no rush [01:43:01] uploads are more important than this :) [01:43:19] PROBLEM - swift-container-replicator on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:43:19] PROBLEM - swift-object-updater on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:43:33] yep [01:43:35] ADDRCONF(NETDEV_UP): bond0: link is not ready [01:43:38] hm [01:43:39] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [01:43:39] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [01:43:39] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [01:43:39] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [01:43:39] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [01:43:47] and the mac address is all 00's [01:43:49] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [01:43:59] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:14] New patchset: coren; "Add labstore100[1-4] to dhcp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57239 [01:44:49] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.060 second response time [01:45:09] RECOVERY - swift-container-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [01:45:09] RECOVERY - swift-object-updater on ms-be3 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [01:49:49] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [01:50:49] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [01:50:53] New review: Jeremyb; "mixing uppercase and lowercase MACs :( (but already was inconsistent when you started)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57239 [01:54:08] New review: coren; "Simple addition with a review; pushing." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/57239 [01:54:09] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57239 [01:54:48] Ryan_Lane: any ideas ? [01:54:54] hm [01:55:12] LeslieCarr: https://www.kernel.org/doc/Documentation/networking/bonding.txt [01:55:27] there's a section that mentions reasons for a 00:00:00… address [01:55:36] but it's referencing vlans [01:55:52] yeah, so it does appear to make sense that it's waiting until a slave interface joins [01:55:58] now why aren't eth0 and eth1 joining bond0 [01:56:02] indeed [01:56:36] is it just a puppet issue and manually works? or you can't get it to work at all? [01:57:33] LeslieCarr: I just down'd eth1 and up'd it [01:57:44] maybe do the same with eth0? [01:57:47] if I do that it'll kick me out [01:57:52] basically can't get it to work at all [01:57:58] I'm assuming you're connecting via the console [01:58:12] not connected to the conosle of that one [01:58:13] i can though [01:58:25] oh that was weird [01:58:27] cp1033 ... [01:58:34] i had been doing ifdown and ifup a few times [01:58:38] then magically, voila, it comes up [01:58:44] did you do /etc/init.d/networking restart? [01:58:50] RECOVERY - Host cp1033 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [01:58:55] oh yeah, and even tried rebooting the damn box [01:58:58] heh [01:59:12] wtf [02:01:00] should bond-master be set on the devices?
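[Editor's note] The question just above (whether `bond-master` must be set on the slave interfaces) turns out to be the fix a few lines later: without it, ifupdown never enslaves the NICs, so bond0 stays down with the all-zeros MAC observed here (per the kernel's bonding.txt, the bond takes its MAC from its first slave). A minimal Debian/Ubuntu `/etc/network/interfaces` sketch; interface names, addresses, and bond mode are illustrative assumptions, not the production config:

```
# /etc/network/interfaces — minimal two-NIC bond (editor's sketch; names,
# mode, and addressing are assumptions, not the actual cp10xx config)
auto eth0
iface eth0 inet manual
    bond-master bond0   # without this line the slave never joins the bond

auto eth1
iface eth1 inet manual
    bond-master bond0

auto bond0
iface bond0 inet static
    address 10.64.0.10
    netmask 255.255.252.0
    gateway 10.64.0.1
    bond-slaves eth0 eth1
    bond-mode 802.3ad   # LACP aggregation, matching the switch-side LAG
    bond-miimon 100
```

After editing, the sequence used in the log (ifdown/ifup on both slaves, then `service networking restart`) is what makes the slaves re-enslave; until at least one joins, `ADDRCONF(NETDEV_UP): bond0: link is not ready` is exactly the symptom to expect.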
[02:01:35] that fixed it on cp1029 [02:01:44] adding bond-master [02:02:26] ah looks like that's not being added [02:02:32] why do you hate the world puppet [02:03:20] heh [02:03:31] LeslieCarr: it did this on virt2 as well [02:03:40] oh god yes [02:03:50] it's not in puppet [02:04:01] that's right [02:04:29] PROBLEM - NTP on cp1022 is CRITICAL: NTP CRITICAL: Offset unknown [02:07:24] ahha think i got it [02:07:43] so add the bond-master, ifdown eth0 ifdown eth1 ifup eth0 ifup eth1 service networking restart [02:07:59] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:08:35] :) [02:08:35] ahha! [02:08:44] these ip route change default via 10.64.0.1 dev eth0 metric 100 initcwnd 10 [02:08:49] RECOVERY - Host cp1029 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [02:08:53] instead of bond0 [02:09:19] RECOVERY - NTP on cp1022 is OK: NTP OK: Offset -0.0003471374512 secs [02:13:49] RECOVERY - Host cp1031 is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms [02:14:07] New patchset: Lcarr; "Revert "Revert "making cp1021 to cp1036 into 2gig aggregated machines""" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57247 [02:14:44] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57247 [02:14:49] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [02:16:29] RECOVERY - Host cp1032 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [02:17:49] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:19:31] !log LocalisationUpdate completed (1.21wmf12) at Wed Apr 3 02:19:30 UTC 2013 [02:19:38] Logged the message, Master [02:20:39] PROBLEM - DPKG on ms-be8 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [02:20:59] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [02:21:19] PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:19] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:21] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:22] PROBLEM - MySQL Slave Running on db1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:21:29] PROBLEM - MySQL Idle Transactions on db1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:21:29] PROBLEM - Host cp1025 is DOWN: PING CRITICAL - Packet loss = 100% [02:21:39] PROBLEM - Apache HTTP on mw1210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:39] PROBLEM - Apache HTTP on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:49] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:49] PROBLEM - Apache HTTP on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:49] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:49] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:49] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:56] icinga!
[02:22:36] this is going to be a long night [02:22:59] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.514 second response time [02:22:59] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.000 second response time [02:22:59] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.858 second response time [02:22:59] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.205 second response time [02:22:59] RECOVERY - Apache HTTP on mw1215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.074 second response time [02:22:59] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.832 second response time [02:22:59] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.090 second response time [02:23:00] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.180 second response time [02:23:00] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.991 second response time [02:23:01] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.002 second response time [02:23:01] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.073 second response time [02:23:02] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63156 bytes in 1.908 second response time [02:23:02] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay seconds [02:23:03] RECOVERY - MySQL Slave Delay on db1017 is OK: OK replication delay seconds [02:23:09] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [02:23:09] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [02:23:09] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63158 bytes in 0.216 second response time [02:23:11] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63156 bytes in 0.512 second response time [02:23:14] RECOVERY - MySQL Slave Running on db1017 is OK: OK replication [02:23:19] RECOVERY - MySQL Idle Transactions on db1017 is OK: OK longest blocking idle transaction sleeps for seconds [02:23:29] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [02:23:29] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [02:23:29] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.052 second response time [02:23:39] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.047 second response time [02:23:39] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [02:23:39] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [02:23:39] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [02:23:39] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [02:23:39] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [02:23:39] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [02:23:40] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time [02:23:40] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [02:23:41] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [02:23:41] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [02:23:42] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [02:23:42] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [02:23:43] RECOVERY - Apache HTTP on mw1186 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [02:23:43] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:23:44] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [02:23:44] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [02:23:45] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [02:23:45] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [02:23:46] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [02:23:46] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:23:47] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [02:23:47] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [02:23:48] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [02:23:48] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:23:59] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21223 bytes in 0.743 second response time [02:25:39] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [02:26:29] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [02:27:29] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 26.15 ms [02:27:59] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [02:28:29] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [02:29:05] heh go icinga [02:31:09] RECOVERY - Host cp1025 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [02:31:45] !log LocalisationUpdate completed (1.22wmf1) at Wed Apr 3 02:31:44 UTC 2013 [02:31:52] Logged the message, Master [02:31:59] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [02:42:22] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [02:44:02] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:44:22] PROBLEM - Host cp1026 is DOWN: PING CRITICAL - Packet loss = 100% [02:44:22] PROBLEM - Host cp1024 is DOWN: PING CRITICAL - Packet loss = 100% [02:45:22] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [02:46:22] PROBLEM - RAID on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:46:42] PROBLEM - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:46:52] RECOVERY - Host cp1024 is UP: PING OK - Packet loss = 0%, RTA = 21.40 ms [02:48:42] RECOVERY - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 469 bytes in 0.036 second response time [02:50:13] RECOVERY - Host cp1026 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [02:55:02] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [02:58:02] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:59:02] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [03:01:02] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [03:01:42] PROBLEM - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [03:03:37] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [03:03:37] RECOVERY - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 712 bytes in 0.186 second response time [03:05:30] !log the final upload varnish is on 2gig instead of 1gig ! win! [03:05:37] Logged the message, Mistress of the network gear. [03:06:07] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [03:09:36] LeslieCarr: first graph - http://gdash.wikimedia.org/dashboards/reqerror/ [03:10:02] current: for both looks a lot better [03:11:19] cool [03:12:44] some machines are screwy in ganglia [03:12:47] PROBLEM - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [03:13:04] like how do you have a single machine with over a petabit of network capacity? 
[03:13:10] also, icinga ^ [03:13:37] RECOVERY - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 712 bytes in 0.006 second response time [03:14:34] only ipv6 has been flapping most recently [03:15:30] LeslieCarr: any chance of ipv6 specific issues with bonding? [03:15:55] oh [03:16:03] oh! yes, need to rerun puppet is a possibility [03:16:07] lemme do that [03:16:10] (thank you salt!) [03:16:31] maybe that will fix the ganglia craziness [03:16:43] and what about cp1021? ganglia says it's down [03:17:12] the ganglia petabyte craziness isn't really possible [03:17:15] to fix [03:19:56] Who do I talk to tomorrow if I want a pair of eyes to see if a server is actually physically wired? :-) [03:20:55] holy shit [03:21:03] cp1028 is already re-maxing itself out ? [03:21:04] wtf ? [03:23:09] Coren: just drop a ticket tonight and maybe it will already have been looked at when you get up :) [03:23:49] eqiad or pmtpa queue? [03:24:37] Coren: this is one of your new ones? must be eqiad based on dhcp conf [03:25:02] eqiad [03:25:36] so drop a ticket in eqiad :) [03:25:54] On my way to RT now. :-) [03:29:07] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:32:07] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [03:37:08] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:38:14] New patchset: MZMcBride; "Install lilypond on Apache nodes (used by Score extension)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56577 [03:52:57] yurik: still there? [03:53:09] New review: Krinkle; "I heard some talk recently that sounded like it was basically implementing what this commit does."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/15561 [03:53:13] things are getting kinda quiet in opsland :) [03:54:32] Krinkle: the new way is kinda the reverse of that. but not quite ready to do yet because waiting a bit on legal/paperwork [03:55:01] jeremyb_, yep [03:55:13] and still can't connect :) [03:55:50] someone want to check the logs for yurik ? or you're all going to get beer immediately? :-) [03:56:14] puppetd log @ stat1.wikimedia.org (and auth.log too i guess) [03:59:40] yurik: i just dealt with broken site, i'm sorry but not today [03:59:44] not unless it's site broken [03:59:51] it's not! [03:59:59] it's new access that's not working yet [04:00:46] then no [04:00:54] tomorrow [04:01:07] today i am going to sign off soon as i am sure that nothing will explode again [04:01:08] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [04:01:44] right :) [04:03:38] robla missed the party [04:03:51] the cake was a lie [04:04:05] party? [04:04:45] somehow I'm guessing that's a euphemism for something less than a party. just guessin' [04:05:32] perhaps scroll up and notice a pattern ? 
[04:05:44] of upload failures [04:07:30] LeslieCarr++ [04:07:52] i'm out until the phone pages again [04:07:52] bye [04:08:07] see ya [04:08:08] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [04:08:14] * robla continues to read the backlog [04:09:12] jeremyb_: "err: Failed to apply catalog: Parameter key failed: Key must not contain whitespace" [04:09:27] followed by yurik's key, whitespace-free as far as i can tell [04:10:03] hmmmm [04:11:08] (hi, robla) [04:11:20] howdy ori-l [04:11:55] I gather there was a configuration change which broke a pybal check on the upload varnishes [04:12:09] which caused half of them to be depooled, and then the other half were overloaded [04:12:38] the check was fixed but didn't take effect because pybal hadn't been restarted [04:13:29] LeslieCarr, absolutely no rush on that one :) [04:13:39] yurik: she left already [04:13:50] hehe, i stepped away for a sec [04:14:10] ori-l: do you have line #s, etc. ? can you pastebin? [04:14:44] jeremyb_: the key is surrounded by double quotes, which puppet will parse to interpolate strings [04:14:58] i know [04:15:00] i thought about that [04:15:01] there's no '$' in the key but i wonder if there is some other sequence that is tripping up puppet [04:15:04] but there's no $ in it [04:15:08] i have an idea [04:15:23] * ori-l waits for it [04:17:12] ori-l: how are you reading this anyway? you have sudo there? [04:17:47] the last time yurik encountered some baffling and mysterious bug it came down to CRLFs, i bet you $5 this is some funky dos<->unix issue too [04:18:01] jeremyb_: no. /var/log/puppet is root-only but /var/log/puppet.log is not [04:18:12] ahhhhh [04:18:13] cool [04:18:26] do you see handrade in the log? [04:19:52] Special:BannerRandom is 10% of our apache request rate [04:21:22] jeremyb_: oh, i found the issue [04:21:22] sec [04:21:22] ori-l: ; vs. , ?
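The failure modes being debugged above — puppet's "Key must not contain whitespace" error, the worry about `$` interpolation inside a double-quoted key, and the CRLF/dos<->unix theory — can all be caught by a small lint pass over a public-key line before it is committed. A minimal sketch in Python; the function name and the exact set of checks are illustrative, not anything from the puppet repo:

```python
import re

def check_authorized_key(line):
    """Return a list of problems found in one public-key line.

    Covers the failure modes from the log: embedded carriage returns
    (CRLF line endings), a '$' that puppet would try to interpolate in
    a double-quoted string, and a duplicated "ssh-rsa" pasted into the
    key material itself.
    """
    problems = []
    if "\r" in line:
        problems.append("carriage return (CRLF line ending?)")
    if "$" in line:
        problems.append("'$' would be interpolated in a puppet double-quoted string")
    parts = line.strip().split()
    if len(parts) < 2:
        problems.append("expected '<type> <base64-blob> [comment]'")
        return problems
    key_type, blob = parts[0], parts[1]
    if key_type not in ("ssh-rsa", "ssh-dss", "ssh-ed25519"):
        problems.append("unknown key type %r" % key_type)
    if "ssh-rsa" in blob:
        problems.append("key type repeated inside key material")
    if not re.fullmatch(r"[A-Za-z0-9+/=]+", blob):
        problems.append("non-base64 characters in key material")
    return problems
```

Run against a clean `ssh-rsa AAAA... comment` line it returns an empty list; against a key with an extra embedded " ssh-rsa " (the actual bug fixed in r57258) it flags the repeated key type.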
[04:23:55] New patchset: Ori.livneh; "Fix extra characters in bblack's SSH key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57258 [04:24:05] ^ that [04:24:24] hahahaha [04:24:38] redmond sadly exculpated [04:24:41] the one i didn't bother reviewing was the problem [04:25:21] that should have caused puppet issues everywhere though [04:25:28] lets see what happens after it's merged :) [04:27:40] New patchset: Ori.livneh; "Trim extraneous " ssh-rsa " from SSH key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57258 [04:28:12] ugh, I keep forgetting that gerrit doesn't wrap lines if you use the web UI to edit a commit [04:30:57] New patchset: Ori.livneh; "Trim extraneous " ssh-rsa " from SSH key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57258 [04:30:59] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [04:31:39] New patchset: Pyoungmeister; "setting ram to actually possible levels for the sanitarium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57259 [04:32:02] ori-l: you changed it to a comma? [04:32:11] no, it wasn't the semicolon [04:32:14] look again [04:32:37] i'm telling you, you changed it to a comma [04:32:53] yes, but that's gratuit [04:33:04] ok. i wondered if there was a reason [04:34:23] just good style [04:35:05] and you have a free dot to spend elsewhere [04:52:35] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [05:00:35] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [05:13:30] New review: Yurik; "did anyone even look at the previous patch? 
:)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57258 [05:15:45] yurik: 03 04:24:41 < jeremyb_> the one i didn't bother reviewing was the problem [05:24:33] New review: Jeremyb; "fu I8d624af5b7f2565064116" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57199 [05:25:28] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [05:26:00] New patchset: Jeremyb; "Trim extraneous " ssh-rsa " from SSH key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57258 [05:30:28] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [06:08:57] <^demon> !log gerrit: finished doing jgit gc on mediawiki/core. Repo size 3G -> 323M. Fresh clone time to <5m to localhost, <1.5m to other hosts inside wmf. I rock. Backup's in /home/demon/core.git in case something goes wrong. Bed time. [06:09:05] Logged the message, Master [06:14:50] Coren: the way to find a physical location is racktables. 
idk if you have access yet [06:36:13] PROBLEM - Puppet freshness on srv287 is CRITICAL: Puppet has not run in the last 10 hours [06:46:13] PROBLEM - Puppet freshness on analytics1009 is CRITICAL: Puppet has not run in the last 10 hours [06:46:13] PROBLEM - Puppet freshness on mw74 is CRITICAL: Puppet has not run in the last 10 hours [06:46:13] PROBLEM - Puppet freshness on mw1084 is CRITICAL: Puppet has not run in the last 10 hours [06:46:13] PROBLEM - Puppet freshness on mw1049 is CRITICAL: Puppet has not run in the last 10 hours [06:46:53] PROBLEM - DPKG on ms-be4 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:47:13] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [06:47:13] PROBLEM - Puppet freshness on mw1047 is CRITICAL: Puppet has not run in the last 10 hours [06:47:13] PROBLEM - Puppet freshness on mw1105 is CRITICAL: Puppet has not run in the last 10 hours [06:47:13] PROBLEM - Puppet freshness on mw1075 is CRITICAL: Puppet has not run in the last 10 hours [06:47:13] PROBLEM - Puppet freshness on mw1137 is CRITICAL: Puppet has not run in the last 10 hours [06:47:14] PROBLEM - Puppet freshness on mw1165 is CRITICAL: Puppet has not run in the last 10 hours [06:47:14] PROBLEM - Puppet freshness on mw1179 is CRITICAL: Puppet has not run in the last 10 hours [06:47:15] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [06:47:15] PROBLEM - Puppet freshness on mw45 is CRITICAL: Puppet has not run in the last 10 hours [06:47:16] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours [06:47:16] PROBLEM - Puppet freshness on search1005 is CRITICAL: Puppet has not run in the last 10 hours [06:47:17] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [06:47:17] PROBLEM - Puppet freshness on search1015 is CRITICAL: Puppet has not run in the last 10 hours [06:47:18] PROBLEM - Puppet freshness on 
srv291 is CRITICAL: Puppet has not run in the last 10 hours [06:47:33] PROBLEM - DPKG on ms-be2 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:48:03] PROBLEM - DPKG on ms-be9 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:48:13] PROBLEM - Puppet freshness on mw1078 is CRITICAL: Puppet has not run in the last 10 hours [06:48:13] PROBLEM - Puppet freshness on mw1094 is CRITICAL: Puppet has not run in the last 10 hours [06:48:13] PROBLEM - Puppet freshness on mw1169 is CRITICAL: Puppet has not run in the last 10 hours [06:48:13] PROBLEM - Puppet freshness on mw1030 is CRITICAL: Puppet has not run in the last 10 hours [06:48:13] PROBLEM - Puppet freshness on mw1136 is CRITICAL: Puppet has not run in the last 10 hours [06:48:14] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Puppet has not run in the last 10 hours [06:48:14] PROBLEM - Puppet freshness on mw1198 is CRITICAL: Puppet has not run in the last 10 hours [06:48:15] PROBLEM - Puppet freshness on mw39 is CRITICAL: Puppet has not run in the last 10 hours [06:48:15] PROBLEM - Puppet freshness on mw1181 is CRITICAL: Puppet has not run in the last 10 hours [06:48:16] PROBLEM - Puppet freshness on analytics1008 is CRITICAL: Puppet has not run in the last 10 hours [06:48:16] PROBLEM - Puppet freshness on mw95 is CRITICAL: Puppet has not run in the last 10 hours [06:48:17] PROBLEM - Puppet freshness on srv272 is CRITICAL: Puppet has not run in the last 10 hours [06:48:17] PROBLEM - Puppet freshness on srv264 is CRITICAL: Puppet has not run in the last 10 hours [06:48:18] PROBLEM - Puppet freshness on mw99 is CRITICAL: Puppet has not run in the last 10 hours [06:48:18] PROBLEM - Puppet freshness on wtp1002 is CRITICAL: Puppet has not run in the last 10 hours [06:48:33] PROBLEM - DPKG on ms-be1 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:49:13] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: Puppet has not run in the last 10 hours [06:49:13] PROBLEM - Puppet freshness 
on mw1074 is CRITICAL: Puppet has not run in the last 10 hours [06:49:13] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Puppet has not run in the last 10 hours [06:49:13] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: Puppet has not run in the last 10 hours [06:49:13] PROBLEM - Puppet freshness on mw36 is CRITICAL: Puppet has not run in the last 10 hours [06:49:14] PROBLEM - Puppet freshness on mw1138 is CRITICAL: Puppet has not run in the last 10 hours [06:49:14] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [06:49:15] PROBLEM - Puppet freshness on mw1116 is CRITICAL: Puppet has not run in the last 10 hours [06:49:15] PROBLEM - Puppet freshness on mw26 is CRITICAL: Puppet has not run in the last 10 hours [06:49:16] PROBLEM - Puppet freshness on mw117 is CRITICAL: Puppet has not run in the last 10 hours [06:49:16] PROBLEM - Puppet freshness on mw64 is CRITICAL: Puppet has not run in the last 10 hours [06:49:17] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Puppet has not run in the last 10 hours [06:49:17] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours [06:49:18] PROBLEM - Puppet freshness on search1023 is CRITICAL: Puppet has not run in the last 10 hours [06:50:13] PROBLEM - Puppet freshness on mw1022 is CRITICAL: Puppet has not run in the last 10 hours [06:50:13] PROBLEM - Puppet freshness on mw1040 is CRITICAL: Puppet has not run in the last 10 hours [06:50:13] PROBLEM - Puppet freshness on mw1062 is CRITICAL: Puppet has not run in the last 10 hours [06:50:13] PROBLEM - Puppet freshness on mw1218 is CRITICAL: Puppet has not run in the last 10 hours [06:50:13] PROBLEM - Puppet freshness on mw107 is CRITICAL: Puppet has not run in the last 10 hours [06:50:14] PROBLEM - Puppet freshness on mw53 is CRITICAL: Puppet has not run in the last 10 hours [06:50:14] PROBLEM - Puppet freshness on mw1185 is CRITICAL: Puppet has not run in the last 10 hours [06:50:15] 
PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours [06:50:15] PROBLEM - Puppet freshness on srv296 is CRITICAL: Puppet has not run in the last 10 hours [06:51:13] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [06:51:13] PROBLEM - Puppet freshness on mw1072 is CRITICAL: Puppet has not run in the last 10 hours [06:51:13] PROBLEM - Puppet freshness on mw1132 is CRITICAL: Puppet has not run in the last 10 hours [06:51:13] PROBLEM - Puppet freshness on mw111 is CRITICAL: Puppet has not run in the last 10 hours [06:51:13] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:51:14] PROBLEM - Puppet freshness on mw1219 is CRITICAL: Puppet has not run in the last 10 hours [06:51:14] PROBLEM - Puppet freshness on mw1178 is CRITICAL: Puppet has not run in the last 10 hours [06:51:15] PROBLEM - Puppet freshness on mw33 is CRITICAL: Puppet has not run in the last 10 hours [06:51:15] PROBLEM - Puppet freshness on mw87 is CRITICAL: Puppet has not run in the last 10 hours [06:51:16] PROBLEM - Puppet freshness on srv285 is CRITICAL: Puppet has not run in the last 10 hours [06:52:13] PROBLEM - Puppet freshness on caesium is CRITICAL: Puppet has not run in the last 10 hours [06:52:13] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [06:52:13] PROBLEM - Puppet freshness on mw1045 is CRITICAL: Puppet has not run in the last 10 hours [06:52:14] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Puppet has not run in the last 10 hours [06:52:14] PROBLEM - Puppet freshness on mw1080 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on mw1108 is CRITICAL: Puppet 
has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on mw1120 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on mw1073 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on mw1008 is CRITICAL: Puppet has not run in the last 10 hours [06:54:14] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours [06:54:14] PROBLEM - Puppet freshness on mw1082 is CRITICAL: Puppet has not run in the last 10 hours [06:54:15] PROBLEM - Puppet freshness on srv243 is CRITICAL: Puppet has not run in the last 10 hours [06:54:15] PROBLEM - Puppet freshness on srv293 is CRITICAL: Puppet has not run in the last 10 hours [06:54:16] PROBLEM - Puppet freshness on mw1199 is CRITICAL: Puppet has not run in the last 10 hours [06:54:16] PROBLEM - Puppet freshness on mw25 is CRITICAL: Puppet has not run in the last 10 hours [06:54:17] PROBLEM - Puppet freshness on mw1009 is CRITICAL: Puppet has not run in the last 10 hours [06:54:17] PROBLEM - Puppet freshness on srv251 is CRITICAL: Puppet has not run in the last 10 hours [06:54:18] PROBLEM - Puppet freshness on mw1148 is CRITICAL: Puppet has not run in the last 10 hours [06:54:18] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours [06:54:19] PROBLEM - Puppet freshness on ocg2 is CRITICAL: Puppet has not run in the last 10 hours [06:55:13] PROBLEM - Puppet freshness on mw1060 is CRITICAL: Puppet has not run in the last 10 hours [06:55:13] PROBLEM - Puppet freshness on mw1203 is CRITICAL: Puppet has not run in the last 10 hours [07:02:52] RECOVERY - DPKG on ms-be4 is OK: All packages OK [07:05:14] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57258 [07:05:32] RECOVERY - DPKG on ms-be2 is OK: All packages OK [07:07:29] apergos: around? 
[07:07:38] yes [07:08:07] paravoid: [07:08:43] hey [07:10:02] RECOVERY - DPKG on ms-be9 is OK: All packages OK [07:12:53] so, how's the C2100 replacement going? [07:12:55] 4 boxes left, right? [07:13:03] hi, any brave soul to give some guidance on how to run puppet apply on labs (and tweak .pp file to actually run?) I already got the selfhosted puppet up, but don't know how to decipher the cryptic messages like Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type monitor_group at /var/lib/git/operations/puppet/manifests/varnish.pp:3 on node mobile-varnish.pmtpa.wmf [07:13:03] yes, slow but steady [07:13:04] labs [07:13:38] yurik: apply won't work [07:13:58] paravoid, i thought if i change the .pp file a bit, i can make it run on labs? [07:14:24] my ultimate goal - get a copy of all mobile varnish config on the labs instance [07:14:33] so if there is an easier way, i'm all for it :) [07:14:49] it doesn't have to be puppetizable [07:14:52] self-hosted puppet is the way to go [07:14:57] i already got that [07:15:02] this runs a local puppetmaster [07:15:06] hello [07:15:08] that you then run the agent against [07:15:38] i got the selfhosted on the instance from https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [07:15:54] how do i run just the varnish stuff?
[07:16:04] i might also need cache [07:16:08] not sure [07:16:10] hi hashar [07:16:16] i heard you are a guru in this :) [07:16:33] RECOVERY - DPKG on ms-be1 is OK: All packages OK [07:17:26] yurik, I actually think that this work could be done inside the ops team [07:17:34] I mean, we won't stop you if you want to do all that [07:18:03] the varnish work I mean [07:18:25] paravoid, i have no objections if someone else could do it :) moreover, its the best course of events :) its just that i need to start developing, and need to have some sort of a testing rig [07:19:15] I was talking about the carrier ip thing [07:19:16] paravoid, do you know of any estimated timelines for this, so that i can tell dfoy to wait with the varnish stuff? [07:19:53] paravoid, are you saying you would rather do your own geoip-like db encoding? [07:19:59] and just set the X-CS for us? [07:20:07] again, no objections there :) [07:20:36] i could concentrate on the zero extension then [07:21:08] I'm saying that I think this may fall into ops territory [07:21:32] but I'm also saying that we're busy and this may take some time [07:21:49] yep :( hence i'm trying to help out :) [07:21:55] it works now, so it shouldn't be a huge deal to wait [07:22:15] you have a point too. [07:22:19] maybe i should push this back [07:23:33] will talk to dfoy tomorrow, see what he thinks. 
I would much rather not deal with it obviously, although a general knowledge of varnish & puppets might come in handy [07:24:03] nod [07:24:15] so we'll obviously help you if you want to get varnish/puppet experience [07:24:24] but I don't think you should feel obligated to fix this [07:24:52] PROBLEM - Puppet freshness on snapshot3 is CRITICAL: Puppet has not run in the last 10 hours [07:25:49] yurik, is it something like early morning in your parts?:) [07:25:52] PROBLEM - Puppet freshness on mw1058 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1061 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1070 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1085 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1119 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1157 is CRITICAL: Puppet has not run in the last 10 hours [07:25:53] PROBLEM - Puppet freshness on mw1170 is CRITICAL: Puppet has not run in the last 10 hours [07:25:53] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours [07:25:54] PROBLEM - Puppet freshness on mw48 is CRITICAL: Puppet has not run in the last 10 hours [07:25:54] PROBLEM - Puppet freshness on mw56 is CRITICAL: Puppet has not run in the last 10 hours [07:25:55] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [07:25:55] PROBLEM - Puppet freshness on srv235 is CRITICAL: Puppet has not run in the last 10 hours [07:25:56] PROBLEM - Puppet freshness on srv274 is CRITICAL: Puppet has not run in the last 10 hours [07:25:56] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [07:26:05] paravoid, so how would 
this cross-team request be done? Should i tell dfoy to talk to asher? or who puts it into ops backlog? [07:26:16] MaxSem, yeah, 3:30am, best time to be productive :) [07:26:27] RT [07:26:41] and then all-out assault on IRC [07:26:52] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: Puppet has not run in the last 10 hours [07:26:52] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Puppet has not run in the last 10 hours [07:26:52] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [07:26:52] PROBLEM - Puppet freshness on hooper is CRITICAL: Puppet has not run in the last 10 hours [07:26:52] PROBLEM - Puppet freshness on mw1039 is CRITICAL: Puppet has not run in the last 10 hours [07:26:52] PROBLEM - Puppet freshness on mw1177 is CRITICAL: Puppet has not run in the last 10 hours [07:26:53] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Puppet has not run in the last 10 hours [07:26:53] PROBLEM - Puppet freshness on mw1191 is CRITICAL: Puppet has not run in the last 10 hours [07:26:55] PROBLEM - Puppet freshness on mw120 is CRITICAL: Puppet has not run in the last 10 hours [07:26:55] PROBLEM - Puppet freshness on mw67 is CRITICAL: Puppet has not run in the last 10 hours [07:26:55] PROBLEM - Puppet freshness on mw78 is CRITICAL: Puppet has not run in the last 10 hours [07:26:55] PROBLEM - Puppet freshness on mw96 is CRITICAL: Puppet has not run in the last 10 hours [07:26:56] PROBLEM - Puppet freshness on search1009 is CRITICAL: Puppet has not run in the last 10 hours [07:26:56] PROBLEM - Puppet freshness on search1013 is CRITICAL: Puppet has not run in the last 10 hours [07:26:57] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [07:26:57] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [07:26:58] PROBLEM - Puppet freshness on srv239 is CRITICAL: Puppet has not run in the last 10 hours [07:26:58] PROBLEM - Puppet freshness
on srv247 is CRITICAL: Puppet has not run in the last 10 hours [07:26:59] PROBLEM - Puppet freshness on srv252 is CRITICAL: Puppet has not run in the last 10 hours [07:26:59] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Puppet has not run in the last 10 hours [07:27:48] yurik: why asher? :) [07:27:52] PROBLEM - Puppet freshness on mw1013 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1036 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1019 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1055 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1076 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1167 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1195 is CRITICAL: Puppet has not run in the last 10 hours [07:27:53] PROBLEM - Puppet freshness on mw17 is CRITICAL: Puppet has not run in the last 10 hours [07:27:53] PROBLEM - Puppet freshness on mw63 is CRITICAL: Puppet has not run in the last 10 hours [07:27:54] PROBLEM - Puppet freshness on mw71 is CRITICAL: Puppet has not run in the last 10 hours [07:27:54] PROBLEM - Puppet freshness on mw80 is CRITICAL: Puppet has not run in the last 10 hours [07:27:55] PROBLEM - Puppet freshness on mw84 is CRITICAL: Puppet has not run in the last 10 hours [07:27:55] PROBLEM - Puppet freshness on snapshot1001 is CRITICAL: Puppet has not run in the last 10 hours [07:27:56] PROBLEM - Puppet freshness on srv254 is CRITICAL: Puppet has not run in the last 10 hours [07:27:56] PROBLEM - Puppet freshness on srv257 is CRITICAL: Puppet has not run in the last 10 hours [07:27:59] yurik: RT is the way to go [07:28:52] PROBLEM - Puppet freshness on analytics1013 is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on 
analytics1026 is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on iron is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on mw1014 is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on mw1079 is CRITICAL: Puppet has not run in the last 10 hours [07:28:53] PROBLEM - Puppet freshness on mw1098 is CRITICAL: Puppet has not run in the last 10 hours [07:28:53] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Puppet has not run in the last 10 hours [07:28:54] PROBLEM - Puppet freshness on mw1133 is CRITICAL: Puppet has not run in the last 10 hours [07:28:54] PROBLEM - Puppet freshness on mw1149 is CRITICAL: Puppet has not run in the last 10 hours [07:28:55] PROBLEM - Puppet freshness on mw1151 is CRITICAL: Puppet has not run in the last 10 hours [07:28:55] PROBLEM - Puppet freshness on mw1168 is CRITICAL: Puppet has not run in the last 10 hours [07:28:56] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Puppet has not run in the last 10 hours [07:28:56] PROBLEM - Puppet freshness on mw1172 is CRITICAL: Puppet has not run in the last 10 hours [07:28:57] PROBLEM - Puppet freshness on mw1216 is CRITICAL: Puppet has not run in the last 10 hours [07:28:57] PROBLEM - Puppet freshness on mw18 is CRITICAL: Puppet has not run in the last 10 hours [07:28:58] PROBLEM - Puppet freshness on mw32 is CRITICAL: Puppet has not run in the last 10 hours [07:28:58] PROBLEM - Puppet freshness on mw69 is CRITICAL: Puppet has not run in the last 10 hours [07:28:59] PROBLEM - Puppet freshness on mw86 is CRITICAL: Puppet has not run in the last 10 hours [07:28:59] PROBLEM - Puppet freshness on mw88 is CRITICAL: Puppet has not run in the last 10 hours [07:29:00] PROBLEM - 
Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [07:29:00] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [07:29:01] shut uuuuuuup [07:29:04] stupid puppet [07:29:13] I am wondering why it is broken on mw boxes [07:29:23] I merged a fix a few moments ago [07:29:44] MaxSem, paravoid, yeah, but it seems fairly big "RT" request - "please write a system to read our text file with IPv4 & v6 CIDR blocks mapping to a string ID, same way as geolocation country lookup? [07:29:52] PROBLEM - Puppet freshness on analytics1016 is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on mexia is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on mw1034 is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on mw1156 is CRITICAL: Puppet has not run in the last 10 hours [07:29:53] PROBLEM - Puppet freshness on mw1183 is CRITICAL: Puppet has not run in the last 10 hours [07:29:53] PROBLEM - Puppet freshness on mw123 is CRITICAL: Puppet has not run in the last 10 hours [07:29:54] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours [07:29:54] PROBLEM - Puppet freshness on mw72 is CRITICAL: Puppet has not run in the last 10 hours [07:29:55] PROBLEM - Puppet freshness on mw85 is CRITICAL: Puppet has not run in the last 10 hours [07:29:55] PROBLEM - Puppet freshness on mw97 is CRITICAL: Puppet has not run in the last 10 hours [07:29:56] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [07:29:56] PROBLEM - Puppet freshness 
on search32 is CRITICAL: Puppet has not run in the last 10 hours [07:29:57] PROBLEM - Puppet freshness on snapshot1004 is CRITICAL: Puppet has not run in the last 10 hours [07:29:57] PROBLEM - Puppet freshness on srv263 is CRITICAL: Puppet has not run in the last 10 hours [07:30:12] yurik: that is big? [07:30:30] yurik: we have RT requests that say "setup a new datacenter" or something [07:30:44] so, I'm not sure that a simple program like that can be considered a big request :) [07:30:51] compared to an RT ticket "please add yurik shell access" - yeah :) but yes, slightly smaller than a datacenter [07:30:52] how many of those? :P [07:30:52] PROBLEM - Puppet freshness on mw1002 is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on lardner is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on mw102 is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on mw1023 is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on mw1029 is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on mw1077 is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on mw1096 is CRITICAL: Puppet has not run in the last 10 hours [07:30:53] PROBLEM - Puppet freshness on mw1163 is CRITICAL: Puppet has not run in the last 10 hours [07:30:53] PROBLEM - Puppet freshness on mw1097 is CRITICAL: Puppet has not run in the last 10 hours [07:30:54] PROBLEM - Puppet freshness on mw1192 is CRITICAL: Puppet has not run in the last 10 hours [07:30:54] PROBLEM - Puppet freshness on mw1212 is CRITICAL: Puppet has not run in the last 10 hours [07:30:55] PROBLEM - Puppet freshness on mw27 is CRITICAL: Puppet has not run in the last 10 hours [07:30:55] PROBLEM - Puppet freshness on mw44 is CRITICAL: Puppet has not run in the last 10 hours [07:30:56] PROBLEM - Puppet freshness on mw6 is 
CRITICAL: Puppet has not run in the last 10 hours [07:30:56] PROBLEM - Puppet freshness on mw89 is CRITICAL: Puppet has not run in the last 10 hours [07:30:57] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours [07:30:57] PROBLEM - Puppet freshness on search1017 is CRITICAL: Puppet has not run in the last 10 hours [07:30:58] PROBLEM - Puppet freshness on srv241 is CRITICAL: Puppet has not run in the last 10 hours [07:30:58] PROBLEM - Puppet freshness on srv279 is CRITICAL: Puppet has not run in the last 10 hours [07:30:59] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours [07:30:59] PROBLEM - Puppet freshness on wtp1 is CRITICAL: Puppet has not run in the last 10 hours [07:31:00] PROBLEM - Puppet freshness on srv288 is CRITICAL: Puppet has not run in the last 10 hours [07:31:25] MaxSem, i suspect they have 3 or 4 RT tickets like that... it's been backordered for the past 5 years ;) [07:31:52] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] PROBLEM - Puppet freshness on mw1005 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] PROBLEM - Puppet freshness on mw1067 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] PROBLEM - Puppet freshness on mw1152 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] PROBLEM - Puppet freshness on mw1209 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] PROBLEM - Puppet freshness on mw21 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] hashar: this https://gerrit.wikimedia.org/r/#/c/57199/ broke puppet [07:31:57] hashar: but jenkins didn't catch it :) [07:32:07] are there no alerts for failed puppet runs instead of when it got really old? [07:32:19] MaxSem: ?
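The RT request yurik describes above — a text file of IPv4/IPv6 CIDR blocks mapped to a string ID, queried the same way as a GeoIP country lookup — amounts to a longest-prefix match. A minimal sketch using Python's stdlib `ipaddress` module; the file format, block IDs, and example networks here are invented for illustration:

```python
# Sketch of a CIDR-block -> string-ID lookup (GeoIP-style).
# Lines are "CIDR<whitespace>ID"; most-specific prefix wins.
import ipaddress

def load_blocks(lines):
    """Parse 'CIDR ID' lines into (network, id) pairs, most-specific first."""
    blocks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        cidr, block_id = line.split()
        blocks.append((ipaddress.ip_network(cidr), block_id))
    # Longest prefix should win, so sort by prefix length, descending.
    blocks.sort(key=lambda p: p[0].prefixlen, reverse=True)
    return blocks

def lookup(blocks, ip):
    """Return the ID of the most specific block containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for net, block_id in blocks:
        if addr.version == net.version and addr in net:
            return block_id
    return None

# invented example data
blocks = load_blocks([
    '10.64.0.0/16    eqiad',
    '10.64.16.0/24   eqiad-row-b',
    '2620:0:861::/48 eqiad-v6',
])
```

The sort-by-prefix-length trick keeps the lookup correct when blocks nest; a production version for high request volume would use a trie instead of a linear scan.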
[07:32:52] PROBLEM - Puppet freshness on kaulen is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw1006 is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw1012 is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw1016 is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw1193 is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw19 is CRITICAL: Puppet has not run in the last 10 hours [07:32:53] PROBLEM - Puppet freshness on mw1028 is CRITICAL: Puppet has not run in the last 10 hours [07:32:53] PROBLEM - Puppet freshness on mw1071 is CRITICAL: Puppet has not run in the last 10 hours [07:32:54] PROBLEM - Puppet freshness on mw24 is CRITICAL: Puppet has not run in the last 10 hours [07:32:54] PROBLEM - Puppet freshness on mw22 is CRITICAL: Puppet has not run in the last 10 hours [07:32:55] PROBLEM - Puppet freshness on search1020 is CRITICAL: Puppet has not run in the last 10 hours [07:32:55] PROBLEM - Puppet freshness on mw77 is CRITICAL: Puppet has not run in the last 10 hours [07:32:56] PROBLEM - Puppet freshness on mw73 is CRITICAL: Puppet has not run in the last 10 hours [07:32:56] PROBLEM - Puppet freshness on mw31 is CRITICAL: Puppet has not run in the last 10 hours [07:32:57] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [07:32:57] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [07:32:58] PROBLEM - Puppet freshness on srv236 is CRITICAL: Puppet has not run in the last 10 hours [07:32:58] PROBLEM - Puppet freshness on srv256 is CRITICAL: Puppet has not run in the last 10 hours [07:32:59] PROBLEM - Puppet freshness on srv265 is CRITICAL: Puppet has not run in the 
last 10 hours [07:32:59] PROBLEM - Puppet freshness on srv248 is CRITICAL: Puppet has not run in the last 10 hours [07:33:00] PROBLEM - Puppet freshness on snapshot2 is CRITICAL: Puppet has not run in the last 10 hours [07:33:00] PROBLEM - Puppet freshness on mw1064 is CRITICAL: Puppet has not run in the last 10 hours [07:33:01] PROBLEM - Puppet freshness on srv277 is CRITICAL: Puppet has not run in the last 10 hours [07:33:14] why doesn't it say "failed puppet run on mw666: error message here"? [07:33:15] paravoid, should it be in "ops-requests" ? [07:33:28] paravoid: looking [07:33:52] PROBLEM - Puppet freshness on mw108 is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on analytics1025 is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on mw101 is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on analytics1005 is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on mw54 is CRITICAL: Puppet has not run in the last 10 hours [07:33:53] PROBLEM - Puppet freshness on mw29 is CRITICAL: Puppet has not run in the last 10 hours [07:33:53] PROBLEM - Puppet freshness on mw1113 is CRITICAL: Puppet has not run in the last 10 hours [07:33:54] PROBLEM - Puppet freshness on mw1207 is CRITICAL: Puppet has not run in the last 10 hours [07:33:54] PROBLEM - Puppet freshness on mw1160 is CRITICAL: Puppet has not run in the last 10 hours [07:33:55] PROBLEM - Puppet freshness on mw1037 is CRITICAL: Puppet has not run in the last 10 hours [07:33:55] PROBLEM - Puppet freshness on mw105 is CRITICAL: Puppet has not run in the last 10 hours [07:33:56] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [07:33:56] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last
10 hours [07:33:57] PROBLEM - Puppet freshness on mw94 is CRITICAL: Puppet has not run in the last 10 hours [07:33:57] PROBLEM - Puppet freshness on xenon is CRITICAL: Puppet has not run in the last 10 hours [07:33:58] PROBLEM - Puppet freshness on srv295 is CRITICAL: Puppet has not run in the last 10 hours [07:33:58] PROBLEM - Puppet freshness on snapshot1003 is CRITICAL: Puppet has not run in the last 10 hours [07:33:59] PROBLEM - Puppet freshness on mw1187 is CRITICAL: Puppet has not run in the last 10 hours [07:33:59] PROBLEM - Puppet freshness on mw103 is CRITICAL: Puppet has not run in the last 10 hours [07:34:43] yurik: yeah, a simple mail should suffice [07:34:55] paravoid: that is because Jenkins only runs "puppet parser validate" that does not really do anything :-( [07:34:57] i'm actually filling out a new ticket [07:35:05] paravoid: I think I understand MaxSem's point, and it's a good one. The problem is not "Puppet freshness". Things aren't stale or neglected or old. Puppet is industriously running on each of those hosts but just barfing when it encounters the bad change [07:35:06] you can file tickets via mail [07:35:09] but rt.wm.org works :) [07:35:16] paravoid: to catch that kind of error (i.e. a fact being passed a wrong parameter), we need unit tests :D [07:35:44] ori-l, MaxSem: you're right, but we don't have the puppet report service set up or a nagios check to check this [07:35:57] so I asked why :) [07:36:20] plus, puppet's error messages are so cryptic [07:36:31] that we'd still want to log in and see what's going on [07:36:43] but yes, you're right, this could use some enhancement [07:37:22] don't we have the errors reported in syslog too?
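Ori-l's point above — that "freshness" alerting only sees the age of the last run, while here puppet is running constantly and failing — can be sketched as two checks over a simplified last-run summary. Field names are a simplification of puppet's `last_run_summary.yaml` (they are assumptions, not its exact schema); the 10-hour threshold comes from the alerts in this log:

```python
# Sketch: "freshness" check vs. a status-aware check.
# A stale-only check says OK when puppet ran a minute ago and failed;
# checking the failure count as well catches exactly this incident.
import time

FRESHNESS_THRESHOLD = 10 * 3600  # seconds, matching the 10-hour alerts above

def freshness_alert(summary, now):
    """Old-style check: CRITICAL only when the last run is too old."""
    age = now - summary['last_run']
    return 'CRITICAL' if age > FRESHNESS_THRESHOLD else 'OK'

def status_alert(summary, now):
    """More precise check: also CRITICAL when the last run failed."""
    if summary['failed'] > 0:
        return 'CRITICAL: last puppet run had %d failures' % summary['failed']
    return freshness_alert(summary, now)

now = time.time()
failing = {'last_run': now - 60, 'failed': 1}         # ran a minute ago, barfed
stale = {'last_run': now - 11 * 3600, 'failed': 0}    # hasn't run in 11 hours
```

This is the gap ori-l's "Make icinga alert re: puppet client more precise" change in this log is aimed at; the code here is only an illustration of the distinction, not that patch.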
[07:38:38] we do :-] [07:38:49] Apr 3 06:02:33 10.0.11.103 puppet-agent[11334]: Failed to apply catalog: Parameter key failed: Key must not contain whitespace: ssh-rsa AAA [07:39:50] ori-l: thanks for the fix btw :) [07:40:11] np [07:51:52] New patchset: Ori.livneh; "Make icinga alert re: puppet client more precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57262 [07:53:13] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57262 [08:05:09] New patchset: Hashar; "Some clean up work to help get the package into Debian" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/57263 [08:08:09] RECOVERY - Puppet freshness on mw13 is OK: puppet ran at Wed Apr 3 08:08:00 UTC 2013 [08:08:09] RECOVERY - Puppet freshness on mw74 is OK: puppet ran at Wed Apr 3 08:08:01 UTC 2013 [08:08:09] RECOVERY - Puppet freshness on mw1105 is OK: puppet ran at Wed Apr 3 08:08:01 UTC 2013 [08:08:09] RECOVERY - Puppet freshness on mw1084 is OK: puppet ran at Wed Apr 3 08:08:01 UTC 2013 [08:08:09] RECOVERY - Puppet freshness on analytics1009 is OK: puppet ran at Wed Apr 3 08:08:01 UTC 2013 [08:08:09] RECOVERY - Puppet freshness on grosley is OK: puppet ran at Wed Apr 3 08:08:01 UTC 2013 [08:20:55] New patchset: Hashar; "Some clean up work to help get the package into Debian" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/57263 [08:21:11] New review: Hashar; "Fix tab in debian/changelog" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/57263 [08:23:09] RECOVERY - Puppet freshness on mw1145 is OK: puppet ran at Wed Apr 3 08:23:06 UTC 2013 [09:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [10:13:53] New patchset: Yurik; "(RT 4835) Apparently api logs were 
moved to emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57271 [10:14:34] New patchset: Yurik; "(RT 4835) Add non-sudo yurik to emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57271 [10:31:06] mark, after a number of failed attempts to get varnish up through the puppets (it has been showing very strange messages), paravoid suggested that the whole issue be deferred to the ops team, since you are much better equipped to solve it :) I posted the ticket https://rt.wikimedia.org/Ticket/Display.html?id=4881 describing the needed functionality in depth. This way we can concentrate on... [10:31:07] ...the zero extension, removing redirects, proper landing page, etc [10:31:27] ok [10:32:10] when you have a moment, please comment your thoughts. No rush obviously :) [10:43:26] !log Deactivated AS6908 peering on cr2-knams [10:43:34] Logged the message, Master [10:57:32] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [11:00:32] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [11:27:39] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [11:31:39] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [11:35:39] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [11:44:12] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [11:44:12] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [11:44:12] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [11:45:38] New patchset: Faidon; "pybal: sort monitor list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57282 [11:46:07] Change merged: 
Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57282 [12:03:32] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [12:08:26] New review: Diederik; "I would rather setup rsync of the API logs to stat1 than hand out access to emery, the machine is ve..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/57271 [12:15:39] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [12:18:29] PROBLEM - search indices - check lucene status page on search1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 62805 bytes in 0.016 second response time [12:18:39] PROBLEM - Varnish HTCP daemon on cp3003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [12:19:19] PROBLEM - SSH on cp3003 is CRITICAL: Connection refused [12:19:29] PROBLEM - Varnish HTTP upload-backend on cp3003 is CRITICAL: Connection refused [12:19:38] !log Rebooting cp3003 for RAID reconfiguration [12:19:46] Logged the message, Master [12:24:29] PROBLEM - Varnish HTTP upload-frontend on cp3003 is CRITICAL: Connection timed out [12:26:19] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [12:28:59] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [12:29:19] RECOVERY - SSH on cp3003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [12:29:29] RECOVERY - Varnish HTTP upload-backend on cp3003 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.164 second response time [12:29:29] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 82.96 ms [12:29:39] RECOVERY - Varnish HTCP daemon on cp3003 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [12:30:19] RECOVERY - Varnish HTTP upload-frontend on cp3003 is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.165 second response time [12:32:39] 
RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [12:48:36] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [12:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [12:57:20] New patchset: Reedy; "Remove wgUseMemCached, died in 1.17" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57206 [12:57:40] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57206 [12:57:58] New patchset: Reedy; "(bug 46489) Set wmgBabelCategoryNames for Ukrainian Wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56420 [12:58:04] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56420 [12:58:22] New patchset: Reedy; "(bug 46154) Override $wgGroupPermissions for thwiki Add abusefilter-log-detail and patrol for autoconfirmed on thwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56564 [12:58:29] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56564 [12:59:43] New patchset: Reedy; "(bug 45643) Add new user groups to urwiki with specific rights Add abusefilter and rollbacker user groups, modify $wgAddGroups for crats and sysops, modify $wgRemoveGroups for crats" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56578 [12:59:48] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56578 [13:02:36] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [13:10:25] !log reedy synchronized wmf-config/ [13:10:32] Logged the message, Master [13:11:08] Reedy: Can't connect 
to MySQL server on '10.64.16.158' (4)) on dewiki [13:11:19] uhh [13:11:47] pc1003 [13:12:45] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=MySQL+eqiad&h=pc1003.eqiad.wmnet&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [13:12:59] it's in trouble [13:13:13] i'll stop mysql [13:13:46] Do we need to remove it from the MW config too? [13:13:53] i'll start it in a bit, so not yet [13:14:04] alright [13:14:35] starting [13:14:50] back up [13:15:35] !log Stopped and started MySQL on pc1003 after finding mysql deadlocked [13:15:41] Logged the message, Master [13:18:01] New patchset: Reedy; "Document parser cache IPs in db files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57287 [13:19:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57287 [13:28:36] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [13:32:36] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [13:40:48] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: Timeout while attempting connection [13:48:40] New patchset: Mark Bergsma; "Disable Tomasz's account" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57294 [13:48:41] New patchset: Mark Bergsma; "Attempt to keep 20% SSD space free" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57295 [13:49:29] oops [13:49:57] Change abandoned: Mark Bergsma; "Previously done by Ariel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57294 [13:51:18] New patchset: Mark Bergsma; "Attempt to keep 20% SSD space free" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57295 [13:53:13] New patchset: Mark Bergsma; "Attempt to keep 20% SSD space free" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57295 [13:53:55] Change merged: Mark Bergsma; [operations/puppet]
(production) - https://gerrit.wikimedia.org/r/57295 [14:02:38] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [14:07:38] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:09:08] PROBLEM - Puppet freshness on db1053 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:36] New patchset: Hashar; "contint: install ruby1.9.3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57296 [14:11:21] hey mark, just wondering what's the status for "ACL analytics subnets from the rest of the network" (https://rt.wikimedia.org/Ticket/Display.html?id=4433), is that finished? [14:11:28] no [14:12:08] are you waiting for any information from our side? [14:12:15] no, i'm waiting on time ;) [14:12:26] ok got it :) [14:12:29] thx [14:13:43] root: I am in need of ruby1.9.3 on gallium to syntax check the ruby 1.9 scripts. I have added a package to contint module https://gerrit.wikimedia.org/r/57296 [14:13:43] would anyone please approve the tiny change? Thx! [14:13:43] hashar: Please stop swearing [14:14:27] you've promised to split packages.pp :) [14:15:00] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57296 [14:15:35] hashar: everything ok with jsduck btw? [14:15:36] or is this Krinkle's domain? [14:16:43] paravoid: ah true [14:17:28] paravoid: yeah that is Krinkle :-] I am not aware of any specific issue though I haven't followed that subject closely [14:17:55] paravoid: and yeah I need to split packages.pp . Maybe I should use subclasses [14:21:53] !log cmjohnson synchronized wmf-config/db-eqiad.php 'setting weight on db1028 to 400' [14:22:00] Logged the message, Master [14:22:56] paravoid: thank you!!! 
We can now lint the qa/browsertests.git ruby scripts :-] [14:25:05] New patchset: Ottomata; "Rsyncing API logs from emery to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57298 [14:25:17] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57298 [14:33:38] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [14:34:32] New patchset: Mark Bergsma; "Make the empty partition a primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57301 [14:35:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57301 [14:35:06] yurik: https://gerrit.wikimedia.org/r/57298 should allow you to analyze the api logs on stat1, maybe you can abandon https://gerrit.wikimedia.org/r/#/c/57271/ ? [14:37:38] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:40:42] Hm. Dell's PXE apparently doesn't support console redirection. How... useful. [14:41:02] what do you mean? [14:41:41] Console redirection (from the DRAC) works perfectly up to the point where PXE starts, then... no output. [14:42:03] at that point grub should output that [14:42:09] Actually, that's probably the NIC's firmware's bug, not Dell's. [14:42:21] ah PXE itself you mean [14:42:26] mark: provided PXE /worked/ and that's not what you're trying to debug. :-) [14:42:35] you're getting confused [14:42:36] it's possible that it would work with "redirection after boot" [14:42:41] drac's console redirection works fine [14:42:44] but that needs to be disabled for grub output to work [14:43:45] New patchset: Hashar; "gerrit-wm now sends translatewiki notif to #mediawiki-i18n" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57302 [14:43:49] At any rate, I suspect it's just a "network not actually wired" problem given that I don't even see DHCP requests at all from the box.
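The check described above and repeated later in the log — watching the DHCP server (brewster here) for DHCPDISCOVER packets from specific MAC addresses — boils down to filtering dhcpd's syslog lines. A small sketch; the sample lines follow ISC dhcpd's usual syslog format, and the MAC addresses and lease details are invented:

```python
# Filter ISC dhcpd syslog lines for DHCPDISCOVERs from specific MACs,
# the kind of check done on brewster in this log. Sample data is invented.
import re

DISCOVER_RE = re.compile(r'DHCPDISCOVER from ([0-9a-f:]{17})', re.IGNORECASE)

def discovers_from(log_lines, macs):
    """Return the subset of `macs` actually seen sending DHCPDISCOVER."""
    wanted = {m.lower() for m in macs}
    seen = set()
    for line in log_lines:
        m = DISCOVER_RE.search(line)
        if m and m.group(1).lower() in wanted:
            seen.add(m.group(1).lower())
    return seen

# invented sample syslog lines
log = [
    'Apr  3 15:46:01 brewster dhcpd: DHCPDISCOVER from 00:1e:c9:aa:bb:cc via eth0',
    'Apr  3 15:46:02 brewster dhcpd: DHCPOFFER on 10.0.0.50 to 00:1e:c9:aa:bb:cc via eth0',
]
```

An empty result for the target MACs is what distinguishes "box never got on the wire" (cabling, vlan, NIC firmware) from a server-side dhcpd/dns misconfiguration, which is exactly the fork the conversation takes.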
[14:44:02] console redirection is interception of int10h and int16h to take text video output and write it to the serial port [14:44:11] this is suboptimal to do for grub though [14:44:31] grub can write to the serial port, which is way better [14:44:43] PXE loading itself however... [14:44:46] (grub2 can write menus to both vga and serial, way better than grub 1) [14:45:09] if, however, the interrupt handler tries to write to serial and grub also tries to write to serial at the same time [14:45:11] paravoid: I'm talking about PXE itself, not about what happens once I got a bootloader in. :-) [14:45:13] you get garbled output at best [14:49:24] i hate partman-auto [14:55:18] New patchset: Mark Bergsma; "Add missing period" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57303 [14:56:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57303 [15:02:11] New review: Diederik; "See https://gerrit.wikimedia.org/r/57298 for the rsync of api logs to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57271 [15:02:34] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [15:11:13] New patchset: Mark Bergsma; "Adjust partition priorities" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57306 [15:11:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57306 [15:27:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:19] New patchset: Mark Bergsma; "Adjust priorities/partition sizes, suppress prompt" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57308 [15:28:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [15:30:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57308 
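Mark's point above — grub2 can write its menu to both VGA and serial directly, which beats int10h/int16h interception — corresponds to a configuration like the following sketch. The unit number, baud rate, and file path are assumed values, not taken from the hosts being debugged here:

```shell
# /etc/default/grub -- sketch only; unit/speed/path are assumptions
GRUB_TERMINAL="console serial"                        # menu on both VGA and serial
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"  # grub's own serial setup
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"  # kernel console too
# then regenerate grub.cfg (e.g. update-grub on Debian/Ubuntu)
```

With this, grub draws on both outputs at once, avoiding the garbled text you get when BIOS redirection and grub both write to the serial port, as described above.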
[15:30:21] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:32:21] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [15:36:32] who can look at deb packages that would need to go to apt.wikimedia.org? created rt 4868 with links to a new version of libvpx to fix some video transcoding issues [15:36:55] New patchset: Lcarr; "hopefully adding bond-master bond0 to all of the sub-interfaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57310 [15:37:05] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:39:04] j^: whoever is on rt duty - but it looks like you just submitted the ticket so be patient [15:40:31] LeslieCarr: cool, just wanted to make sure it gets seen. [15:45:00] coren: looking at labstore1001 now (still no pxe) [15:45:54] cmjohnson1: You see the PXE fail on the physical console, I take it. Want me to monitor DHCP activity while you try? [15:46:22] sure give me a few mins to look at it first [15:46:57] kk [15:59:07] coren: still not sure what the deal is...i checked dhcpd, dns, network and all appears normal. The fact that it is not even making it out to brewster is troubling me. oh, and i tried a new cable. [15:59:17] have you tried new pxe labstore1002 yet? [15:59:40] cmjohnson1: I tried an hour ago or so. Want me to give it another whack? [15:59:56] so labstore1002 is not pxe booting either? [16:00:06] And I can confirm I see no DHCPDISCOVER from those mac addresses. [16:00:11] cmjohnson1: None of the four are. [16:00:44] okay...that eliminates a h/w issue [16:00:58] unlikely that all 4 would have a h/w problem [16:00:59] That or we are REALLY unlucky. :-) [16:01:07] New patchset: Ottomata; "Move geoip to a module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714 [16:01:39] Hm. Actually, I lied. I didn't try 1004.
Want me to, just in case? [16:01:54] yes please...thx [16:02:09] j^: that needs to go into git [16:02:16] New review: Ottomata; ">What's the purpose of the GeoIP/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714 [16:02:19] (gerrit actually) [16:02:52] and it helps if i get the letters in the right order... [16:03:05] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [16:03:34] jeremyb_: what module? [16:03:47] j^: gets its own new repo [16:04:12] cmjohnson1: firing it up [16:04:26] jeremyb_: ok that's new compared to the last deb i made, is there some wiki page with the workflow? [16:04:33] j^: which is the last? [16:04:51] jeremyb_: ffmpeg2theora [16:05:09] j^: https://wikitech.wikimedia.org/wiki/Git-buildpackage#Pushing_changes_into_Gerrit maybe? [16:05:58] cmjohnson1: Hm. Incidentally, I only see one shelf on 1004 [16:06:45] cmjohnson1: And no PXE joy; same deal as the others; no DHCP activity on brewster. [16:08:51] ahhh, tiago :) [16:08:59] New patchset: Hashar; "package-builder learned 'cowbuilder'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382 [16:10:18] cmjohnson1: afk for a few, getting lunch [16:10:29] coren: okay [16:12:47] New review: Hashar; "I am not sure who is using that class, but I really need cowbuilder so I have tweaked it to support ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382 [16:14:06] gah, what is it with me and spelling today [16:14:30] first vxp/vpx. now mpeg / pmeg [16:20:00] cmjohnson1: back [16:20:16] LeslieCarr, is this something you could respond to? https://rt.wikimedia.org/Ticket/Display.html?id=4875 [16:21:01] coren: think i figured out the problem...not set up in a vlan on the network [16:21:14] cmjohnson1: Yeah, I saw. :-) [16:23:04] andrewbogott: i thought about giving that one to leslie. and then i thought whatever the answer is yurik's not going to like it.
:-) [16:23:57] there's been talk of cutting LVS IPs with the new unified cert... [16:24:08] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:24:12] New patchset: Mark Bergsma; "Add cp3005/3006 to Puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57313 [16:24:53] New review: Hashar; "The original import was completely wrong and based on another tarball :-]" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/56602 [16:25:40] New review: Hashar; "So lets land it on apt.wikimedia.org ? :-]" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069 [16:26:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57313 [16:27:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.715 second response time [16:32:08] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [16:32:53] New review: Siebrand; "I support the specification. I cannot assess the implementation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57302 [16:33:45] paravoid: I still have the hack in place on gallium that overrides the ext-js location, let me check [16:33:49] paravoid: Nope, still broken. [16:33:53] https://doc.wikimedia.org/mediawiki-core/master/js/extjs/ext-all.js [16:33:59] 3.0.3 instead of 4.1 [16:34:14] !log authdns update [16:34:20] Hm.. Maybe... hold on [16:34:21] Logged the message, Master [16:34:26] https://doc.wikimedia.org/VisualEditor/master/extjs/ext-all.js [16:34:31] paravoid: Perfect :) [16:35:25] paravoid: jsduck is fully operational now without hacks [16:35:42] er? 
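Earlier in the log, Diederik's review of change 57271 and ottomata's merged change 57298 ("Rsyncing API logs from emery to stat1") replaced direct shell access on emery with a periodic rsync. A hedged sketch of the kind of puppet resource such a change could contain — the hosts and destination path appear in this log, but the module source path, user, and schedule are assumptions, not the contents of the actual patch:

```puppet
# Hypothetical shape of an "rsync api logs from emery to stat1" job.
# emery, stat1 and /a/squid/archive/api come from the log; everything
# else here is an assumption for illustration.
cron { 'rsync_api_logs_to_stat1':
  command => '/usr/bin/rsync -rt emery.wikimedia.org::squid/archive/api/ /a/squid/archive/api/',
  user    => 'root',
  hour    => 5,
  minute  => 0,
}
```

Pulling logs to the analysis host keeps the sensitive collection box closed while still giving analysts the data, which is the trade-off Diederik's -1 review argued for.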
[15:35:45] but I fixed the package [15:35:48] and upgraded [15:36:45] PROBLEM - Host cp3005 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:55] RECOVERY - Host cp3005 is UP: PING OK - Packet loss = 0%, RTA = 89.38 ms [15:39:06] late comment. [15:40:02] New patchset: Mark Bergsma; "Add cp3005 to the pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57315 [16:40:18] cmjohnson1: Give me a ping when the vlan is set? [16:40:24] k [16:42:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57315 [16:49:05] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:49:56] coren: let's give labstore1001 a go [16:50:41] cmjohnson1: Firing it up [16:53:11] looks to be failing still [16:53:16] cmjohnson1: No joy. :-( [16:53:28] kk [16:53:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57259 [16:54:08] !log reedy synchronized php-1.21wmf12/includes/Collation.php [16:54:09] Hm. Stupid question; could the nics be swapped? [16:54:19] Logged the message, Master [16:55:06] !summon thehelpfulone [16:55:13] very doubtful [16:57:18] 13:15 mark: Stopped and started MySQL on pc1003 after finding mysql deadlocked [16:57:21] speak of the devil, heh [16:57:57] coren: see security but I think the vlan change did not take [16:58:37] New patchset: Nemo bis; "Add ganglia graph for global jobqueue length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [16:58:38] robh: the interfaces were in default vlan [16:58:57] ge-2/0/0 to ge-2/0/1 and ge-3/0/0 to ge-3/0/1 [16:59:02] ahhh [16:59:05] default vlan is fine [16:59:16] i had an issue where the port was in a range of ports put into a vlan [16:59:25] and juniper os is too stupid to just pull out the single port [16:59:29] have to redo the ranges =P [16:59:32] mutante: hey. want to do dns for 4355 and i'll do apache in gerrit?
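The juniper range problem robh describes above comes down to how Junos groups ports: an interface-range is only a named grouping, and member ports stay in the default vlan until the range itself carries switching config. A hypothetical sketch — the range and vlan names are taken from this conversation, but the exact syntax is from general Junos EX conventions, not from the actual switch config:

```
# Hypothetical Junos sketch: declaring the range alone does nothing;
# the ports move vlans only once the range carries config like this.
interfaces {
    interface-range labs-host1 {
        member-range ge-2/0/0 to ge-2/0/1;
        member-range ge-3/0/0 to ge-3/0/1;
        unit 0 {
            family ethernet-switching {
                port-mode access;
                vlan {
                    members labs-host1-c-eqiad;
                }
            }
        }
    }
}
```

This also matches mark's later diagnosis that creating the interface range by itself "does nothing", and robh's complaint that pulling a single port back out means redoing the ranges.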
[16:59:47] cmjohnson1: want me to give it a shot and see what i see? [16:59:47] mutante: why did you need THO? [16:59:54] please [17:00:03] ok, checking it out now [17:00:13] they need to go to labs-host1 [17:00:42] mutante: (just a symlink to wikipedia.com i assume) [17:00:48] cmjohnson1: trying to put into what vlan? [17:00:56] labs instaces1-c-eqiad? [17:01:04] (spelling is off but that one?) [17:01:04] i want them in labs-host1-c-eqiad [17:01:10] ok [17:01:13] thx [17:01:38] jeremyb_: yes, but later today, need to finish what i am on [17:01:54] jeremyb_: i wanted him for a discussion about (private) mailing lists [17:01:57] bbiaw [17:02:14] mutante: ahh. in that case see 4880 :) see you later [17:03:03] hrmm, show | compare shows my change, commit.. no errors, and yea..... it's still in default [17:03:05] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [17:03:11] cmjohnson1: wtf [17:03:13] =P [17:03:16] yep... [17:03:19] no idea [17:03:28] damn it, this is gonna annoy me until i know. [17:03:51] it may be some setup thing has to happen to that vlan [17:03:56] cuz it has no actual ports assigned to it yet [17:04:02] so it may not be fully deployed on the stack properly [17:04:07] are these the first labs hosts in row C? :) [17:04:09] yep [17:04:16] yeah that's likely not fully setup [17:04:29] cmjohnson1: So, sounds like you get to make a ticket in networking [17:04:38] unless mark wants to fix it up now ;] [17:04:43] in fact [17:04:50] chris just created the interface range, which does nothing [17:05:28] cmjohnson1: So it just stumbled from shit you and i can do to shit we need our network admins to do ;] [17:06:58] jeremyb_, not sure what you meant :) [17:08:40] yurik: you're potentially going to need to split IPs in 2 (eventually?)
and other people are already working on combining IPs (the reverse of you) [17:09:01] true that [17:09:18] its just that telcos have much harder time filtering by URL [17:09:31] esp for the large volume site like ours [17:10:08] mark: what needs to be done for it to work? [17:10:45] yurik: we change IP addresses regularly, and it's part of our failover setup as well [17:10:46] yurik: so we have some images that are over 1GB I think. or at least a whole lot of them at 100MB. what's to stop people from downloading those instead of videos? (if videos are blocked). just assuming there won't be demand? [17:10:55] so external people explicitly using our IP addresses is not supported [17:11:04] it will break often [17:11:39] mark: what about by rdns and pattern matching on domain name? [17:11:58] who does that? [17:12:07] i'm saying they could do that [17:12:16] instead of hardcoding [17:12:33] !log aaron synchronized php-1.21wmf12/includes/objectcache/SqlBagOStuff.php 'deployed 61587acc64cb62400ff7978271c54e8bd8b1f02d' [17:12:34] sounds pretty horrible to me [17:12:40] jeremyb_, in reality some of them don't even want the images :) [17:12:41] Logged the message, Master [17:13:11] at least some old ones we signed up had no image settings [17:13:13] yurik: right, i got that [17:14:12] !log aaron synchronized php-1.22wmf1/includes/objectcache/SqlBagOStuff.php 'deployed b61053ca4b554f6bd18fb6408967839cdb5ccde2' [17:14:20] Logged the message, Master [17:17:22] ottomata, are the api logs now properly syncing to stat1? [17:18:05] yup! [17:18:14] cool, thx :) [17:18:19] /a/squid/archive/api [17:18:27] do you know why they didn't? [17:18:31] they were never set to [17:18:35] just out of curiosity [17:18:36] ah, ok [17:18:51] thx for fixing it! [17:19:57] Change abandoned: Yurik; "was rsynced, thanks ottomata!"
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57271 [17:26:05] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [17:26:47] mutante: can you figure out which of my recipients was / wasn't on the list? [17:31:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [17:34:26] jeremyb_: see 4880 [17:35:10] mutante: i meant in the private file [17:35:31] as it is / was. not as it will be made. :) [17:36:14] jeremyb_ andrewbogott : yeah, i think the answer was unsatisfactory [17:36:16] well to them [17:36:17] fine to me [17:36:19] weird that they don't have the same values in config for those options :) [17:36:25] jeremyb_: re 4355: softwarewikipedia.net is a link to mediawiki.org not to wikipedia.com [17:37:05] mutante: oooooh. interesting. i was going to do wikipedia.org but i guess mediawiki.org is relevant too [17:37:11] jeremyb_: yes, that is the constant problem, but it's decentralized, so what to do... [17:37:24] re: the mailing lists [17:37:29] yeah [17:37:34] federate! [17:37:42] it is centralized now [17:37:43] THO contacts list admins :p [17:37:49] we did that before [17:38:13] LeslieCarr: Thanks for responding, in any case.
It's not unreasonable that we use dns :) [17:38:14] or take the debian approach: have only central admins as list admins and individual lists only have the moderator passwd [17:39:24] will result in dozens and dozens of tickets for the central admins [17:39:37] maybe if you hire a full time person for it, heh [17:40:24] yeah, i don't know much about how it works for them [17:40:31] but they are adamant about it [17:41:02] it doesn't make it easier that the policies are challenged all the time [17:41:24] re: advertised = 0 [17:41:38] and that mailman likes to use 0/1 and True for some reason :p [17:41:55] but never False? :) [17:42:05] couldnt find one at least :p [17:43:00] btw. checking values for all lists works like: for list in $(./list_lists -b); do echo -n ${list}\|; ./config_list -o - ${list} |grep "something" ; done | tee somefile.log [17:43:57] so it would be really a lot easier if "something" would be ONE thing that determines private or not [17:44:08] and not a combination of several [17:44:20] let me take a look at that puppet code again, brb [18:07:25] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: private, closed and fishbowl to 1.22wmf1 [18:07:32] Logged the message, Master [18:09:43] heh [18:14:58] New patchset: Ryan Lane; "Only use service groups and users for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57324 [18:16:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57324 [18:16:39] New patchset: Andrew Bogott; "Do a full MW clone instead of a shallow one." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57325 [18:21:23] who should i check with to verify if a particular Varnish ACL is deployed to production? dfoy was wondering if a merged ACL was actually pushed out yet. 
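mutante's one-liner above can be wrapped up so the per-list audit is reusable. The Mailman `bin/` tool names (`list_lists -b`, `config_list -o -`) are straight from the log; the function name and the idea of passing the grep pattern as an argument are additions, so treat this as a sketch.

```shell
# Reusable version of the audit loop from the log: for every list, print
# "listname|<matching config lines>" for one config key.
# Assumes it runs from Mailman's bin/ directory, as the original did.
audit_lists() {
    pattern="$1"    # e.g. 'advertised' or 'archive_private'
    for list in $(./list_lists -b); do
        printf '%s|' "$list"
        # -o - dumps the list's config to stdout instead of a file
        ./config_list -o - "$list" | grep -- "$pattern"
    done
}

# audit_lists 'advertised' | tee advertised.log
```

As mutante points out, this only settles the question cleanly if one key actually determines whether a list is private; with a combination of several options you still have to cross-check by hand.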
[18:23:43] dr0ptp4kt, if it's merged, it will be deployed by puppet in 30 minutes [18:26:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special, wikimania and wikimedia wikis to 1.22wmf1 [18:26:46] Logged the message, Master [18:27:56] thanks, MaxSem. due to the strange hours for having carriers validate this stuff, if we want to check ahead of time during normal PST hours, what's the best way to validate? is there anyone with shell access who can check quickly? we think in due time we can re-arch some of the payload data to make it so we could tell on our own, but until then, what do you recommend for having someone tell us what's hot? [18:30:50] hey ops^^^:) [18:31:03] dr0ptp4kt: lava [18:31:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiversity and wikivoyage to 1.22wmf1 [18:31:10] Logged the message, Master [18:31:25] hrm, i guess i can check -- or just force a puppet run ? [18:32:26] a live check would be best in this case. can you send me the hot config dump? [18:32:44] ^ LeslieCarr [18:33:20] yeah [18:33:33] LeslieCarr, thx [18:33:39] i owe you TWO now. [18:33:47] s/TWO/THREE/ [18:34:16] New patchset: Demon; "Set up weekly jgit gc operations for all repositories" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57327 [18:34:56] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikinews and wikisource to 1.22wmf1 [18:35:05] Logged the message, Master [18:37:20] New review: Demon; "Probably want to hold off another day or two...just in case. But yeah, this will be nice." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57327 [18:39:39] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wiktionary to 1.22wmf1 [18:39:46] Logged the message, Master [18:40:02] thx again, LeslieCarr! 
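For the "is this ACL actually live?" question above, the quickest check on a cache host is to look for the acl block in the VCL that is actually deployed. A minimal sketch: the helper name and the file path in the usage comment are made up, and `varnishadm vcl.list` / `vcl.show <name>` is the other obvious route if you have access to the admin socket.

```shell
# Print the "acl <name> { ... }" block from a VCL file, so you can eyeball
# whether a merged change has actually reached the cache host.
show_acl() {
    name="$1"; file="$2"
    awk -v n="$name" '
        $1 == "acl" && $2 == n { inblock = 1 }
        inblock                { print }
        inblock && /}/         { exit }
    ' "$file"
}

# show_acl carrier_ips /etc/varnish/wikimedia.vcl   # hypothetical path and name
```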
[18:41:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiquote and wikibooks to 1.22wmf1 [18:41:10] Logged the message, Master [18:41:56] New patchset: coren; "Adding subnet labs-hosts1-c-eqiad to DHCP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57329 [18:41:59] !log DNS update - adding softwarewikipedia.com for RT-4355 [18:42:07] Logged the message, Master [18:42:38] New patchset: Reedy; "Everything non 'pedia to 1.22wmf1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57331 [18:42:50] !log restarting pdns on ns1 [18:42:58] Logged the message, Master [18:43:23] jeremyb_: dig softwarewikipedia.com [18:46:25] mutante: ok, but no apache yet i see [18:46:34] nope [18:46:45] mutante: do .org while you're in dns? i'll make a ticket for MM [18:48:52] !log Removed caesium and xenon from /home/wikipedia/common/docroot/noc/pybal/eqiad/parsoid [18:49:01] jeremyb: ah, another one that is ours but not using our DNS servers yet, yep doing so [18:49:02] Logged the message, Mr. Obvious [18:49:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57331 [18:49:39] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57329 [18:54:05] !log DNS update - also adding softwarewikipedia.org [18:54:12] Logged the message, Master [18:55:50] cmjohnson1: Success! [18:56:00] woot! [18:56:04] finally [18:56:18] So yeah, I also needed to add the subnet to DHCP. [18:56:58] so coren..when I rebooted 1001 earlier I think the raid may need to be fixed..i didn't see the VD's i added yesterday [18:57:23] cmjohnson1: That's because I removed them; we're experimenting with software raid over JBOD for this one. [18:57:52] ah..okay ..cool [18:58:14] Didn't touch the other three though. [18:58:41] ok [18:59:46] Ryan_Lane: how hard would it be to make testwiki use eqiad apaches? 
[19:00:22] New patchset: Catrope; "Remove xenon and caesium from Parsoid service, RobH is reclaiming them" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57334 [19:00:52] AaronSchulz: Surely it just needs all the specifics for it/srv193 stripping out and it should work? [19:01:10] I'd imagine it's easy, but it still hasn't been done [19:01:31] is there NFS in eqiad? [19:02:05] it's not mounted by the apaches [19:02:12] any such hacks should be removed [19:02:19] Well that's how testwiki used to run [19:02:20] AaronSchulz: probably not terribly hard [19:02:23] NFS mounted on srv193 [19:02:29] Direct synchronization with fenari [19:02:39] but we'd need to mount NFS in eqia [19:02:42] *eqiad [19:02:46] and I think we all want to move away from that [19:02:49] hm [19:02:50] actually [19:02:51] Yes [19:02:57] didn't we already move to tin? [19:02:58] So testwiki needs some sort of different architecture [19:03:08] I mean, it could run directly on the deploy host maybe? Is that excessively evil? [19:03:20] that's evil and I'll stab you [19:03:20] kind of [19:03:28] (I mean, it's a bit scary because DoSing test.wp.o DoSes the deploy host) [19:03:49] I don't want a world exposed web server on deployment ever again [19:03:53] * AaronSchulz takes away Roan's crank pipe [19:04:05] crank? [19:04:15] Reedy: crank/speed/meth [19:04:28] I thought the typo was of crack ;) [19:04:54] Reedy: http://en.wikipedia.org/wiki/Crank [19:05:15] haha, directly on deploy host [19:06:30] crack works too in this context? [19:08:33] Oh, ffs. The H700 doesn't have a JBOD mode; you need to actually make 12 raid0s [19:13:44] coren: yeah..i guess i should've told you that earlier [19:14:06] cmjohnson1: I'll live. :-) [19:14:33] we have yet to find a quality controller that supports jbod [19:14:54] JBOR0 will do the same. [19:15:00] faidon knows all about this from our search for controller for swift [19:34:24] Now /that/ is jbod. sda .. 
sdx :-) [19:51:34] Do we have a standard name for "storage partition"? Trying to name the volume group so that it'll be evident for everyone. [20:00:20] You guys don't normally have a separate /boot? [20:01:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:02:00] ... nor /var [20:02:07] Ryan_Lane: Help. :-) [20:02:22] Coren: do you already know the partman and netboot stuff in puppet? [20:02:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [20:02:35] should be existing partman recipes if you're lucky [20:02:56] mutante: Should be, but that's an atypical system. I can use it to guide the OS disk though. [20:02:57] /puppet/files/autoinstall/partman [20:03:37] i see, so to answer your question about having separate boot partitions [20:03:45] some have, some don't [20:03:59] f.e. lvm.cfg:mountpoint{ /boot } [20:04:05] in lvm.cfg , partman [20:04:18] Looking at lvm-noraid-large, which is closest to what I'm doing, there's a /boot [20:04:35] i would try to use an existing one if possible at all [20:04:45] and if not, create a new one as a copy of an existing one [20:06:13] * Coren fails to find a pattern between have /boot, vs dont and /var vs /run [20:07:06] And the gluster bricks don't seem to be there. :-) [20:07:43] Coren: legacy :) [20:08:00] it's always the answer to "why the fuck do you guys.... ?" [20:08:26] RECOVERY - Puppet freshness on db1053 is OK: puppet ran at Wed Apr 3 20:08:21 UTC 2013 [20:08:35] Do we know what the current fad is, then? :-) [20:09:22] Coren: see, I keep trying to push for us to switch to html9 responsive boilerstrap js [20:09:27] we only use /boot if it's necessary to have it [20:09:32] but it's a tough sell [20:09:34] e.g. with LVM [20:10:12] But but, /boot hasn't been necessary with lvm since LILO. :-) [20:10:40] Maybe grub1 still has trouble with it? 
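The back-and-forth above about which boxes have a separate /boot or /var is answerable per host by checking the mount table. A minimal sketch (the function name is an invention; it just looks for the path as a mount point in /proc/mounts):

```shell
# Succeed if the given path is its own mount point (e.g. a separate /boot),
# fail if it just lives on the parent filesystem.
has_own_mount() {
    awk -v t="$1" '$2 == t { found = 1 } END { exit !found }' /proc/mounts
}

# has_own_mount /boot && echo "separate /boot" || echo "/boot lives on /"
```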
[20:10:41] it was for something, otherwise we didn't add it ;) [20:10:48] not sure what the issue was [20:10:52] also not sure it's still relevant today [20:11:01] but it was necessary when we setup that partman recipe anyway ;) [20:11:16] Pretty sure it's not. Separate /var for bottling logs not traditional either I see. [20:11:43] i'm pretty sure it was necessary up to at least lucid [20:11:47] but precise, maybe not [20:12:38] This thing is just a fileserver, with the actual storage on another array. Just / it is. [20:21:21] Ryan_Lane: So, that config ends up with a raid1 split between controllers of the OS disk (1 each side), two 10-disk raid6 (5 each side) with LVM over it (32T usable), and two disks set aside for snapshots for replication and/or backups. [20:21:47] (Actually, the latter two are raid0 (one each side) for performance) [20:22:36] Performance wise, this should go wooosh! [20:24:46] Coren: hsoooooooooooooooow [20:29:34] RoanKattouw_away: So i see you put in patchset, shall I merge? [20:30:00] RobH: Yes please [20:30:20] Icinga will complain about the boxes otherwise [20:30:36] Although I guess it might still do that unless we put in an ensure=>absent, I don't know puppet well enough to tell [20:31:44] Coren: :) [20:34:14] !log updated webstatscollector package in apt repo to 0.1-3 [20:34:21] Logged the message, Master [20:35:25] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57334 [20:35:54] RoanKattouw: So once its merged and live on cluster, what machines specifically need puppet updates before I can pull them? [20:36:07] I don't think anything does [20:36:12] cool, mediawiki handles shit then [20:36:15] huzzah! [20:36:15] They're already depooled in pybal [20:36:20] good times [20:36:25] thanks! [20:36:25] No, MW doesn't even care, MW just goes to the load balancer [20:36:29] which is already aware [20:36:42] what is the lvs server for these? [20:36:49] just internal lvs1003(or 3) [20:36:50] ?
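Coren's "32T usable" figure above checks out with simple arithmetic. RAID6 gives up two disks' worth of capacity to parity per array; the 2TB-per-disk size is an assumption inferred from the quoted total, since the log never states it.

```shell
# Back-of-envelope check of the quoted labstore1001 layout:
# two 10-disk raid6 arrays, one per controller side.
raid6_usable_tb() {
    disks="$1"; per_disk_tb="$2"
    echo $(( (disks - 2) * per_disk_tb ))
}

per_array=$(raid6_usable_tb 10 2)   # one 10-disk raid6 of (assumed) 2TB disks
total=$(( per_array * 2 ))          # two such arrays
```

With those assumptions, `per_array` comes out to 16 and `total` to 32, matching the figure in the log.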
[20:36:55] The only thing is deployments might break for the next few hours until tin runs puppet, but we very rarely deploy changes anyway [20:36:57] Yes [20:37:05] lvs100{3,6} or something [20:37:10] ok, well, i'll push a puppet run on tin just in case [20:37:15] may as well eliminate potential error vectors [20:48:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56348 [20:53:19] New patchset: Lcarr; "temp removing caesium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57411 [20:53:40] PROBLEM - Host caesium is DOWN: PING CRITICAL - Packet loss = 100% [20:54:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57411 [21:05:35] PROBLEM - Parsoid on xenon is CRITICAL: Connection refused [21:06:58] New patchset: Dzahn; "decom xenon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57417 [21:07:48] What's up with https://upload.wikimedia.org/wikivoyage/he/a/a5/Luxembourg_districts.jpg ? [21:07:50] http 401... [21:09:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56640 [21:10:37] New patchset: MaxSem; "Check mobile site's HTTP status" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57419 [21:15:44] New patchset: Lcarr; "making caesium standard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57420 [21:16:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57420 [21:31:59] !log Running DNS update [21:32:07] Logged the message, Master [21:40:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57417 [21:43:48] New patchset: Andrew Bogott; "Remove gluster's broken logrotate script." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57426 [21:44:13] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:44:13] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [21:44:13] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [21:44:44] !log started iwlinks index migrations on all wikis [gerrit 43389] [21:44:53] Logged the message, Master [21:59:46] New patchset: Lcarr; "switching caesium to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57428 [21:59:46] New patchset: Lcarr; "no longer decom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57429 [22:00:59] Ryan_Lane: I don't want to go on a tangent on the list. But actually yes I'm not a big fan of wikidata. [22:01:19] It has big user experience problems at the moment. [22:01:19] New patchset: Dereckson; "(bug 46856) Rights configuration for Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57430 [22:05:11] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57428 [22:05:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57429 [22:06:14] StevenW: it's a new project [22:06:20] those issues can be worked out [22:06:24] * Ryan_Lane shrugs [22:06:30] Yeah in the long run. [22:06:34] are you opposed to the concept of semantic annotations? [22:06:41] I'm not a big fan of Wikipedia. [22:06:48] heh [22:06:48] i'm not a huge fan of mediawiki - it has major scaling issues and horrible performance [22:06:48] It has big user experience problems at the moment. 
[22:06:54] LeslieCarr: +1 [22:07:00] heh [22:07:30] that said, it's better than most open source software for managing content [22:09:21] StevenW: mediawiki is actually horrible for organizing content [22:09:46] it works for wikipedia because we have 85k people or so working on it [22:09:58] And it hardly works for that, other than in articles ;) [22:10:09] and they are masochistic enough to deal with it [22:10:33] I'm not disputing the sucky parts of the current system. I just don't believe the idea that migrating to SMW is going to solve all our problems. [22:10:33] properly using things like SMW and SF make it manageable with a much smaller group of people [22:10:49] I don't believe anyone said anything about solving all of our problems [22:11:10] all our problems with sanely organizing MediaWiki documentation [22:11:24] so I was looking at the gazillion categories on en wp today [22:11:31] plus subcategories plus etc etc [22:11:38] fortunately no one uses them for anything [22:11:44] typical page access works like: [22:11:51] search in google. click link. done [22:12:49] StevenW: this isn't about mediawiki documentation [22:12:50] so we have a very nice reference for answering a specific question [22:12:54] it's about non-mediawiki documentation [22:13:04] project documentation and such [22:13:18] which end up being the same thing in a lot of cases [22:15:20] binasher: Is your script doing dewiki currently? [22:15:29] And/or wikidata [22:15:44] Reedy: nope, enwiki [22:15:52] the script I'm running for wikidata is complaining of 17536 lag... [22:15:54] I would like to see annotation and tags on current pages (though if we had it they wouldn' be searchable in any helpful way by us, only via google) [22:16:04] Reedy: running in pmtpa? [22:16:10] Reedy: that would be the pages logging [22:16:15] xml [22:16:23] it will be done in [22:16:24] Aha [22:16:40] 7 hours or something [22:16:48] but it's just the one db [22:16:56] Hmm. 
Do I leave it running, or cancel and restart it tomorrow.. [22:17:01] StevenW: what specifically is the same? [22:17:04] if you can get your script to find another slave.... [22:17:14] I can't think of any real examples there. [22:17:25] mediawiki extension documentation -> mediawiki.org [22:17:37] project planning for it -> wikitech [22:17:46] Ugh [22:17:46] apergos: It's using mediawikis wfwaitforslaves(), so it'd have to be moved to eqiad [22:17:47] so yet another wiki [22:17:54] It's not urgent by any extent [22:17:58] ok [22:18:01] sorry about that [22:18:06] Don't worry [22:18:15] StevenW: people should be writing infrastructure documentation for stuff in wikitech already [22:18:19] It's gonna take a very long time to run anyway [22:18:25] if they aren't there's something wrong [22:18:28] what are you running? [22:18:52] "Wikibase/repo/maintenance/rebuildTermsSearchKey.php" [22:19:14] it's kind of dickish to 3rd parties to stuff mediawiki.org full of wikimedia specific documentation anyway [22:20:56] I only need to find what is the first ID that is '', and restart it from there [22:29:10] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:35] anyone seen http://pastebin.com/idvJygKH before ? [22:29:45] haven't seen this error on sockpuppet before [22:32:28] RobH: ? [22:32:59] ottomata: it's easier to deal with labs if we try to centralize work to areas, rather than specific things [22:33:14] so, all analytics under analytics makes things easier [22:33:33] it's possible to manage sudo policy so that everyone doesn't have root on instances, if that's a worry [22:39:12] i have seen that. [22:39:15] i just dont recall when [22:39:25] I think when we added a new subnet to row C we had to update the puppet server to talk to it. 
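Reedy's resume plan above ("find what is the first ID that is '', and restart it from there") is simple to script once the ids and search keys are dumped somewhere. A sketch over a hypothetical tab-separated `row_id<TAB>search_key` export; the log doesn't show the real table layout or how you'd produce the dump, and it assumes every line has both fields.

```shell
# Print the first row id whose search key is empty, i.e. where the
# interrupted rebuildTermsSearchKey run should pick up again.
first_empty_id() {
    awk -F'\t' '$2 == "" { print $1; exit }' "$1"
}
```

The id it prints would then be fed back to the maintenance script's resume option, whatever that is for this script.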
[22:39:43] LeslieCarr: your rdns is probably wrong [22:40:03] ahha [22:40:05] that may have been it as well [22:40:06] rdns is wrong [22:40:10] cool [22:42:13] Ryan_Lane, cool, that's fine [22:42:19] i have no preference really, so we can do under analytics [22:42:20] danke [22:42:25] cool [22:42:33] ottomata: let me know if instances fail to create [22:42:39] you're very likely to hit quotas [22:42:57] mmk [22:43:09] yeah we have i dunno, 7ish instances already in that project [22:43:10] ? [22:43:11] maybe 5? [22:43:49] usually you'll hit a quota some time before 10 [22:43:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57430 [22:43:52] definitely at 10 [22:43:58] let me just increase it now [22:44:28] oh. that project has already been increased [22:44:55] I upped it a bit more [22:44:56] cool, whats the limit? [22:45:14] 20 instances, 60 cores, 51200MB RAM, [22:45:20] !log reedy synchronized wmf-config/InitialiseSettings.php [22:45:21] mmk, cool [22:45:25] should be fiiiiine [22:45:27] Logged the message, Master [22:45:29] yeah [22:45:29] i'm out for the eve, thanks Ryan! [22:45:33] yw [22:45:37] see a [22:45:39] &ya [22:45:44] ugh. typing [22:51:18] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:54:28] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [22:54:39] eep checking out lvs1001 [22:55:01] hrm, lvs1001 seems ok [22:55:28] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:02:17] New review: Krinkle; "*bump*. Please finish this or allow us to get wikibugs in #mediawiki-visualeditor by other means." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [23:03:18] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [23:06:34] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [23:07:59] Which is humes counterpart in eqiad? [23:09:20] LeslieCarr: did you get puppet working? (your paste) [23:10:01] grrr, the rdns hasn't fallen out of cache yet [23:10:08] RobH: ^^ [23:10:25] Reedy: terbium? not sure [23:10:34] Reedy: terbium? not sure [23:10:38] Yeah, I got it [23:10:43] Then clicked part instead of copy [23:10:44] ;) [23:10:46] hehe [23:10:46] hah [23:11:50] reedy@terbium:~$ ls -al /usr/local/apache/common [23:11:50] lrwxrwxrwx 1 root root 12 Mar 12 19:50 /usr/local/apache/common -> common-local [23:11:53] All in red [23:12:04] Do we have any servers in eqiad that mwscript works on? :/ [23:12:23] there's an rt or bz or both on that [23:13:48] binasher: so; I'm going to run the centralnotice sql patch now unless you have any reason not to at this particular moment [23:15:11] Hello random apache [23:15:12] reedy@mw1001:~$ sudo -u apache php /usr/local/apache/common/multiversion/MWScript.php eval.php fiwiki [23:15:43] :D [23:15:52] mwalker: go for it [23:18:50] !log Updating CentralNotice schema on testwiki & metawiki with patch-centralnotice-2_3.sql [23:18:57] Logged the message, Master [23:34:39] What could be reasons for "401 Unauthorized" when trying to view any uploaded files in he.wikivoyage? [23:35:45] was reported as https://bugzilla.wikimedia.org/show_bug.cgi?id=46863 and I can reproduce [23:40:04] andre__: Swift sucking? [23:40:06] Fixed anyway [23:40:37] https://upload.wikimedia.org/wikivoyage/he/a/a5/Luxembourg_districts.jpg [23:40:39] WFM [23:40:53] Reedy, heh, now it also works for me again. [23:41:07] Reedy, so did you fix something? 
[23:41:10] confirmed, was broken earlier, fixed now [23:41:10] By say fixed anyway, I did something to try and fix it ;) [23:41:16] ah [23:41:19] thanks! [23:42:11] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:43:08] andre__: merged: bugzilla_report changes for urgent tickets and your realname on planet [23:43:15] cya [23:43:29] mutante, saw that. big thanks! [23:43:37] np