[00:00:01] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:00:44] yurik: well i wouldn't know the difference. for all i know you just filed it yourself [00:00:55] exactly :) [00:01:13] yurik: (or i did for that matter. there are tickets i've modififed that i can't see) [00:01:31] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [00:02:31] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused [00:03:11] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [00:03:29] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.329 second response time [00:03:41] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57210 [00:05:07] !log removed sda3/varnish.persist on cp1021, restarted varnish [00:05:14] Logged the message, Master [00:05:24] andrewbogott, is stat1 accessible directly or i should go through some bastion [00:05:44] stat1 is accessible directly [00:05:47] I believe it has a public IP but probably won't forever. [00:05:49] PROBLEM - DPKG on db1057 is CRITICAL: NRPE: Command check_dpkg not defined [00:05:51] yurik: you should bastion through bast1001 however to get practice [00:05:59] PROBLEM - Disk space on db1057 is CRITICAL: NRPE: Command check_disk_space not defined [00:06:03] stat1001 is the forever public one? [00:06:16] LeslieCarr: no, stat1 is tampa? 
[00:06:19] PROBLEM - RAID on db1057 is CRITICAL: NRPE: Command check_raid not defined [00:06:32] stat1 is tampa [00:06:38] so fenari [00:06:44] stat1001 is for hosting web apps, stat1 is for number crunching and will lose public ip soonish [00:06:47] well you can ssh through whatever [00:07:01] right, but why bounce around :) [00:07:14] well since most traffic is going via eqiad anyways ;) [00:08:47] ori-l, PHP Fatal error: require() [function.require]: Failed opening required '/usr/local/apache/common-local/php-1.21wmf12/extensions/PostEdit/PostEdit.hooks.php' [00:08:57] and A LOT of it in error log [00:09:19] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:09:28] yurik: anyway, may be 30ish mins before you can get everywhere unless andrew did manual puppet runs [00:09:37] andrewbogott: you merged on sockpuppet? [00:09:44] I did. [00:09:50] ori-l: Looks like 1.21wmf12 has an older version with no hooks file.. [00:10:04] danke :) [00:10:29] New patchset: Ryan Lane; "Don't require a specific version of opendj" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57226 [00:10:31] thanks andrewbogott , jeremyb_ [00:10:37] Reedy: looking, hang on [00:10:46] Neither does 1.22wmf1 [00:11:30] Reedy: but the version I just sync'ed doesn't reference that file [00:11:52] APC cache? [00:11:59] s/ cache// [00:12:39] Maybe. How do I check (and fix, if that's the problem)? [00:13:09] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [00:13:10] They stopped 12 minutes ago [00:13:12] Nothing to do [00:13:23] Just haven't been pushed out of the last 1000 lines due to a lack of other errors ;) [00:13:38] oh, I can always help with that [00:14:58] so, how do i avoid this in the future, if i need to remove a file? remove references to it, sync, and then actually remove it on a subsequent deployment? 
[00:15:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57226 [00:15:53] ^ Reedy [00:15:59] Seems a bit OTTYeah [00:16:02] blah [00:16:21] I wonder if rsync is removing the hooks file before the loader file has been loaded.. [00:17:19] In which case... Force the loader file first? Then sync-dir... Have an empty file to be deleted? [00:17:42] Or the easiest, don't care [00:17:43] :D [00:18:09] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:18:48] Reedy: from rsync man-page: --delete-after receiver deletes after transfer, not during [00:19:19] Are we using that? [00:19:23] * ori-l checks [00:20:23] nope [00:21:01] Sounds like it might be a good enhancement then [00:21:17] Reedy: there's also: --delete-delay find deletions during, delete after [00:21:29] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:25:24] anomie|away, did you deploy your fix? i was curious to see the process [00:25:57] yurik- Yeah, over an hour ago. What did you want to see? [00:26:14] just curious what steps are needed to do a depl like that [00:26:24] i'm sure i will have plenty of OMG bugs [00:28:09] yurik- Step 0 is talking to people (such as greg-g (sorry for the ping)). Then basically follow https://wikitech.wikimedia.org/wiki/How_to_deploy_code [00:29:14] right, so its similar to what max was showing today for the regular mobile frontend deployment. Will need to get a fenari account at some point [00:29:24] thx [00:30:00] TimStarling: ^ quick sanity check on that idea? (that is, using '--delay-updates --delete-delay' in sync-common-file to prevent brief but potentially harmful inconsistencies) [00:30:35] i'm worried that the flags are already set in some config file that i didn't know to look up [00:31:20] some useful stuff about delete-delay in this message and follow-ups: http://lists.samba.org/archive/rsync/2008-June/021107.html [00:32:29] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [00:37:46] ori-l: I think there might be a couple of places you might need to do it.. But shouldn't be hidden [00:38:22] file/scap [00:39:29] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:43:09] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [00:44:25] * jeremyb_ repeats: what to do with 4785/4685? [00:44:35] andrewbogott: whatchya think? [00:45:59] * jeremyb_ wonders if RT assumes that no one will ever make a mistake... [00:46:33] jeremyb, am I looking at what you're looking at? email for echo? redirect for wikimaps?
[00:46:50] andrewbogott: look at the last 3 msgs on wikimaps [00:47:33] * jeremyb_ RT fu is too weak for this situation :P [00:48:26] Hm… probably best to bug mark about those in the morning… I have neither an opinion nor relevant skills :) [00:49:10] i was thinking merge and make a new wikimaps [00:53:09] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:54:59] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [00:56:47] i just got a 503 from upload [00:58:02] yeah, there's network problems, machines keep maxing out [00:58:17] i'm working on making all the upload varnish machines into 2gig instead of 1gig [00:58:29] takes more steps than i thought [00:58:38] but at least there's for loops [00:58:38] oh, i didn't realize it was upload in particular [00:58:44] haha [01:00:48] yep [01:00:50] sigh [01:01:13] * jeremyb_ doesn't suppose there's anything he can do [01:01:59] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:02:29] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [01:04:55] !log applying a big change to varnish interface groups - risk higher than normal [01:05:03] Logged the message, Mistress of the network gear. [01:05:14] morebots doesn't care about risk [01:05:14] I am a logbot running on wikitech-static. [01:05:14] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [01:05:14] To log a message, type !log . [01:05:19] damnit [01:05:23] hehehe [01:05:28] hmm, has the puppet been synced? i still can't login to stat1 - no supported authenication methods, publickey [01:05:44] yurik: ssh -vv -> pastebin [01:07:55] cp1032 is standing out from the rest on ganglia. doesn't seem to be any different in site.pp though [01:08:05] i wonder what the difference is [01:08:13] yurik: you're in NYC or SF? 
[01:08:19] jeremyb_, NYC [01:08:28] yurik: pastebin? [01:08:30] i'm converting puttykey to ssh [01:08:35] sec :) [01:08:43] ohhh, putty [01:08:49] then you can't do ssh -vv :) [01:09:00] you're moving it to a different machine? [01:09:00] once i convert, i should be, right? :)( [01:09:06] putty has logs too [01:09:13] can you give more of those logs? [01:09:28] let me try with openssh [01:09:36] k [01:09:39] also, i will double check that i have the right pubkey submitted [01:09:53] i did it kinda by hand - taking the pub key and removing \n [01:09:55] you could also put that key on labs and try connecting to labs [01:10:06] true that [01:10:14] errr [01:10:26] actually you're not supposed to do that on second though [01:10:28] thought* [01:10:35] labs should be it's own key [01:10:51] (as a policy, not actually enforced) [01:11:04] doesn't matter as much for me because i never forward my agent [01:11:10] but some people do [01:11:22] much easier in labs with constant forwarding :) [01:12:38] hrmmm, yurik's not in wmf yet [01:12:47] he can be my other greg-g guinea pig [01:13:00] oh boy [01:13:05] ok, i got the log [01:13:14] does it have any security stuff? [01:13:22] i can't remember [01:13:39] (for puppet) [01:13:43] err [01:13:45] putty* [01:13:52] doesn't look like its a secret, pasting... [01:14:36] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [01:16:20] jeremyb_, that's a ssh -vv log [01:16:26] PROBLEM - Host cp1036 is DOWN: PING CRITICAL - Packet loss = 100% [01:16:28] yurik: ssh-keygen -lf /c/Users/User/.ssh/id_rsa.pub [01:16:56] RECOVERY - Host cp1036 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [01:16:58] cp1036 should be back up soon [01:16:59] hehe [01:17:28] is your login name really "User"? [01:17:39] and is your password really '... nevermind [01:18:28] ori-l: i mean in the path i gave above. that's literally what it was [01:18:45] jeremyb_, my key file is not in that dir [01:18:54] how do i specify an alternative key? [01:18:57] well where is it? :) [01:18:58] -i [01:19:06] ssh -i path/to/key -vv [01:19:17] jeremyb_, yes, -i is what i used for ssh [01:19:22] what abotu ssh-keygen [01:19:29] need to do conversion [01:19:30] sec [01:20:36] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:22:32] jeremyb_, same key as in puppet [01:22:39] i could paste it here if you want :) [01:22:46] (puppet is public anyway) [01:22:46] yes please :) [01:22:50] AAAAB3NzaC1yc2EAAAABJQAAAgEAks66YFTBrrC9Wv/rPwIf9cTJO1RxsXHMEcWJjosn9fxvUS57KAw2UrCwinu1T1Hng59V+grHxp2wY7Bke3NmYng2OQacH2HKekPFP3fG82OQlj0YRE52deNwlrfBIx7Yg915zpXjXSQi9D5DIncYN/8jE7Q3Shlw0yRfFLmP02zpiX0Vm1d+g8FM0aMaIPR80KlIFSADEYoo2LD9b9gKsIJQ3643geAlzjye7VTr+ojGaPrW7w+tB5ikPgtx8jQnve5UpfKaQHJcdS1of3GNy3/08i+gScog3oxkneBPIW0Wkb3sNwPZ2Y+vxYSIKzO6z/V/HGSNOYQJy7QJRApBav6sKZxdBSPGi3+6vgHxf4IgUVtikJGz [01:22:52] TZ2jtWoqNv/j4h4gfehPkr5hQBJIkJQwTM/JPPbWPGOiWmFQkZeDTsoZGgi5B9hmM3UlelN7egyDZXCEvCirR9moviYI9Dr8VQsT/koyRX3kYdEQV19bHiou+ze6mmKO3OI4EmHkdtR55J1cR3/+7Q8GCAfTiD2KKj7yUEjZMewdOcbZzn29AXkc+90wiuWUWxqan7T5iePRvNPfjHg6ntJDs3tG/WdgF8HluXcWGZHa1Fk2kobK+/WFkGz4CuW9asbUgg+2TOLjvYFzEKgKqS8194nf0WZvRnjy3oFeuj0wwdALmuZDnus= [01:23:06] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:23:20] it starts with ssh-rsa [01:23:25] errmmm, but that broke across lines. and really i want to see what ssh-keygen -lf says too [01:23:39] a yes, sec [01:24:21] jeremyb_, 4096 fe:4e:90:20:3f:45:3d:33:85:56:5b:bf:62:b8:11:c6 pub1 [01:25:08] ok, that matches what i have [01:26:04] New patchset: Lcarr; "making cp1021 to cp1036 into 2gig aggregated machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57237 [01:26:22] icinga says there's been a recent puppet run [01:26:49] jeremyb_, icinga has been lying to its masta [01:26:59] although... [01:27:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57237 [01:27:06] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [01:27:42] that puppet run is uncomfortably close to when it was merged [01:27:48] > puppet ran at Wed Apr 3 00:07:42 UTC 2013 [01:28:00] so, 80 mins ago was last run finish [01:28:19] yeah, and merge takes forever... what do you mean 80? isn't it suppose to run every 30 min? [01:29:06] that's my point [01:29:11] let's see if cp1022 survived [01:29:56] LeslieCarr: running puppet on 1022 first? [01:30:06] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:30:36] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [01:30:47] yeah i did [01:30:48] wtf [01:32:03] New patchset: Lcarr; "Revert "making cp1021 to cp1036 into 2gig aggregated machines"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57238 [01:32:19] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57238 [01:32:36] PROBLEM - RAID on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
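[Editor's note] The debugging above hinges on comparing `ssh-keygen -lf` fingerprints of the key on both ends (the log shows the older MD5-hex format; modern OpenSSH prints SHA256 by default and needs `-E md5` for the hex form). A sketch of the check, generating a throwaway key rather than using a real one; the PuTTY export command in the comment assumes the command-line `puttygen` tool is installed:

```shell
# Editor's sketch: a PuTTY .ppk key must first be exported to OpenSSH format,
# e.g.  puttygen key.ppk -O private-openssh -o id_rsa   (assumes CLI puttygen).
# Here we generate a throwaway RSA key just to demonstrate the fingerprint check.
set -e
dir=$(mktemp -d)
ssh-keygen -q -t rsa -b 2048 -N '' -f "$dir/id_rsa"
ssh-keygen -lf "$dir/id_rsa.pub"   # bit length, fingerprint, comment, key type
# To then test the key against a host with full debugging output:
#   ssh -i "$dir/id_rsa" -vv user@host
```

If the fingerprint printed locally matches the one computed from the public key checked into puppet, the key material itself is correct and the failure lies elsewhere (puppet not yet run, wrong username, etc.), which is where the conversation above ends up.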
[01:32:39] unsure why it failed [01:32:48] :-/ [01:33:06] PROBLEM - Host cp1029 is DOWN: PING CRITICAL - Packet loss = 100% [01:33:16] PROBLEM - Host cp1033 is DOWN: PING CRITICAL - Packet loss = 100% [01:33:39] that is all me [01:33:51] wtf cp1029 is alive [01:34:06] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [01:34:06] PROBLEM - Host cp1032 is DOWN: PING CRITICAL - Packet loss = 100% [01:34:08] ipv6 is happy, ipv4 is not [01:34:39] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [01:34:40] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [01:34:40] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [01:34:49] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [01:34:49] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [01:34:52] PROBLEM - RAID on wtp1004 is CRITICAL: Timeout while attempting connection [01:34:52] PROBLEM - RAID on wtp1002 is CRITICAL: Timeout while attempting connection [01:34:53] PROBLEM - RAID on wtp1003 is CRITICAL: Timeout while attempting connection [01:35:39] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [01:35:39] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [01:35:49] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [01:35:49] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:09] PROBLEM - Host cp1031 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:09] PROBLEM - RAID on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:39:19] PROBLEM - RAID on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:40:39] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [01:40:49] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.722 second response time [01:41:39] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.753 second response time [01:41:39] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.201 second response time [01:41:39] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.960 second response time [01:41:49] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63158 bytes in 5.856 second response time [01:41:53] jeremyb_, is it totally dead? or just somewhat? :) [01:42:09] PROBLEM - RAID on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:42:26] LeslieCarr: both nics are up [01:42:29] but the bond is down [01:42:33] idk. ops are tied up with upload issues. unless you can bribe Coren to look at the logs [01:42:39] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.987 second response time [01:42:51] nah, no rush [01:43:01] uploads are more important than this :) [01:43:19] PROBLEM - swift-container-replicator on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:43:19] PROBLEM - swift-object-updater on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:43:33] yep [01:43:35] ADDRCONF(NETDEV_UP): bond0: link is not ready [01:43:38] hm [01:43:39] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [01:43:39] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [01:43:39] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [01:43:39] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [01:43:39] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [01:43:47] and the mac address is all 00's [01:43:49] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [01:43:59] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:14] New patchset: coren; "Add labstore100[1-4] to dhcp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57239 [01:44:49] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.060 second response time [01:45:09] RECOVERY - swift-container-replicator on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [01:45:09] RECOVERY - swift-object-updater on ms-be3 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [01:49:49] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [01:50:49] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [01:50:53] New review: Jeremyb; "mixing uppercase and lowercase MACs :( (but already was inconsistent when you started)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57239 [01:54:08] New review: coren; "Simple addition with a review; pushing." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/57239 [01:54:09] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57239 [01:54:48] Ryan_Lane: any ideas ? [01:54:54] hm [01:55:12] LeslieCarr: https://www.kernel.org/doc/Documentation/networking/bonding.txt [01:55:27] there's a section that mentions reasons for a 00:00:00… address [01:55:36] but it's referencing vlans [01:55:52] yeah, so it does appear to make sense that it's waiting until a slave interface joins [01:55:58] now why aren't eth0 and eth1 joining bond0 [01:56:02] indeed [01:56:36] is it just a puppet issue and manually works? or you can't get it to work at all? [01:57:33] LeslieCarr: I just down'd eth1 and up'd it [01:57:44] maybe do the same with eth0? [01:57:47] if I do that it'll kick me out [01:57:52] basically can't get it to work at all [01:57:58] I'm assuming you're connecting via the console [01:58:12] not connected to the conosle of that one [01:58:13] i can though [01:58:25] oh that was weird [01:58:27] cp1033 ... [01:58:34] i had been doing ifdown and ifup a few times [01:58:38] then magically, voila, it comes up [01:58:44] did you do /etc/init.d/networking restart? [01:58:50] RECOVERY - Host cp1033 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [01:58:55] oh yeah, and even tried rebooting the damn box [01:58:58] heh [01:59:12] wtf [02:01:00] should bond-master be set on the devices?
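[Editor's note] The question just above (whether `bond-master` must be set on the slave interfaces) turns out to be the fix a few lines later: without it, ifupdown never enslaves the NICs, so bond0 stays down with the all-zeros MAC observed here (per the kernel's bonding.txt, the bond takes its MAC from its first slave). A minimal Debian/Ubuntu `/etc/network/interfaces` sketch; interface names, addresses, and bond mode are illustrative assumptions, not the production config:

```
# /etc/network/interfaces — minimal two-NIC bond (editor's sketch; names,
# mode, and addressing are assumptions, not the actual cp10xx config)
auto eth0
iface eth0 inet manual
    bond-master bond0   # without this line the slave never joins the bond

auto eth1
iface eth1 inet manual
    bond-master bond0

auto bond0
iface bond0 inet static
    address 10.64.0.10
    netmask 255.255.252.0
    gateway 10.64.0.1
    bond-slaves eth0 eth1
    bond-mode 802.3ad   # LACP aggregation, matching the switch-side LAG
    bond-miimon 100
```

After editing, the sequence used in the log (ifdown/ifup on both slaves, then `service networking restart`) is what makes the slaves re-enslave; until at least one joins, `ADDRCONF(NETDEV_UP): bond0: link is not ready` is exactly the symptom to expect.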
[02:01:35] that fixed it on cp1029 [02:01:44] adding bond-master [02:02:26] ah looks like that's not being added [02:02:32] why do you hate the world puppet [02:03:20] heh [02:03:31] LeslieCarr: it did this on virt2 as well [02:03:40] oh god yes [02:03:50] it's not in puppet [02:04:01] that's right [02:04:29] PROBLEM - NTP on cp1022 is CRITICAL: NTP CRITICAL: Offset unknown [02:07:24] ahha think i got it [02:07:43] so add the bond-master, ifdown eth0 ifdown eth1 ifup eth0 ifup eth1 service networking restart [02:07:59] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:08:35] :) [02:08:35] ahha! [02:08:44] these ip route change default via 10.64.0.1 dev eth0 metric 100 initcwnd 10 [02:08:49] RECOVERY - Host cp1029 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [02:08:53] instead of bond0 [02:09:19] RECOVERY - NTP on cp1022 is OK: NTP OK: Offset -0.0003471374512 secs [02:13:49] RECOVERY - Host cp1031 is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms [02:14:07] New patchset: Lcarr; "Revert "Revert "making cp1021 to cp1036 into 2gig aggregated machines""" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57247 [02:14:44] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57247 [02:14:49] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [02:16:29] RECOVERY - Host cp1032 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [02:17:49] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:19:31] !log LocalisationUpdate completed (1.21wmf12) at Wed Apr 3 02:19:30 UTC 2013 [02:19:38] Logged the message, Master [02:20:39] PROBLEM - DPKG on ms-be8 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [02:20:59] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [02:21:19] PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:19] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:21] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:22] PROBLEM - MySQL Slave Running on db1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:21:29] PROBLEM - MySQL Idle Transactions on db1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:21:29] PROBLEM - Host cp1025 is DOWN: PING CRITICAL - Packet loss = 100% [02:21:39] PROBLEM - Apache HTTP on mw1210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:39] PROBLEM - Apache HTTP on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:49] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:49] PROBLEM - Apache HTTP on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:49] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:49] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:49] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:56] icinga!
[02:22:36] this is going to be a long night [02:22:59] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.514 second response time [02:22:59] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.000 second response time [02:22:59] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.858 second response time [02:22:59] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.205 second response time [02:22:59] RECOVERY - Apache HTTP on mw1215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.074 second response time [02:22:59] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.832 second response time [02:22:59] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.090 second response time [02:23:00] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.180 second response time [02:23:00] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.991 second response time [02:23:01] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.002 second response time [02:23:01] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.073 second response time [02:23:02] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63156 bytes in 1.908 second response time [02:23:02] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay seconds [02:23:03] RECOVERY - MySQL Slave Delay on db1017 is OK: OK replication delay seconds [02:23:09] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [02:23:09] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [02:23:09] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63158 bytes in 0.216 second response time [02:23:11] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63156 bytes in 0.512 second response time [02:23:14] RECOVERY - MySQL Slave Running on db1017 is OK: OK replication [02:23:19] RECOVERY - MySQL Idle Transactions on db1017 is OK: OK longest blocking idle transaction sleeps for seconds [02:23:29] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [02:23:29] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [02:23:29] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.052 second response time [02:23:39] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.047 second response time [02:23:39] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [02:23:39] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [02:23:39] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [02:23:39] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [02:23:39] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [02:23:39] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [02:23:40] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time [02:23:40] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [02:23:41] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [02:23:41] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [02:23:42] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [02:23:42] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [02:23:43] RECOVERY - Apache HTTP on mw1186 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [02:23:43] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:23:44] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [02:23:44] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [02:23:45] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [02:23:45] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [02:23:46] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [02:23:46] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:23:47] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [02:23:47] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [02:23:48] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [02:23:48] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:23:59] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21223 bytes in 0.743 second response time [02:25:39] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [02:26:29] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [02:27:29] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 26.15 ms [02:27:59] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [02:28:29] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [02:29:05] heh go icinga [02:31:09] RECOVERY - Host cp1025 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [02:31:45] !log LocalisationUpdate completed (1.22wmf1) at Wed Apr 3 02:31:44 UTC 2013 [02:31:52] Logged the message, Master [02:31:59] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [02:42:22] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [02:44:02] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:44:22] PROBLEM - Host cp1026 is DOWN: PING CRITICAL - Packet loss = 100% [02:44:22] PROBLEM - Host cp1024 is DOWN: PING CRITICAL - Packet loss = 100% [02:45:22] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [02:46:22] PROBLEM - RAID on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:46:42] PROBLEM - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:46:52] RECOVERY - Host cp1024 is UP: PING OK - Packet loss = 0%, RTA = 21.40 ms [02:48:42] RECOVERY - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 469 bytes in 0.036 second response time [02:50:13] RECOVERY - Host cp1026 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [02:55:02] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [02:58:02] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:59:02] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [03:01:02] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [03:01:42] PROBLEM - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [03:03:37] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [03:03:37] RECOVERY - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 712 bytes in 0.186 second response time [03:05:30] !log the final upload varnish is on 2gig instead of 1gig ! win! [03:05:37] Logged the message, Mistress of the network gear. [03:06:07] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [03:09:36] LeslieCarr: first graph - http://gdash.wikimedia.org/dashboards/reqerror/ [03:10:02] current: for both looks a lot better [03:11:19] cool [03:12:44] some machines are screwy in ganglia [03:12:47] PROBLEM - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [03:13:04] like how do you have a single machine with over a petabit of network capacity? 
[03:13:10] also, icinga ^ [03:13:37] RECOVERY - LVS HTTP IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 712 bytes in 0.006 second response time [03:14:34] only ipv6 has been flapping most recently [03:15:30] LeslieCarr: any chance of ipv6 specific issues with bonding? [03:15:55] oh [03:16:03] oh! yes, need to rerun puppet is a possibility [03:16:07] lemme do that [03:16:10] (thank you salt!) [03:16:31] maybe that will fix the ganglia craziness [03:16:43] and what about cp1021? ganglia says it's down [03:17:12] the ganglia petabyte craziness isn't really possible [03:17:15] to fix [03:19:56] Who do I talk to tomorrow if I want a pair of eyes to see if a server is actually physically wired? :-) [03:20:55] holy shit [03:21:03] cp1028 is already re-maxing itself out ? [03:21:04] wtf ? [03:23:09] Coren: just drop a ticket tonight and maybe it will already have been looked at when you get up :) [03:23:49] eqiad or pmtpa queue? [03:24:37] Coren: this is one of your new ones? must be eqiad based on dhcp conf [03:25:02] eqiad [03:25:36] so drop a ticket in eqiad :) [03:25:54] On my way to RT now. :-) [03:29:07] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:32:07] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [03:37:08] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:38:14] New patchset: MZMcBride; "Install lilypond on Apache nodes (used by Score extension)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56577 [03:52:57] yurik: still there? [03:53:09] New review: Krinkle; "I heard some talk recently that sounded like it was basically implementing what this commit does."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/15561 [03:53:13] things are getting kinda quiet in opsland :) [03:54:32] Krinkle: the new way is kinda the reverse of that. but not quite ready to do yet because waiting a bit on legal/paperwork [03:55:01] jeremyb_, yep [03:55:13] and still can't connect :) [03:55:50] someone want to check the logs for yurik ? or you're all going to get beer immediately? :-) [03:56:14] puppetd log @ stat1.wikimedia.org (and auth.log too i guess) [03:59:40] yurik: i just dealt with broken site, i'm sorry but not today [03:59:44] not unless it's site broken [03:59:51] it's not! [03:59:59] it's new access that's not working yet [04:00:46] then no [04:00:54] tomorrow [04:01:07] today i am going to sign off soon as i am sure that nothing will explode again [04:01:08] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [04:01:44] right :) [04:03:38] robla missed the party [04:03:51] the cake was a lie [04:04:05] party? [04:04:45] somehow I'm guessing that's a euphemism for something less than a party. just guessin' [04:05:32] perhaps scroll up and notice a pattern ? 
[04:05:44] of upload failures [04:07:30] LeslieCarr++ [04:07:52] i'm out until the phone pages again [04:07:52] bye [04:08:07] see ya [04:08:08] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [04:08:14] * robla continues to read the backlog [04:09:12] jeremyb_: "err: Failed to apply catalog: Parameter key failed: Key must not contain whitespace" [04:09:27] followed by yurik's key, whitespace-free as far as i can tell [04:10:03] hmmmm [04:11:08] (hi, robla) [04:11:20] howdy ori-l [04:11:55] I gather there was a configuration change which broke a pybal check on the upload varnishes [04:12:09] which caused half of them to be depooled, and then the other half were overloaded [04:12:38] the check was fixed but didn't take effect because pybal hadn't been restarted [04:13:29] LeslieCarr, absolutely no rush on that one :) [04:13:39] yurik: she left already [04:13:50] hehe, i stepped away for a sec [04:14:10] ori-l: do you have line #s, etc. ? can you pastebin? [04:14:44] jeremyb_: the key is surrounded by double quotes, which puppet will parse to interpolate strings [04:14:58] i know [04:15:00] i thought about that [04:15:01] there's no '$' in the key but i wonder if there is some other sequence that is tripping up puppet [04:15:04] but there's no $ in it [04:15:08] i have an idea [04:15:23] * ori-l waits for it [04:17:12] ori-l: how are you reading this anyway? you have sudo there? [04:17:47] the last time yurik encountered some baffling and mysterious bug it came down to CRLFs, i bet you $5 this is some funky dos<->unix issue too [04:18:01] jeremyb_: no. /var/log/puppet is root-only but /var/log/puppet.log is not [04:18:12] ahhhhh [04:18:13] cool [04:18:26] do you see handrade in the log? [04:19:52] Special:BannerRandom is 10% of our apache request rate [04:21:22] jeremyb_: oh, i found the issue [04:21:22] sec [04:21:22] ori-l: ; vs. , ?
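The failure modes being debugged above — puppet's "Key must not contain whitespace" error, the worry about `$` interpolation inside a double-quoted key, and the CRLF/dos<->unix theory — can all be caught by a small lint pass over a public-key line before it is committed. A minimal sketch in Python; the function name and the exact set of checks are illustrative, not anything from the puppet repo:

```python
import re

def check_authorized_key(line):
    """Return a list of problems found in one public-key line.

    Covers the failure modes from the log: embedded carriage returns
    (CRLF line endings), a '$' that puppet would try to interpolate in
    a double-quoted string, and a duplicated "ssh-rsa" pasted into the
    key material itself.
    """
    problems = []
    if "\r" in line:
        problems.append("carriage return (CRLF line ending?)")
    if "$" in line:
        problems.append("'$' would be interpolated in a puppet double-quoted string")
    parts = line.strip().split()
    if len(parts) < 2:
        problems.append("expected '<type> <base64-blob> [comment]'")
        return problems
    key_type, blob = parts[0], parts[1]
    if key_type not in ("ssh-rsa", "ssh-dss", "ssh-ed25519"):
        problems.append("unknown key type %r" % key_type)
    if "ssh-rsa" in blob:
        problems.append("key type repeated inside key material")
    if not re.fullmatch(r"[A-Za-z0-9+/=]+", blob):
        problems.append("non-base64 characters in key material")
    return problems
```

Run against a clean `ssh-rsa AAAA... comment` line it returns an empty list; against a key with an extra embedded " ssh-rsa " (the actual bug fixed in r57258) it flags the repeated key type.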
[04:23:55] New patchset: Ori.livneh; "Fix extra characters in bblack's SSH key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57258 [04:24:05] ^ that [04:24:24] hahahaha [04:24:38] redmond sadly exculpated [04:24:41] the one i didn't bother reviewing was the problem [04:25:21] that should have caused puppet issues everywhere though [04:25:28] lets see what happens after it's merged :) [04:27:40] New patchset: Ori.livneh; "Trim extraneous " ssh-rsa " from SSH key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57258 [04:28:12] ugh, I keep forgetting that gerrit doesn't wrap lines if you use the web UI to edit a commit [04:30:57] New patchset: Ori.livneh; "Trim extraneous " ssh-rsa " from SSH key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57258 [04:30:59] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [04:31:39] New patchset: Pyoungmeister; "setting ram to actually possible levels for the sanitarium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57259 [04:32:02] ori-l: you changed it to a comma? [04:32:11] no, it wasn't the semicolon [04:32:14] look again [04:32:37] i'm telling you, you changed it to a comma [04:32:53] yes, but that's gratuit [04:33:04] ok. i wondered if there was a reason [04:34:23] just good style [04:35:05] and you have a free dot to spend elsewhere [04:52:35] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [05:00:35] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [05:13:30] New review: Yurik; "did anyone even look at the previous patch? 
:)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57258 [05:15:45] yurik: 03 04:24:41 < jeremyb_> the one i didn't bother reviewing was the problem [05:24:33] New review: Jeremyb; "fu I8d624af5b7f2565064116" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57199 [05:25:28] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [05:26:00] New patchset: Jeremyb; "Trim extraneous " ssh-rsa " from SSH key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57258 [05:30:28] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [06:08:57] <^demon> !log gerrit: finished doing jgit gc on mediawiki/core. Repo size 3G -> 323M. Fresh clone time to <5m to localhost, <1.5m to other hosts inside wmf. I rock. Backup's in /home/demon/core.git in case something goes wrong. Bed time. [06:09:05] Logged the message, Master [06:14:50] Coren: the way to find a physical location is racktables. 
idk if you have access yet [06:36:13] PROBLEM - Puppet freshness on srv287 is CRITICAL: Puppet has not run in the last 10 hours [06:46:13] PROBLEM - Puppet freshness on analytics1009 is CRITICAL: Puppet has not run in the last 10 hours [06:46:13] PROBLEM - Puppet freshness on mw74 is CRITICAL: Puppet has not run in the last 10 hours [06:46:13] PROBLEM - Puppet freshness on mw1084 is CRITICAL: Puppet has not run in the last 10 hours [06:46:13] PROBLEM - Puppet freshness on mw1049 is CRITICAL: Puppet has not run in the last 10 hours [06:46:53] PROBLEM - DPKG on ms-be4 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:47:13] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [06:47:13] PROBLEM - Puppet freshness on mw1047 is CRITICAL: Puppet has not run in the last 10 hours [06:47:13] PROBLEM - Puppet freshness on mw1105 is CRITICAL: Puppet has not run in the last 10 hours [06:47:13] PROBLEM - Puppet freshness on mw1075 is CRITICAL: Puppet has not run in the last 10 hours [06:47:13] PROBLEM - Puppet freshness on mw1137 is CRITICAL: Puppet has not run in the last 10 hours [06:47:14] PROBLEM - Puppet freshness on mw1165 is CRITICAL: Puppet has not run in the last 10 hours [06:47:14] PROBLEM - Puppet freshness on mw1179 is CRITICAL: Puppet has not run in the last 10 hours [06:47:15] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [06:47:15] PROBLEM - Puppet freshness on mw45 is CRITICAL: Puppet has not run in the last 10 hours [06:47:16] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours [06:47:16] PROBLEM - Puppet freshness on search1005 is CRITICAL: Puppet has not run in the last 10 hours [06:47:17] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours [06:47:17] PROBLEM - Puppet freshness on search1015 is CRITICAL: Puppet has not run in the last 10 hours [06:47:18] PROBLEM - Puppet freshness on 
srv291 is CRITICAL: Puppet has not run in the last 10 hours [06:47:33] PROBLEM - DPKG on ms-be2 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:48:03] PROBLEM - DPKG on ms-be9 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:48:13] PROBLEM - Puppet freshness on mw1078 is CRITICAL: Puppet has not run in the last 10 hours [06:48:13] PROBLEM - Puppet freshness on mw1094 is CRITICAL: Puppet has not run in the last 10 hours [06:48:13] PROBLEM - Puppet freshness on mw1169 is CRITICAL: Puppet has not run in the last 10 hours [06:48:13] PROBLEM - Puppet freshness on mw1030 is CRITICAL: Puppet has not run in the last 10 hours [06:48:13] PROBLEM - Puppet freshness on mw1136 is CRITICAL: Puppet has not run in the last 10 hours [06:48:14] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Puppet has not run in the last 10 hours [06:48:14] PROBLEM - Puppet freshness on mw1198 is CRITICAL: Puppet has not run in the last 10 hours [06:48:15] PROBLEM - Puppet freshness on mw39 is CRITICAL: Puppet has not run in the last 10 hours [06:48:15] PROBLEM - Puppet freshness on mw1181 is CRITICAL: Puppet has not run in the last 10 hours [06:48:16] PROBLEM - Puppet freshness on analytics1008 is CRITICAL: Puppet has not run in the last 10 hours [06:48:16] PROBLEM - Puppet freshness on mw95 is CRITICAL: Puppet has not run in the last 10 hours [06:48:17] PROBLEM - Puppet freshness on srv272 is CRITICAL: Puppet has not run in the last 10 hours [06:48:17] PROBLEM - Puppet freshness on srv264 is CRITICAL: Puppet has not run in the last 10 hours [06:48:18] PROBLEM - Puppet freshness on mw99 is CRITICAL: Puppet has not run in the last 10 hours [06:48:18] PROBLEM - Puppet freshness on wtp1002 is CRITICAL: Puppet has not run in the last 10 hours [06:48:33] PROBLEM - DPKG on ms-be1 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:49:13] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: Puppet has not run in the last 10 hours [06:49:13] PROBLEM - Puppet freshness 
on mw1074 is CRITICAL: Puppet has not run in the last 10 hours [06:49:13] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Puppet has not run in the last 10 hours [06:49:13] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: Puppet has not run in the last 10 hours [06:49:13] PROBLEM - Puppet freshness on mw36 is CRITICAL: Puppet has not run in the last 10 hours [06:49:14] PROBLEM - Puppet freshness on mw1138 is CRITICAL: Puppet has not run in the last 10 hours [06:49:14] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [06:49:15] PROBLEM - Puppet freshness on mw1116 is CRITICAL: Puppet has not run in the last 10 hours [06:49:15] PROBLEM - Puppet freshness on mw26 is CRITICAL: Puppet has not run in the last 10 hours [06:49:16] PROBLEM - Puppet freshness on mw117 is CRITICAL: Puppet has not run in the last 10 hours [06:49:16] PROBLEM - Puppet freshness on mw64 is CRITICAL: Puppet has not run in the last 10 hours [06:49:17] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Puppet has not run in the last 10 hours [06:49:17] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours [06:49:18] PROBLEM - Puppet freshness on search1023 is CRITICAL: Puppet has not run in the last 10 hours [06:50:13] PROBLEM - Puppet freshness on mw1022 is CRITICAL: Puppet has not run in the last 10 hours [06:50:13] PROBLEM - Puppet freshness on mw1040 is CRITICAL: Puppet has not run in the last 10 hours [06:50:13] PROBLEM - Puppet freshness on mw1062 is CRITICAL: Puppet has not run in the last 10 hours [06:50:13] PROBLEM - Puppet freshness on mw1218 is CRITICAL: Puppet has not run in the last 10 hours [06:50:13] PROBLEM - Puppet freshness on mw107 is CRITICAL: Puppet has not run in the last 10 hours [06:50:14] PROBLEM - Puppet freshness on mw53 is CRITICAL: Puppet has not run in the last 10 hours [06:50:14] PROBLEM - Puppet freshness on mw1185 is CRITICAL: Puppet has not run in the last 10 hours [06:50:15] 
PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours [06:50:15] PROBLEM - Puppet freshness on srv296 is CRITICAL: Puppet has not run in the last 10 hours [06:51:13] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [06:51:13] PROBLEM - Puppet freshness on mw1072 is CRITICAL: Puppet has not run in the last 10 hours [06:51:13] PROBLEM - Puppet freshness on mw1132 is CRITICAL: Puppet has not run in the last 10 hours [06:51:13] PROBLEM - Puppet freshness on mw111 is CRITICAL: Puppet has not run in the last 10 hours [06:51:13] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:51:14] PROBLEM - Puppet freshness on mw1219 is CRITICAL: Puppet has not run in the last 10 hours [06:51:14] PROBLEM - Puppet freshness on mw1178 is CRITICAL: Puppet has not run in the last 10 hours [06:51:15] PROBLEM - Puppet freshness on mw33 is CRITICAL: Puppet has not run in the last 10 hours [06:51:15] PROBLEM - Puppet freshness on mw87 is CRITICAL: Puppet has not run in the last 10 hours [06:51:16] PROBLEM - Puppet freshness on srv285 is CRITICAL: Puppet has not run in the last 10 hours [06:52:13] PROBLEM - Puppet freshness on caesium is CRITICAL: Puppet has not run in the last 10 hours [06:52:13] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [06:52:13] PROBLEM - Puppet freshness on mw1045 is CRITICAL: Puppet has not run in the last 10 hours [06:52:14] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Puppet has not run in the last 10 hours [06:52:14] PROBLEM - Puppet freshness on mw1080 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on mw1108 is CRITICAL: Puppet 
has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on mw1120 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on mw1073 is CRITICAL: Puppet has not run in the last 10 hours [06:54:13] PROBLEM - Puppet freshness on mw1008 is CRITICAL: Puppet has not run in the last 10 hours [06:54:14] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours [06:54:14] PROBLEM - Puppet freshness on mw1082 is CRITICAL: Puppet has not run in the last 10 hours [06:54:15] PROBLEM - Puppet freshness on srv243 is CRITICAL: Puppet has not run in the last 10 hours [06:54:15] PROBLEM - Puppet freshness on srv293 is CRITICAL: Puppet has not run in the last 10 hours [06:54:16] PROBLEM - Puppet freshness on mw1199 is CRITICAL: Puppet has not run in the last 10 hours [06:54:16] PROBLEM - Puppet freshness on mw25 is CRITICAL: Puppet has not run in the last 10 hours [06:54:17] PROBLEM - Puppet freshness on mw1009 is CRITICAL: Puppet has not run in the last 10 hours [06:54:17] PROBLEM - Puppet freshness on srv251 is CRITICAL: Puppet has not run in the last 10 hours [06:54:18] PROBLEM - Puppet freshness on mw1148 is CRITICAL: Puppet has not run in the last 10 hours [06:54:18] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours [06:54:19] PROBLEM - Puppet freshness on ocg2 is CRITICAL: Puppet has not run in the last 10 hours [06:55:13] PROBLEM - Puppet freshness on mw1060 is CRITICAL: Puppet has not run in the last 10 hours [06:55:13] PROBLEM - Puppet freshness on mw1203 is CRITICAL: Puppet has not run in the last 10 hours [07:02:52] RECOVERY - DPKG on ms-be4 is OK: All packages OK [07:05:14] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57258 [07:05:32] RECOVERY - DPKG on ms-be2 is OK: All packages OK [07:07:29] apergos: around? 
[07:07:38] yes [07:08:07] paravoid: [07:08:43] hey [07:10:02] RECOVERY - DPKG on ms-be9 is OK: All packages OK [07:12:53] so, how's the C2100 replacement going? [07:12:55] 4 boxes left, right? [07:13:03] hi, any brave soul to give some guidance on how to run puppet apply on labs (and tweak .pp file to actually run?) I already got the selfhosted puppet up, but don't know how to decipher the cryptic messages like Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type monitor_group at /var/lib/git/operations/puppet/manifests/varnish.pp:3 on node mobile-varnish.pmtpa.wmf [07:13:03] yes, slow but steady [07:13:04] labs [07:13:38] yurik: apply won't work [07:13:58] paravoid, i thought if i change the .pp file a bit, i can make it run on labs? [07:14:24] my ultimate goal - get a copy of all mobile varnish config on the labs instance [07:14:33] so if there is an easier way, i'm all for it :) [07:14:49] it doesn't have to be puppetizable [07:14:52] self-hosted puppet is the way to go [07:14:57] i already got that [07:15:02] this runs a local puppetmaster [07:15:06] hello [07:15:08] that you then run the agent against [07:15:38] i got the selfhosted on the instance from https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [07:15:54] how do i run just the varnish stuff?
[07:16:04] i might also need cache [07:16:08] not sure [07:16:10] hi hashar [07:16:16] i heard you are a guru in this :) [07:16:33] RECOVERY - DPKG on ms-be1 is OK: All packages OK [07:17:26] yurik, I actually think that this work could be done inside the ops team [07:17:34] I mean, we won't stop you if you want to do all that [07:18:03] the varnish work I mean [07:18:25] paravoid, i have no objections if someone else could do it :) moreover, its the best course of events :) its just that i need to start developing, and need to have some sort of a testing rig [07:19:15] I was talking about the carrier ip thing [07:19:16] paravoid, do you know of any estimated timelines for this, so that i can tell dfoy to wait with the varnish stuff? [07:19:53] paravoid, are you saying you would rather do your own geoip-like db encoding? [07:19:59] and just set the X-CS for us? [07:20:07] again, no objections there :) [07:20:36] i could concentrate on the zero extension then [07:21:08] I'm saying that I think this may fall into ops territory [07:21:32] but I'm also saying that we're busy and this may take some time [07:21:49] yep :( hence i'm trying to help out :) [07:21:55] it works now, so it shouldn't be a huge deal to wait [07:22:15] you have a point too. [07:22:19] maybe i should push this back [07:23:33] will talk to dfoy tomorrow, see what he thinks. 
I would much rather not deal with it obviously, although a general knowledge of varnish & puppets might come in handy [07:24:03] nod [07:24:15] so we'll obviously help you if you want to get varnish/puppet experience [07:24:24] but I don't think you should feel obligated to fix this [07:24:52] PROBLEM - Puppet freshness on snapshot3 is CRITICAL: Puppet has not run in the last 10 hours [07:25:49] yurik, is it something like early morning in your parts?:) [07:25:52] PROBLEM - Puppet freshness on mw1058 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1061 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1070 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1085 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1119 is CRITICAL: Puppet has not run in the last 10 hours [07:25:52] PROBLEM - Puppet freshness on mw1157 is CRITICAL: Puppet has not run in the last 10 hours [07:25:53] PROBLEM - Puppet freshness on mw1170 is CRITICAL: Puppet has not run in the last 10 hours [07:25:53] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours [07:25:54] PROBLEM - Puppet freshness on mw48 is CRITICAL: Puppet has not run in the last 10 hours [07:25:54] PROBLEM - Puppet freshness on mw56 is CRITICAL: Puppet has not run in the last 10 hours [07:25:55] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [07:25:55] PROBLEM - Puppet freshness on srv235 is CRITICAL: Puppet has not run in the last 10 hours [07:25:56] PROBLEM - Puppet freshness on srv274 is CRITICAL: Puppet has not run in the last 10 hours [07:25:56] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [07:26:05] paravoid, so how would 
this cross-team request be done? Should i tell dfoy to talk to asher? or who puts it into ops backlog? [07:26:16] MaxSem, yeah, 3:30am, best time to be productive :) [07:26:27] RT [07:26:41] and then all-out assault on IRC [07:26:52] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: Puppet has not run in the last 10 hours [07:26:52] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Puppet has not run in the last 10 hours [07:26:52] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [07:26:52] PROBLEM - Puppet freshness on hooper is CRITICAL: Puppet has not run in the last 10 hours [07:26:52] PROBLEM - Puppet freshness on mw1039 is CRITICAL: Puppet has not run in the last 10 hours [07:26:52] PROBLEM - Puppet freshness on mw1177 is CRITICAL: Puppet has not run in the last 10 hours [07:26:53] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Puppet has not run in the last 10 hours [07:26:53] PROBLEM - Puppet freshness on mw1191 is CRITICAL: Puppet has not run in the last 10 hours [07:26:55] PROBLEM - Puppet freshness on mw120 is CRITICAL: Puppet has not run in the last 10 hours [07:26:55] PROBLEM - Puppet freshness on mw67 is CRITICAL: Puppet has not run in the last 10 hours [07:26:55] PROBLEM - Puppet freshness on mw78 is CRITICAL: Puppet has not run in the last 10 hours [07:26:55] PROBLEM - Puppet freshness on mw96 is CRITICAL: Puppet has not run in the last 10 hours [07:26:56] PROBLEM - Puppet freshness on search1009 is CRITICAL: Puppet has not run in the last 10 hours [07:26:56] PROBLEM - Puppet freshness on search1013 is CRITICAL: Puppet has not run in the last 10 hours [07:26:57] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [07:26:57] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [07:26:58] PROBLEM - Puppet freshness on srv239 is CRITICAL: Puppet has not run in the last 10 hours [07:26:58] PROBLEM - Puppet freshness
on srv247 is CRITICAL: Puppet has not run in the last 10 hours [07:26:59] PROBLEM - Puppet freshness on srv252 is CRITICAL: Puppet has not run in the last 10 hours [07:26:59] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Puppet has not run in the last 10 hours [07:27:48] yurik: why asher? :) [07:27:52] PROBLEM - Puppet freshness on mw1013 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1036 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1019 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1055 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1076 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1167 is CRITICAL: Puppet has not run in the last 10 hours [07:27:52] PROBLEM - Puppet freshness on mw1195 is CRITICAL: Puppet has not run in the last 10 hours [07:27:53] PROBLEM - Puppet freshness on mw17 is CRITICAL: Puppet has not run in the last 10 hours [07:27:53] PROBLEM - Puppet freshness on mw63 is CRITICAL: Puppet has not run in the last 10 hours [07:27:54] PROBLEM - Puppet freshness on mw71 is CRITICAL: Puppet has not run in the last 10 hours [07:27:54] PROBLEM - Puppet freshness on mw80 is CRITICAL: Puppet has not run in the last 10 hours [07:27:55] PROBLEM - Puppet freshness on mw84 is CRITICAL: Puppet has not run in the last 10 hours [07:27:55] PROBLEM - Puppet freshness on snapshot1001 is CRITICAL: Puppet has not run in the last 10 hours [07:27:56] PROBLEM - Puppet freshness on srv254 is CRITICAL: Puppet has not run in the last 10 hours [07:27:56] PROBLEM - Puppet freshness on srv257 is CRITICAL: Puppet has not run in the last 10 hours [07:27:59] yurik: RT is the way to go [07:28:52] PROBLEM - Puppet freshness on analytics1013 is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on 
analytics1026 is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on iron is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on mw1014 is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on mw1035 is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Puppet has not run in the last 10 hours [07:28:52] PROBLEM - Puppet freshness on mw1079 is CRITICAL: Puppet has not run in the last 10 hours [07:28:53] PROBLEM - Puppet freshness on mw1098 is CRITICAL: Puppet has not run in the last 10 hours [07:28:53] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Puppet has not run in the last 10 hours [07:28:54] PROBLEM - Puppet freshness on mw1133 is CRITICAL: Puppet has not run in the last 10 hours [07:28:54] PROBLEM - Puppet freshness on mw1149 is CRITICAL: Puppet has not run in the last 10 hours [07:28:55] PROBLEM - Puppet freshness on mw1151 is CRITICAL: Puppet has not run in the last 10 hours [07:28:55] PROBLEM - Puppet freshness on mw1168 is CRITICAL: Puppet has not run in the last 10 hours [07:28:56] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Puppet has not run in the last 10 hours [07:28:56] PROBLEM - Puppet freshness on mw1172 is CRITICAL: Puppet has not run in the last 10 hours [07:28:57] PROBLEM - Puppet freshness on mw1216 is CRITICAL: Puppet has not run in the last 10 hours [07:28:57] PROBLEM - Puppet freshness on mw18 is CRITICAL: Puppet has not run in the last 10 hours [07:28:58] PROBLEM - Puppet freshness on mw32 is CRITICAL: Puppet has not run in the last 10 hours [07:28:58] PROBLEM - Puppet freshness on mw69 is CRITICAL: Puppet has not run in the last 10 hours [07:28:59] PROBLEM - Puppet freshness on mw86 is CRITICAL: Puppet has not run in the last 10 hours [07:28:59] PROBLEM - Puppet freshness on mw88 is CRITICAL: Puppet has not run in the last 10 hours [07:29:00] PROBLEM - 
Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [07:29:00] PROBLEM - Puppet freshness on search24 is CRITICAL: Puppet has not run in the last 10 hours [07:29:01] shut uuuuuuup [07:29:04] stupid puppet [07:29:13] I am wondering why it is broken on mw boxes [07:29:23] I merged a fix a few moments ago [07:29:44] MaxSem, paravoid, yeah, but it seems fairly big "RT" request - "please write a system to read our text file with IPv4 & v6 CIDR blocks mapping to a string ID, same way as geolocation country lookup? [07:29:52] PROBLEM - Puppet freshness on analytics1016 is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on mexia is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on mw1034 is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Puppet has not run in the last 10 hours [07:29:52] PROBLEM - Puppet freshness on mw1156 is CRITICAL: Puppet has not run in the last 10 hours [07:29:53] PROBLEM - Puppet freshness on mw1183 is CRITICAL: Puppet has not run in the last 10 hours [07:29:53] PROBLEM - Puppet freshness on mw123 is CRITICAL: Puppet has not run in the last 10 hours [07:29:54] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours [07:29:54] PROBLEM - Puppet freshness on mw72 is CRITICAL: Puppet has not run in the last 10 hours [07:29:55] PROBLEM - Puppet freshness on mw85 is CRITICAL: Puppet has not run in the last 10 hours [07:29:55] PROBLEM - Puppet freshness on mw97 is CRITICAL: Puppet has not run in the last 10 hours [07:29:56] PROBLEM - Puppet freshness on search26 is CRITICAL: Puppet has not run in the last 10 hours [07:29:56] PROBLEM - Puppet freshness 
on search32 is CRITICAL: Puppet has not run in the last 10 hours [07:29:57] PROBLEM - Puppet freshness on snapshot1004 is CRITICAL: Puppet has not run in the last 10 hours [07:29:57] PROBLEM - Puppet freshness on srv263 is CRITICAL: Puppet has not run in the last 10 hours [07:30:12] yurik: that is big? [07:30:30] yurik: we have RT requests that say "setup a new datacenter" or something [07:30:44] so, I'm not sure that a simple program like that can be considered a big request :) [07:30:51] compared to an RT ticket "please add yurik shell access" - yeah :) but yes, slightly smaller than a datacenter [07:30:52] how many of those? :P [07:30:52] PROBLEM - Puppet freshness on mw1002 is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on lardner is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on mw102 is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on mw1023 is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on mw1029 is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on mw1077 is CRITICAL: Puppet has not run in the last 10 hours [07:30:52] PROBLEM - Puppet freshness on mw1096 is CRITICAL: Puppet has not run in the last 10 hours [07:30:53] PROBLEM - Puppet freshness on mw1163 is CRITICAL: Puppet has not run in the last 10 hours [07:30:53] PROBLEM - Puppet freshness on mw1097 is CRITICAL: Puppet has not run in the last 10 hours [07:30:54] PROBLEM - Puppet freshness on mw1192 is CRITICAL: Puppet has not run in the last 10 hours [07:30:54] PROBLEM - Puppet freshness on mw1212 is CRITICAL: Puppet has not run in the last 10 hours [07:30:55] PROBLEM - Puppet freshness on mw27 is CRITICAL: Puppet has not run in the last 10 hours [07:30:55] PROBLEM - Puppet freshness on mw44 is CRITICAL: Puppet has not run in the last 10 hours [07:30:56] PROBLEM - Puppet freshness on mw6 is 
CRITICAL: Puppet has not run in the last 10 hours [07:30:56] PROBLEM - Puppet freshness on mw89 is CRITICAL: Puppet has not run in the last 10 hours [07:30:57] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours [07:30:57] PROBLEM - Puppet freshness on search1017 is CRITICAL: Puppet has not run in the last 10 hours [07:30:58] PROBLEM - Puppet freshness on srv241 is CRITICAL: Puppet has not run in the last 10 hours [07:30:58] PROBLEM - Puppet freshness on srv279 is CRITICAL: Puppet has not run in the last 10 hours [07:30:59] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours [07:30:59] PROBLEM - Puppet freshness on wtp1 is CRITICAL: Puppet has not run in the last 10 hours [07:31:00] PROBLEM - Puppet freshness on srv288 is CRITICAL: Puppet has not run in the last 10 hours [07:31:25] MaxSem, i suspect they have 3 or 4 RT tickets like that... it's been backordered for the past 5 years ;) [07:31:52] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] PROBLEM - Puppet freshness on mw1005 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] PROBLEM - Puppet freshness on mw1067 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] PROBLEM - Puppet freshness on mw1152 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] PROBLEM - Puppet freshness on mw1209 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] PROBLEM - Puppet freshness on mw21 is CRITICAL: Puppet has not run in the last 10 hours [07:31:52] hashar: this https://gerrit.wikimedia.org/r/#/c/57199/ broke puppet [07:31:57] hashar: but jenkins didn't catch it :) [07:32:07] are there no alerts for failed puppet runs instead of when it got really old? [07:32:19] MaxSem: ?
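The RT request yurik describes above — a text file of IPv4/IPv6 CIDR blocks mapped to a string ID, queried the same way as a GeoIP country lookup — amounts to a longest-prefix match. A minimal sketch using Python's stdlib `ipaddress` module; the file format, block IDs, and example networks here are invented for illustration:

```python
# Sketch of a CIDR-block -> string-ID lookup (GeoIP-style).
# Lines are "CIDR<whitespace>ID"; most-specific prefix wins.
import ipaddress

def load_blocks(lines):
    """Parse 'CIDR ID' lines into (network, id) pairs, most-specific first."""
    blocks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        cidr, block_id = line.split()
        blocks.append((ipaddress.ip_network(cidr), block_id))
    # Longest prefix should win, so sort by prefix length, descending.
    blocks.sort(key=lambda p: p[0].prefixlen, reverse=True)
    return blocks

def lookup(blocks, ip):
    """Return the ID of the most specific block containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for net, block_id in blocks:
        if addr.version == net.version and addr in net:
            return block_id
    return None

# invented example data
blocks = load_blocks([
    '10.64.0.0/16    eqiad',
    '10.64.16.0/24   eqiad-row-b',
    '2620:0:861::/48 eqiad-v6',
])
```

The sort-by-prefix-length trick keeps the lookup correct when blocks nest; a production version for high request volume would use a trie instead of a linear scan.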
[07:32:52] PROBLEM - Puppet freshness on kaulen is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw1006 is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw1012 is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw1016 is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw1193 is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours [07:32:52] PROBLEM - Puppet freshness on mw19 is CRITICAL: Puppet has not run in the last 10 hours [07:32:53] PROBLEM - Puppet freshness on mw1028 is CRITICAL: Puppet has not run in the last 10 hours [07:32:53] PROBLEM - Puppet freshness on mw1071 is CRITICAL: Puppet has not run in the last 10 hours [07:32:54] PROBLEM - Puppet freshness on mw24 is CRITICAL: Puppet has not run in the last 10 hours [07:32:54] PROBLEM - Puppet freshness on mw22 is CRITICAL: Puppet has not run in the last 10 hours [07:32:55] PROBLEM - Puppet freshness on search1020 is CRITICAL: Puppet has not run in the last 10 hours [07:32:55] PROBLEM - Puppet freshness on mw77 is CRITICAL: Puppet has not run in the last 10 hours [07:32:56] PROBLEM - Puppet freshness on mw73 is CRITICAL: Puppet has not run in the last 10 hours [07:32:56] PROBLEM - Puppet freshness on mw31 is CRITICAL: Puppet has not run in the last 10 hours [07:32:57] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [07:32:57] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours [07:32:58] PROBLEM - Puppet freshness on srv236 is CRITICAL: Puppet has not run in the last 10 hours [07:32:58] PROBLEM - Puppet freshness on srv256 is CRITICAL: Puppet has not run in the last 10 hours [07:32:59] PROBLEM - Puppet freshness on srv265 is CRITICAL: Puppet has not run in the 
last 10 hours [07:32:59] PROBLEM - Puppet freshness on srv248 is CRITICAL: Puppet has not run in the last 10 hours [07:33:00] PROBLEM - Puppet freshness on snapshot2 is CRITICAL: Puppet has not run in the last 10 hours [07:33:00] PROBLEM - Puppet freshness on mw1064 is CRITICAL: Puppet has not run in the last 10 hours [07:33:01] PROBLEM - Puppet freshness on srv277 is CRITICAL: Puppet has not run in the last 10 hours [07:33:14] why doesn't it say "failed puppet run on mw666: error message here"? [07:33:15] paravoid, should it be in "ops-requests" ? [07:33:28] paravoid: looking [07:33:52] PROBLEM - Puppet freshness on mw108 is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on analytics1025 is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on mw101 is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on analytics1005 is CRITICAL: Puppet has not run in the last 10 hours [07:33:52] PROBLEM - Puppet freshness on mw54 is CRITICAL: Puppet has not run in the last 10 hours [07:33:53] PROBLEM - Puppet freshness on mw29 is CRITICAL: Puppet has not run in the last 10 hours [07:33:53] PROBLEM - Puppet freshness on mw1113 is CRITICAL: Puppet has not run in the last 10 hours [07:33:54] PROBLEM - Puppet freshness on mw1207 is CRITICAL: Puppet has not run in the last 10 hours [07:33:54] PROBLEM - Puppet freshness on mw1160 is CRITICAL: Puppet has not run in the last 10 hours [07:33:55] PROBLEM - Puppet freshness on mw1037 is CRITICAL: Puppet has not run in the last 10 hours [07:33:55] PROBLEM - Puppet freshness on mw105 is CRITICAL: Puppet has not run in the last 10 hours [07:33:56] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [07:33:56] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last
10 hours [07:33:57] PROBLEM - Puppet freshness on mw94 is CRITICAL: Puppet has not run in the last 10 hours [07:33:57] PROBLEM - Puppet freshness on xenon is CRITICAL: Puppet has not run in the last 10 hours [07:33:58] PROBLEM - Puppet freshness on srv295 is CRITICAL: Puppet has not run in the last 10 hours [07:33:58] PROBLEM - Puppet freshness on snapshot1003 is CRITICAL: Puppet has not run in the last 10 hours [07:33:59] PROBLEM - Puppet freshness on mw1187 is CRITICAL: Puppet has not run in the last 10 hours [07:33:59] PROBLEM - Puppet freshness on mw103 is CRITICAL: Puppet has not run in the last 10 hours [07:34:43] yurik: yeah, a simple mail should suffice [07:34:55] paravoid: that is because Jenkins only runs "puppet parser validate" that does not really do anything :-( [07:34:57] i'm actually filling out a new ticket [07:35:05] paravoid: I think I understand MaxSem's point, and it's a good one. The problem is not "Puppet freshness". Things aren't stale or neglected or old. Puppet is industriously running on each of those hosts but just barfing when it encounters the bad change [07:35:06] you can file tickets via mail [07:35:09] but rt.wm.org works :) [07:35:16] paravoid: to catch that kind of error (i.e. a fact being passed a wrong parameter), we need unit tests :D [07:35:44] ori-l, MaxSem: you're right, but we don't have the puppet report service set up or a nagios check to check this [07:35:57] so I asked why :) [07:36:20] plus, puppet's error messages are so cryptic [07:36:31] that we'd still want to log in and see what's going on [07:36:43] but yes, you're right, this could use some enhancement [07:37:22] don't we have the errors reported in syslog too?
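Ori-l's point above — that "freshness" alerting only sees the age of the last run, while here puppet is running constantly and failing — can be sketched as two checks over a simplified last-run summary. Field names are a simplification of puppet's `last_run_summary.yaml` (they are assumptions, not its exact schema); the 10-hour threshold comes from the alerts in this log:

```python
# Sketch: "freshness" check vs. a status-aware check.
# A stale-only check says OK when puppet ran a minute ago and failed;
# checking the failure count as well catches exactly this incident.
import time

FRESHNESS_THRESHOLD = 10 * 3600  # seconds, matching the 10-hour alerts above

def freshness_alert(summary, now):
    """Old-style check: CRITICAL only when the last run is too old."""
    age = now - summary['last_run']
    return 'CRITICAL' if age > FRESHNESS_THRESHOLD else 'OK'

def status_alert(summary, now):
    """More precise check: also CRITICAL when the last run failed."""
    if summary['failed'] > 0:
        return 'CRITICAL: last puppet run had %d failures' % summary['failed']
    return freshness_alert(summary, now)

now = time.time()
failing = {'last_run': now - 60, 'failed': 1}         # ran a minute ago, barfed
stale = {'last_run': now - 11 * 3600, 'failed': 0}    # hasn't run in 11 hours
```

This is the gap ori-l's "Make icinga alert re: puppet client more precise" change in this log is aimed at; the code here is only an illustration of the distinction, not that patch.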
[07:38:38] we do :-] [07:38:49] Apr 3 06:02:33 10.0.11.103 puppet-agent[11334]: Failed to apply catalog: Parameter key failed: Key must not contain whitespace: ssh-rsa AAA [07:39:50] ori-l: thanks for the fix btw :) [07:40:11] np [07:51:52] New patchset: Ori.livneh; "Make icinga alert re: puppet client more precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57262 [07:53:13] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57262 [08:05:09] New patchset: Hashar; "Some clean up work to help get the package into Debian" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/57263 [08:08:09] RECOVERY - Puppet freshness on mw13 is OK: puppet ran at Wed Apr 3 08:08:00 UTC 2013 [08:08:09] RECOVERY - Puppet freshness on mw74 is OK: puppet ran at Wed Apr 3 08:08:01 UTC 2013 [08:08:09] RECOVERY - Puppet freshness on mw1105 is OK: puppet ran at Wed Apr 3 08:08:01 UTC 2013 [08:08:09] RECOVERY - Puppet freshness on mw1084 is OK: puppet ran at Wed Apr 3 08:08:01 UTC 2013 [08:08:09] RECOVERY - Puppet freshness on analytics1009 is OK: puppet ran at Wed Apr 3 08:08:01 UTC 2013 [08:08:09] RECOVERY - Puppet freshness on grosley is OK: puppet ran at Wed Apr 3 08:08:01 UTC 2013 [08:20:55] New patchset: Hashar; "Some clean up work to help get the package into Debian" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/57263 [08:21:11] New review: Hashar; "Fix tab in debian/changelog" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/57263 [08:23:09] RECOVERY - Puppet freshness on mw1145 is OK: puppet ran at Wed Apr 3 08:23:06 UTC 2013 [09:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [10:13:53] New patchset: Yurik; "(RT 4835) Apparently api logs were 
moved to emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57271 [10:14:34] New patchset: Yurik; "(RT 4835) Add non-sudo yurik to emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57271 [10:31:06] mark, after a number of failed attempts to get varnish up through the puppets (it has been showing very strange messages), paravoid suggested that the whole issue be deferred to the ops team, since you are much better equipped to solve it :) I posted the ticket https://rt.wikimedia.org/Ticket/Display.html?id=4881 describing the needed functionality in depth. This way we can concentrate on... [10:31:07] ...the zero extension, removing redirects, proper landing page, etc [10:31:27] ok [10:32:10] when you have a moment, please comment your thoughts. No rush obviously :) [10:43:26] !log Deactivated AS6908 peering on cr2-knams [10:43:34] Logged the message, Master [10:57:32] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [11:00:32] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [11:27:39] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [11:31:39] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [11:35:39] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [11:44:12] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [11:44:12] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [11:44:12] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [11:45:38] New patchset: Faidon; "pybal: sort monitor list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57282 [11:46:07] Change merged: 
Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57282 [12:03:32] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [12:08:26] New review: Diederik; "I would rather setup rsync of the API logs to stat1 than hand out access to emery, the machine is ve..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/57271 [12:15:39] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [12:18:29] PROBLEM - search indices - check lucene status page on search1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 62805 bytes in 0.016 second response time [12:18:39] PROBLEM - Varnish HTCP daemon on cp3003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [12:19:19] PROBLEM - SSH on cp3003 is CRITICAL: Connection refused [12:19:29] PROBLEM - Varnish HTTP upload-backend on cp3003 is CRITICAL: Connection refused [12:19:38] !log Rebooting cp3003 for RAID reconfiguration [12:19:46] Logged the message, Master [12:24:29] PROBLEM - Varnish HTTP upload-frontend on cp3003 is CRITICAL: Connection timed out [12:26:19] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [12:28:59] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [12:29:19] RECOVERY - SSH on cp3003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [12:29:29] RECOVERY - Varnish HTTP upload-backend on cp3003 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.164 second response time [12:29:29] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 82.96 ms [12:29:39] RECOVERY - Varnish HTCP daemon on cp3003 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [12:30:19] RECOVERY - Varnish HTTP upload-frontend on cp3003 is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.165 second response time [12:32:39] 
RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [12:48:36] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [12:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [12:57:20] New patchset: Reedy; "Remove wgUseMemCached, died in 1.17" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57206 [12:57:40] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57206 [12:57:58] New patchset: Reedy; "(bug 46489) Set wmgBabelCategoryNames for Ukrainian Wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56420 [12:58:04] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56420 [12:58:22] New patchset: Reedy; "(bug 46154) Override $wgGroupPermissions for thwiki Add abusefilter-log-detail and patrol for autoconfirmed on thwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56564 [12:58:29] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56564 [12:59:43] New patchset: Reedy; "(bug 45643) Add new user groups to urwiki with specific rights Add abusefilter and rollbacker user groups, modify $wgAddGroups for crats and sysops, modify $wgRemoveGroups for crats" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56578 [12:59:48] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56578 [13:02:36] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [13:10:25] !log reedy synchronized wmf-config/ [13:10:32] Logged the message, Master [13:11:08] Reedy: Can't connect 
to MySQL server on '10.64.16.158' (4)) on dewiki [13:11:19] uhh [13:11:47] pc1003 [13:12:45] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=MySQL+eqiad&h=pc1003.eqiad.wmnet&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [13:12:59] it's in trouble [13:13:13] i'll stop mysql [13:13:46] Do we need to remove it from the MW config too? [13:13:53] i'll start it in a bit, so not yet [13:14:04] alright [13:14:35] starting [13:14:50] back up [13:15:35] !log Stopped and started MySQL on pc1003 after finding mysql deadlocked [13:15:41] Logged the message, Master [13:18:01] New patchset: Reedy; "Document parser cache IPs in db files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57287 [13:19:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57287 [13:28:36] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [13:32:36] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [13:40:48] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: Timeout while attempting connection [13:48:40] New patchset: Mark Bergsma; "Disable Tomasz's account" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57294 [13:48:41] New patchset: Mark Bergsma; "Attempt to keep 20% SSD space free" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57295 [13:49:29] oops [13:49:57] Change abandoned: Mark Bergsma; "Previously done by Ariel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57294 [13:51:18] New patchset: Mark Bergsma; "Attempt to keep 20% SSD space free" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57295 [13:53:13] New patchset: Mark Bergsma; "Attempt to keep 20% SSD space free" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57295 [13:53:55] Change merged: Mark Bergsma; [operations/puppet]
(production) - https://gerrit.wikimedia.org/r/57295 [14:02:38] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [14:07:38] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:09:08] PROBLEM - Puppet freshness on db1053 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:36] New patchset: Hashar; "contint: install ruby1.9.3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57296 [14:11:21] hey mark, just wondering what's the status for "ACL analytics subnets from the rest of the network" (https://rt.wikimedia.org/Ticket/Display.html?id=4433), is that finished? [14:11:28] no [14:12:08] are you waiting for any information from our side? [14:12:15] no, i'm waiting on time ;) [14:12:26] ok got it :) [14:12:29] thx [14:13:43] root: I am in need of ruby1.9.3 on gallium to syntax check the ruby 1.9 scripts. I have added a package to contint module https://gerrit.wikimedia.org/r/57296 [14:13:43] would anyone please approve the tiny change? Thx! [14:13:43] hashar: Please stop swearing [14:14:27] you've promised to split packages.pp :) [14:15:00] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57296 [14:15:35] hashar: everything ok with jsduck btw? [14:15:36] or is this Krinkle's domain? [14:16:43] paravoid: ah true [14:17:28] paravoid: yeah that is Krinkle :-] I am not aware of any specific issue though I haven't followed that subject closely [14:17:55] paravoid: and yeah I need to split packages.pp . Maybe I should use subclasses [14:21:53] !log cmjohnson synchronized wmf-config/db-eqiad.php 'setting weight on db1028 to 400' [14:22:00] Logged the message, Master [14:22:56] paravoid: thank you!!! 
We can now lint the qa/browsertests.git ruby scripts :-] [14:25:05] New patchset: Ottomata; "Rsyncing API logs from emery to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57298 [14:25:17] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57298 [14:33:38] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [14:34:32] New patchset: Mark Bergsma; "Make the empty partition a primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57301 [14:35:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57301 [14:35:06] yurik: https://gerrit.wikimedia.org/r/57298 should allow you to analyze the api logs on stat1, maybe you can abandon https://gerrit.wikimedia.org/r/#/c/57271/ ? [14:37:38] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:40:42] Hm. Dell's PXE apparently doesn't support console redirection. How... useful. [14:41:02] what do you mean? [14:41:41] Console redirection (from the DRAC) works perfectly up to the point where PXE starts, then... no output. [14:42:03] at that point grub should output that [14:42:09] Actually, that's probably the NIC's firmware's bug, not Dell's. [14:42:21] ah PXE itself you mean [14:42:26] mark: provided PXE /worked/ and that's not what you're trying to debug. :-) [14:42:35] you're getting confused [14:42:36] it's possible that it would work with "redirection after boot" [14:42:41] drac's console redirection works fine [14:42:44] but that needs to be disabled for grub output to work [14:43:45] New patchset: Hashar; "gerrit-wm now sends translatewiki notif to #mediawiki-i18n" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57302 [14:43:49] At any rate, I suspect it's just a "network not actually wired" problem given that I don't even see DHCP requests at all from the box.
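The check described above and repeated later in the log — watching the DHCP server (brewster here) for DHCPDISCOVER packets from specific MAC addresses — boils down to filtering dhcpd's syslog lines. A small sketch; the sample lines follow ISC dhcpd's usual syslog format, and the MAC addresses and lease details are invented:

```python
# Filter ISC dhcpd syslog lines for DHCPDISCOVERs from specific MACs,
# the kind of check done on brewster in this log. Sample data is invented.
import re

DISCOVER_RE = re.compile(r'DHCPDISCOVER from ([0-9a-f:]{17})', re.IGNORECASE)

def discovers_from(log_lines, macs):
    """Return the subset of `macs` actually seen sending DHCPDISCOVER."""
    wanted = {m.lower() for m in macs}
    seen = set()
    for line in log_lines:
        m = DISCOVER_RE.search(line)
        if m and m.group(1).lower() in wanted:
            seen.add(m.group(1).lower())
    return seen

# invented sample syslog lines
log = [
    'Apr  3 15:46:01 brewster dhcpd: DHCPDISCOVER from 00:1e:c9:aa:bb:cc via eth0',
    'Apr  3 15:46:02 brewster dhcpd: DHCPOFFER on 10.0.0.50 to 00:1e:c9:aa:bb:cc via eth0',
]
```

An empty result for the target MACs is what distinguishes "box never got on the wire" (cabling, vlan, NIC firmware) from a server-side dhcpd/dns misconfiguration, which is exactly the fork the conversation takes.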
[14:44:02] console redirection is interception of int10h and int16h to take text video output and write it to the serial port [14:44:11] this is suboptimal to do for grub though [14:44:31] grub can write to the serial port, which is way better [14:44:43] PXE loading itself however... [14:44:46] (grub2 can write menus to both vga and serial, way better than grub 1) [14:45:09] if, however, the interrupt handler tries to write to serial and grub also tries to write to serial at the same time [14:45:11] paravoid: I'm talking about PXE itself, not about what happens once I got a bootloader in. :-) [14:45:13] you get garbled output at best [14:49:24] i hate partman-auto [14:55:18] New patchset: Mark Bergsma; "Add missing period" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57303 [14:56:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57303 [15:02:11] New review: Diederik; "See https://gerrit.wikimedia.org/r/57298 for the rsync of api logs to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57271 [15:02:34] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [15:11:13] New patchset: Mark Bergsma; "Adjust partition priorities" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57306 [15:11:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57306 [15:27:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:19] New patchset: Mark Bergsma; "Adjust priorities/partition sizes, suppress prompt" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57308 [15:28:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [15:30:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57308 
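Mark's point above — grub2 can write its menu to both VGA and serial directly, which beats int10h/int16h interception — corresponds to a configuration like the following sketch. The unit number, baud rate, and file path are assumed values, not taken from the hosts being debugged here:

```shell
# /etc/default/grub -- sketch only; unit/speed/path are assumptions
GRUB_TERMINAL="console serial"                        # menu on both VGA and serial
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"  # grub's own serial setup
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"  # kernel console too
# then regenerate grub.cfg (e.g. update-grub on Debian/Ubuntu)
```

With this, grub draws on both outputs at once, avoiding the garbled text you get when BIOS redirection and grub both write to the serial port, as described above.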
[15:30:21] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:32:21] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [15:36:32] who can look at deb packages that would need to go to apt.wikimedia.org? created rt 4868 with links to a new version of libvpx to fix some video transcoding issues [15:36:55] New patchset: Lcarr; "hopefully adding bond-master bond0 to all of the sub-interfaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57310 [15:37:05] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:39:04] j^: whoever is on rt duty - but it looks like you just submitted the ticket so be patient [15:40:31] LeslieCarr: cool, just wanted to make sure it gets seen. [15:45:00] coren: looking at labstore1001 now (still no pxe) [15:45:54] cmjohnson1: You see the PXE fail on the physical console, I take it. Want me to monitor DHCP activity while you try? [15:46:22] sure give me a few mins to look at it first [15:46:57] kk [15:59:07] coren: still not sure what the deal is...i checked dhcpd, dns, network and all appears normal. The fact that it is not even making it out to brewster is troubling me. oh, and i tried a new cable. [15:59:17] have you tried new pxe labstore1002 yet? [15:59:40] cmjohnson1: I tried an hour ago or so. Want me to give it another whack? [15:59:56] so labstore1002 is not pxe booting either? [16:00:06] And I can confirm I see no DHCPDISCOVER from those mac addresses. [16:00:11] cmjohnson1: None of the four are. [16:00:44] okay...that eliminates a h/w issue [16:00:58] unlikely that all 4 would have a h/w problem [16:00:59] That or we are REALLY unlucky. :-) [16:01:07] New patchset: Ottomata; "Move geoip to a module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714 [16:01:39] Hm. Actually, I lied. I didn't try 1004.
Want me to, just in case? [16:01:54] yes please...thx [16:02:09] j^: that needs to go into git [16:02:16] New review: Ottomata; ">What's the purpose of the GeoIP/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53714 [16:02:19] (gerrit actually) [16:02:52] and it helps if i get the letters in the right order... [16:03:05] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [16:03:34] jeremyb_: what module? [16:03:47] j^: gets its own new repo [16:04:12] cmjohnson1: firing it up [16:04:26] jeremyb_: ok that's new compared to the last deb i made, is there some wiki page with the workflow? [16:04:33] j^: which is the last? [16:04:51] jeremyb_: ffmpeg2theora [16:05:09] j^: https://wikitech.wikimedia.org/wiki/Git-buildpackage#Pushing_changes_into_Gerrit maybe? [16:05:58] cmjohnson1: Hm. Incidentally, I only see one shelf on 1004 [16:06:45] cmjohnson1: And no PXE joy; same deal as the others; no DHCP activity on brewster. [16:08:51] ahhh, tiago :) [16:08:59] New patchset: Hashar; "package-builder learned 'cowbuilder'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382 [16:10:18] cmjohnson1: afk for a few, getting lunch [16:10:29] coren: okay [16:12:47] New review: Hashar; "I am not sure who is using that class, but I really need cowbuilder so I have tweaked it to support ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56382 [16:14:06] gah, what is it with me and spelling today [16:14:30] first vxp/vpx. now mpeg / pmeg [16:20:00] cmjohnson1: back [16:20:16] LeslieCarr, is this something you could respond to? https://rt.wikimedia.org/Ticket/Display.html?id=4875 [16:21:01] coren: think i figured out the problem...not set up in a vlan on the network [16:21:14] cmjohnson1: Yeah, I saw. :-) [16:23:04] andrewbogott: i thought about giving that one to leslie. and then i thought whatever the answer is yurik's not going to like it.
:-) [16:23:57] there's been talk of cutting LVS IPs with the new unified cert... [16:24:08] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:24:12] New patchset: Mark Bergsma; "Add cp3005/3006 to Puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57313 [16:24:53] New review: Hashar; "The original import was completely wrong and based on another tarball :-]" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/56602 [16:25:40] New review: Hashar; "So lets land it on apt.wikimedia.org ? :-]" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069 [16:26:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57313 [16:27:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.715 second response time [16:32:08] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [16:32:53] New review: Siebrand; "I support the specification. I cannot assess the implementation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57302 [16:33:45] paravoid: I still have the hack in place on gallium that overrides the ext-js location, let me check [16:33:49] paravoid: Nope, still broken. [16:33:53] https://doc.wikimedia.org/mediawiki-core/master/js/extjs/ext-all.js [16:33:59] 3.0.3 instead of 4.1 [16:34:14] !log authdns update [16:34:20] Hm.. Maybe... hold on [16:34:21] Logged the message, Master [16:34:26] https://doc.wikimedia.org/VisualEditor/master/extjs/ext-all.js [16:34:31] paravoid: Perfect :) [16:35:25] paravoid: jsduck is fully operational now without hacks [16:35:42] er? 
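Earlier in the log, Diederik's review of change 57271 and ottomata's merged change 57298 ("Rsyncing API logs from emery to stat1") replaced direct shell access on emery with a periodic rsync. A hedged sketch of the kind of puppet resource such a change could contain — the hosts and destination path appear in this log, but the module source path, user, and schedule are assumptions, not the contents of the actual patch:

```puppet
# Hypothetical shape of an "rsync api logs from emery to stat1" job.
# emery, stat1 and /a/squid/archive/api come from the log; everything
# else here is an assumption for illustration.
cron { 'rsync_api_logs_to_stat1':
  command => '/usr/bin/rsync -rt emery.wikimedia.org::squid/archive/api/ /a/squid/archive/api/',
  user    => 'root',
  hour    => 5,
  minute  => 0,
}
```

Pulling logs to the analysis host keeps the sensitive collection box closed while still giving analysts the data, which is the trade-off Diederik's -1 review argued for.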
[15:35:45] but I fixed the package [15:35:48] and upgraded [15:36:45] PROBLEM - Host cp3005 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:55] RECOVERY - Host cp3005 is UP: PING OK - Packet loss = 0%, RTA = 89.38 ms [15:39:06] late comment. [15:40:02] New patchset: Mark Bergsma; "Add cp3005 to the pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57315 [16:40:18] cmjohnson1: Give me a ping when the vlan is set? [16:40:24] k [16:42:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57315 [16:49:05] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:49:56] coren: let's give labstore1001 a go [16:50:41] cmjohnson1: Firing it up [16:53:11] looks to be failing still [16:53:16] cmjohnson1: No joy. :-( [16:53:28] kk [16:53:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57259 [16:54:08] !log reedy synchronized php-1.21wmf12/includes/Collation.php [16:54:09] Hm. Stupid question; could the nics be swapped? [16:54:19] Logged the message, Master [16:55:06] !summon thehelpfulone [16:55:13] very doubtful [16:57:18] 13:15 mark: Stopped and started MySQL on pc1003 after finding mysql deadlocked [16:57:21] speak of the devil, heh [16:57:57] coren: see security but I think the vlan change did not take [16:58:37] New patchset: Nemo bis; "Add ganglia graph for global jobqueue length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [16:58:38] robh: the interfaces were in default vlan [16:58:57] ge-2/0/0 to ge-2/0/1 and ge-3/0/0 to ge-3/0/1 [16:59:02] ahhh [16:59:05] default vlan is fine [16:59:16] i had an issue where the port was in a range of ports put into a vlan [16:59:25] and juniper os is too stupid to just pull out the single port [16:59:29] have to redo the ranges =P [16:59:32] mutante: hey. want to do dns for 4355 and i'll do apache in gerrit?
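The juniper range problem robh describes above comes down to how Junos groups ports: an interface-range is only a named grouping, and member ports stay in the default vlan until the range itself carries switching config. A hypothetical sketch — the range and vlan names are taken from this conversation, but the exact syntax is from general Junos EX conventions, not from the actual switch config:

```
# Hypothetical Junos sketch: declaring the range alone does nothing;
# the ports move vlans only once the range carries config like this.
interfaces {
    interface-range labs-host1 {
        member-range ge-2/0/0 to ge-2/0/1;
        member-range ge-3/0/0 to ge-3/0/1;
        unit 0 {
            family ethernet-switching {
                port-mode access;
                vlan {
                    members labs-host1-c-eqiad;
                }
            }
        }
    }
}
```

This also matches mark's later diagnosis that creating the interface range by itself "does nothing", and robh's complaint that pulling a single port back out means redoing the ranges.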
[16:59:47] cmjohnson1: want me to give it a shot and see what i see? [16:59:47] mutante: why did you need THO? [16:59:54] please [17:00:03] ok, checking it out now [17:00:13] they need to go to labs-host1 [17:00:42] mutante: (just a symlink to wikipedia.com i assume) [17:00:48] cmjohnson1: trying to put into what vlan? [17:00:56] labs instaces1-c-eqiad? [17:01:04] (spelling is off but that one?) [17:01:04] i want them in labs-host1-c-eqiad [17:01:10] ok [17:01:13] thx [17:01:38] jeremyb_: yes, but later today, need to finish what i am on [17:01:54] jeremyb_: i wanted him for a discussion about (private) mailing lists [17:01:57] bbiaw [17:02:14] mutante: ahh. in that case see 4880 :) see you later [17:03:03] hrmm, show | compare shows my change, commit.. no errors, and yea..... it's still in default [17:03:05] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [17:03:11] cmjohnson1: wtf [17:03:13] =P [17:03:16] yep... [17:03:19] no idea [17:03:28] damn it, this is gonna annoy me until i know. [17:03:51] it may be some setup thing has to happen to that vlan [17:03:56] cuz it has no actual ports assigned to it yet [17:04:02] so it may not be fully deployed on the stack properly [17:04:07] are these the first labs hosts in row C? :) [17:04:09] yep [17:04:16] yeah that's likely not fully setup [17:04:29] cmjohnson1: So, sounds like you get to make a ticket in networking [17:04:38] unless mark wants to fix it up now ;] [17:04:43] in fact [17:04:50] chris just created the interface range, which does nothing [17:05:28] cmjohnson1: So it just stumbled from shit you and i can do to shit we need our network admins to do ;] [17:06:58] jeremyb_, not sure what you meant :) [17:08:40] yurik: you're potentially going to need to split IPs in 2 (eventually?)
and other people are already working on combining IPs (the reverse of you) [17:09:01] true that [17:09:18] its just that telcos have much harder time filtering by URL [17:09:31] esp for the large volume site like ours [17:10:08] mark: what needs to be done for it to work? [17:10:45] yurik: we change IP addresses regularly, and it's part of our failover setup as well [17:10:46] yurik: so we have some images that are over 1GB I think. or at least a whole lot of them at 100MB. what's to stop people from downloading those instead of videos? (if videos are blocked). just assuming there won't be demand? [17:10:55] so external people explicitly using our IP addresses is not supported [17:11:04] it will break often [17:11:39] mark: what about by rdns and pattern matching on domain name? [17:11:58] who does that? [17:12:07] i'm saying they could do that [17:12:16] instead of hardcoding [17:12:33] !log aaron synchronized php-1.21wmf12/includes/objectcache/SqlBagOStuff.php 'deployed 61587acc64cb62400ff7978271c54e8bd8b1f02d' [17:12:34] sounds pretty horrible to me [17:12:40] jeremyb_, in reality some of them don't even want the images :) [17:12:41] Logged the message, Master [17:13:11] at least some old ones we signed up had no image settings [17:13:13] yurik: right, i got that [17:14:12] !log aaron synchronized php-1.22wmf1/includes/objectcache/SqlBagOStuff.php 'deployed b61053ca4b554f6bd18fb6408967839cdb5ccde2' [17:14:20] Logged the message, Master [17:17:22] ottomata, are the api logs now properly syncing to stat1? [17:18:05] yup! [17:18:14] cool, thx :) [17:18:19] /a/squid/archive/api [17:18:27] do you know why they didn't? [17:18:31] they were never set to [17:18:35] just out of curiosity [17:18:36] ah, ok [17:18:51] thx for fixing it! [17:19:57] Change abandoned: Yurik; "was rsynced, thanks ottomata!"
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57271 [17:26:05] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [17:26:47] mutante: can you figure out which of my recipients was / wasn't on the list? [17:31:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [17:34:26] jeremyb_: see 4880 [17:35:10] mutante: i meant in the private file [17:35:31] as it is / was. not as it will be made. :) [17:36:14] jeremyb_ andrewbogott : yeah, i think the answer was unsatisfactory [17:36:16] well to them [17:36:17] fine to me [17:36:19] weird that they don't have the same values in config for those options :) [17:36:25] jeremyb_: re 4355: softwarewikipedia.net is a link to mediawiki.org not to wikipedia.com [17:37:05] mutante: oooooh. interesting. i was going to do wikipedia.org but i guess mediawiki.org is relevant too [17:37:11] jeremyb_: yes, that is the constant problem, but it's decentralized, so what to do... [17:37:24] re: the mailing lists [17:37:29] yeah [17:37:34] federate! [17:37:42] it is centralized now [17:37:43] THO contacts list admins :p [17:37:49] we did that before [17:38:13] LeslieCarr: Thanks for responding, in any case.
It's not unreasonable that we use dns :) [17:38:14] or take the debian approach: have only central admins as list admins and individual lists only have the moderator passwd [17:39:24] will result in dozens and dozens of tickets for the central admins [17:39:37] maybe if you hire a full time person for it, heh [17:40:24] yeah, i don't know much about how it works for them [17:40:31] but they are adamant about it [17:41:02] it doesn't make it easier that the policies are challenged all the time [17:41:24] re: advertised = 0 [17:41:38] and that mailman likes to use 0/1 and True for some reason :p [17:41:55] but never False? :) [17:42:05] couldnt find one at least :p [17:43:00] btw. checking values for all lists works like: for list in $(./list_lists -b); do echo -n ${list}\|; ./config_list -o - ${list} |grep "something" ; done | tee somefile.log [17:43:57] so it would be really a lot easier if "something" would be ONE thing that determines private or not [17:44:08] and not a combination of several [17:44:20] let me take a look at that puppet code again, brb [18:07:25] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: private, closed and fishbowl to 1.22wmf1 [18:07:32] Logged the message, Master [18:09:43] heh [18:14:58] New patchset: Ryan Lane; "Only use service groups and users for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57324 [18:16:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57324 [18:16:39] New patchset: Andrew Bogott; "Do a full MW clone instead of a shallow one." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57325 [18:21:23] who should i check with to verify if a particular Varnish ACL is deployed to production? dfoy was wondering if a merged ACL was actually pushed out yet. 
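mutante's one-liner above can be wrapped up so the per-list audit is reusable. The Mailman `bin/` tool names (`list_lists -b`, `config_list -o -`) are straight from the log; the function name and the idea of passing the grep pattern as an argument are additions, so treat this as a sketch.

```shell
# Reusable version of the audit loop from the log: for every list, print
# "listname|<matching config lines>" for one config key.
# Assumes it runs from Mailman's bin/ directory, as the original did.
audit_lists() {
    pattern="$1"    # e.g. 'advertised' or 'archive_private'
    for list in $(./list_lists -b); do
        printf '%s|' "$list"
        # -o - dumps the list's config to stdout instead of a file
        ./config_list -o - "$list" | grep -- "$pattern"
    done
}

# audit_lists 'advertised' | tee advertised.log
```

As mutante points out, this only settles the question cleanly if one key actually determines whether a list is private; with a combination of several options you still have to cross-check by hand.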
[18:23:43] dr0ptp4kt, if it's merged, it will be deployed by puppet in 30 minutes [18:26:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special, wikimania and wikimedia wikis to 1.22wmf1 [18:26:46] Logged the message, Master [18:27:56] thanks, MaxSem. due to the strange hours for having carriers validate this stuff, if we want to check ahead of time during normal PST hours, what's the best way to validate? is there anyone with shell access who can check quickly? we think in due time we can re-arch some of the payload data to make it so we could tell on our own, but until then, what do you recommend for having someone tell us what's hot? [18:30:50] hey ops^^^:) [18:31:03] dr0ptp4kt: lava [18:31:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiversity and wikivoyage to 1.22wmf1 [18:31:10] Logged the message, Master [18:31:25] hrm, i guess i can check -- or just force a puppet run ? [18:32:26] a live check would be best in this case. can you send me the hot config dump? [18:32:44] ^ LeslieCarr [18:33:20] yeah [18:33:33] LeslieCarr, thx [18:33:39] i owe you TWO now. [18:33:47] s/TWO/THREE/ [18:34:16] New patchset: Demon; "Set up weekly jgit gc operations for all repositories" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57327 [18:34:56] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikinews and wikisource to 1.22wmf1 [18:35:05] Logged the message, Master [18:37:20] New review: Demon; "Probably want to hold off another day or two...just in case. But yeah, this will be nice." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57327 [18:39:39] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wiktionary to 1.22wmf1 [18:39:46] Logged the message, Master [18:40:02] thx again, LeslieCarr! 
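For the "is this ACL actually live?" question above, the quickest check on a cache host is to look for the acl block in the VCL that is actually deployed. A minimal sketch: the helper name and the file path in the usage comment are made up, and `varnishadm vcl.list` / `vcl.show <name>` is the other obvious route if you have access to the admin socket.

```shell
# Print the "acl <name> { ... }" block from a VCL file, so you can eyeball
# whether a merged change has actually reached the cache host.
show_acl() {
    name="$1"; file="$2"
    awk -v n="$name" '
        $1 == "acl" && $2 == n { inblock = 1 }
        inblock                { print }
        inblock && /}/         { exit }
    ' "$file"
}

# show_acl carrier_ips /etc/varnish/wikimedia.vcl   # hypothetical path and name
```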
[18:41:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiquote and wikibooks to 1.22wmf1 [18:41:10] Logged the message, Master [18:41:56] New patchset: coren; "Adding subnet labs-hosts1-c-eqiad to DHCP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57329 [18:41:59] !log DNS update - adding softwarewikipedia.com for RT-4355 [18:42:07] Logged the message, Master [18:42:38] New patchset: Reedy; "Everything non 'pedia to 1.22wmf1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57331 [18:42:50] !log restarting pdns on ns1 [18:42:58] Logged the message, Master [18:43:23] jeremyb_: dig softwarewikipedia.com [18:46:25] mutante: ok, but no apache yet i see [18:46:34] nope [18:46:45] mutante: do .org while you're in dns? i'll make a ticket for MM [18:48:52] !log Removed caesium and xenon from /home/wikipedia/common/docroot/noc/pybal/eqiad/parsoid [18:49:01] jeremyb: ah, another one that is ours but not using our DNS servers yet, yep doing so [18:49:02] Logged the message, Mr. Obvious [18:49:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57331 [18:49:39] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57329 [18:54:05] !log DNS update - also adding softwarewikipedia.org [18:54:12] Logged the message, Master [18:55:50] cmjohnson1: Success! [18:56:00] woot! [18:56:04] finally [18:56:18] So yeah, I also needed to add the subnet to DHCP. [18:56:58] so coren..when I rebooted 1001 earlier I think the raid may need to be fixed..i didn't see the VD's i added yesterday [18:57:23] cmjohnson1: That's because I removed them; we're experimenting with software raid over JBOD for this one. [18:57:52] ah..okay ..cool [18:58:14] Didn't touch the other three though. [18:58:41] ok [18:59:46] Ryan_Lane: how hard would it be to make testwiki use eqiad apaches? 
[19:00:22] New patchset: Catrope; "Remove xenon and caesium from Parsoid service, RobH is reclaiming them" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57334 [19:00:52] AaronSchulz: Surely it just needs all the specifics for it/srv193 stripping out and it should work? [19:01:10] I'd imagine it's easy, but it still hasn't been done [19:01:31] is there NFS in eqiad? [19:02:05] it's not mounted by the apaches [19:02:12] any such hacks should be removed [19:02:19] Well that's how testwiki used to run [19:02:20] AaronSchulz: probably not terribly hard [19:02:23] NFS mounted on srv193 [19:02:29] Direct synchronization with fenari [19:02:39] but we'd need to mount NFS in eqia [19:02:42] *eqiad [19:02:46] and I think we all want to move away from that [19:02:49] hm [19:02:50] actually [19:02:51] Yes [19:02:57] didn't we already move to tin? [19:02:58] So testwiki needs some sort of different architecture [19:03:08] I mean, it could run directly on the deploy host maybe? Is that excessively evil? [19:03:20] that's evil and I'll stab you [19:03:20] kind of [19:03:28] (I mean, it's a bit scary because DoSing test.wp.o DoSes the deploy host) [19:03:49] I don't want a world exposed web server on deployment ever again [19:03:53] * AaronSchulz takes away Roan's crank pipe [19:04:05] crank? [19:04:15] Reedy: crank/speed/meth [19:04:28] I thought the typo was of crack ;) [19:04:54] Reedy: http://en.wikipedia.org/wiki/Crank [19:05:15] haha, directly on deploy host [19:06:30] crack works too in this context? [19:08:33] Oh, ffs. The H700 doesn't have a JBOD mode; you need to actually make 12 raid0s [19:13:44] coren: yeah..i guess i should've told you that earlier [19:14:06] cmjohnson1: I'll live. :-) [19:14:33] we have yet to find a quality controller that supports jbod [19:14:54] JBOR0 will do the same. [19:15:00] faidon knows all about this from our search for controller for swift [19:34:24] Now /that/ is jbod. sda .. 
sdx :-) [19:51:34] Do we have a standard name for "storage partition"? Trying to name the volume group so that it'll be evident for everyone. [20:00:20] You guys don't normally have a separate /boot? [20:01:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:02:00] ... nor /var [20:02:07] Ryan_Lane: Help. :-) [20:02:22] Coren: do you already know the partman and netboot stuff in puppet? [20:02:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [20:02:35] should be existing partman recipes if you're lucky [20:02:56] mutante: Should be, but that's an atypical system. I can use it to guide the OS disk though. [20:02:57] /puppet/files/autoinstall/partman [20:03:37] i see, so to answer your question about having separate boot partitions [20:03:45] some have, some don't [20:03:59] f.e. lvm.cfg:mountpoint{ /boot } [20:04:05] in lvm.cfg , partman [20:04:18] Looking at lvm-noraid-large, which is closest to what I'm doing, there's a /boot [20:04:35] i would try to use an existing one if possible at all [20:04:45] and if not, create a new one as a copy of an existing one [20:06:13] * Coren fails to find a pattern between have /boot, vs dont and /var vs /run [20:07:06] And the gluster bricks don't seem to be there. :-) [20:07:43] Coren: legacy :) [20:08:00] it's always the answer to "why the fuck do you guys.... ?" [20:08:26] RECOVERY - Puppet freshness on db1053 is OK: puppet ran at Wed Apr 3 20:08:21 UTC 2013 [20:08:35] Do we know what the current fad is, then? :-) [20:09:22] Coren: see, I keep trying to push for us to switch to html9 responsive boilerstrap js [20:09:27] we only use /boot if it's necessary to have it [20:09:32] but it's a tough sell [20:09:34] e.g. with LVM [20:10:12] But but, /boot hasn't been necessary with lvm since LILO. :-) [20:10:40] Maybe grub1 still has trouble with it? 
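The back-and-forth above about which boxes have a separate /boot or /var is answerable per host by checking the mount table. A minimal sketch (the function name is an invention; it just looks for the path as a mount point in /proc/mounts):

```shell
# Succeed if the given path is its own mount point (e.g. a separate /boot),
# fail if it just lives on the parent filesystem.
has_own_mount() {
    awk -v t="$1" '$2 == t { found = 1 } END { exit !found }' /proc/mounts
}

# has_own_mount /boot && echo "separate /boot" || echo "/boot lives on /"
```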
[20:10:41] it was for something, otherwise we didn't add it ;) [20:10:48] not sure what the issue was [20:10:52] also not sure it's still relevant today [20:11:01] but it was necessary when we setup that partman recipe anyway ;) [20:11:16] Pretty sure it's not. Separate /var for bottling logs not traditional either I see. [20:11:43] i'm pretty sure it was necessary up to at least lucid [20:11:47] but precise, maybe not [20:12:38] This thing is just a fileserver, with the actual storage on another array. Just / it is. [20:21:21] Ryan_Lane: So, that config ends up with a raid1 split between controllers of the OS disk (1 each side), two 10-disk raid6 (5 each side) with LVM over it (32T usable), and two disks set aside for snapshots for replication and/or backups. [20:21:47] (Actually, the latter two are raid0 (one each side) for performance) [20:22:36] Performance wise, this should go wooosh! [20:24:46] Coren: hsoooooooooooooooow [20:29:34] RoanKattouw_away: So i see you put in patchset, shall I merge? [20:30:00] RobH: Yes please [20:30:20] Icinga will complain about the boxes otherwise [20:30:36] Although I guess it might still do that unless we put in an ensure=>absent, I don't know puppet well enough to tell [20:31:44] Coren: :) [20:34:14] !log updated webstatscollector package in apt repo to 0.1-3 [20:34:21] Logged the message, Master [20:35:25] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57334 [20:35:54] RoanKattouw: So once its merged and live on cluster, what machines specifically need puppet updates before I can pull them? [20:36:07] I don't think anything does [20:36:12] cool, mediawiki handles shit then [20:36:15] huzzah! [20:36:15] They're already depooled in pybal [20:36:20] good times [20:36:25] thanks! [20:36:25] No, MW doesn't even care, MW just goes to the load balancer [20:36:29] which is already aware [20:36:42] what is the lvs server for these? [20:36:49] just internal lvs1003(or 3) [20:36:50] ?
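Coren's "32T usable" figure above checks out with simple arithmetic. RAID6 gives up two disks' worth of capacity to parity per array; the 2TB-per-disk size is an assumption inferred from the quoted total, since the log never states it.

```shell
# Back-of-envelope check of the quoted labstore1001 layout:
# two 10-disk raid6 arrays, one per controller side.
raid6_usable_tb() {
    disks="$1"; per_disk_tb="$2"
    echo $(( (disks - 2) * per_disk_tb ))
}

per_array=$(raid6_usable_tb 10 2)   # one 10-disk raid6 of (assumed) 2TB disks
total=$(( per_array * 2 ))          # two such arrays
```

With those assumptions, `per_array` comes out to 16 and `total` to 32, matching the figure in the log.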
[20:36:55] The only thing is deployments might break for the next few hours until tin runs puppet, but we very rarely deploy changes anyway [20:36:57] Yes [20:37:05] lvs100{3,6} or something [20:37:10] ok, well, i'll push a puppet run on tin just in case [20:37:15] may as well eliminate potential error vectors [20:48:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56348 [20:53:19] New patchset: Lcarr; "temp removing caesium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57411 [20:53:40] PROBLEM - Host caesium is DOWN: PING CRITICAL - Packet loss = 100% [20:54:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57411 [21:05:35] PROBLEM - Parsoid on xenon is CRITICAL: Connection refused [21:06:58] New patchset: Dzahn; "decom xenon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57417 [21:07:48] What's up with https://upload.wikimedia.org/wikivoyage/he/a/a5/Luxembourg_districts.jpg ? [21:07:50] http 401... [21:09:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56640 [21:10:37] New patchset: MaxSem; "Check mobile site's HTTP status" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57419 [21:15:44] New patchset: Lcarr; "making caesium standard" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57420 [21:16:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57420 [21:31:59] !log Running DNS update [21:32:07] Logged the message, Master [21:40:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57417 [21:43:48] New patchset: Andrew Bogott; "Remove gluster's broken logrotate script." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57426 [21:44:13] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:44:13] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [21:44:13] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [21:44:44] !log started iwlinks index migrations on all wikis [gerrit 43389] [21:44:53] Logged the message, Master [21:59:46] New patchset: Lcarr; "switching caesium to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57428 [21:59:46] New patchset: Lcarr; "no longer decom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57429 [22:00:59] Ryan_Lane: I don't want to go on a tangent on the list. But actually yes I'm not a big fan of wikidata. [22:01:19] It has big user experience problems at the moment. [22:01:19] New patchset: Dereckson; "(bug 46856) Rights configuration for Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57430 [22:05:11] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57428 [22:05:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57429 [22:06:14] StevenW: it's a new project [22:06:20] those issues can be worked out [22:06:24] * Ryan_Lane shrugs [22:06:30] Yeah in the long run. [22:06:34] are you opposed to the concept of semantic annotations? [22:06:41] I'm not a big fan of Wikipedia. [22:06:48] heh [22:06:48] i'm not a huge fan of mediawiki - it has major scaling issues and horrible performance [22:06:48] It has big user experience problems at the moment. 
[22:06:54] LeslieCarr: +1 [22:07:00] heh [22:07:30] that said, it's better than most open source software for managing content [22:09:21] StevenW: mediawiki is actually horrible for organizing content [22:09:46] it works for wikipedia because we have 85k people or so working on it [22:09:58] And it hardly works for that, other than in articles ;) [22:10:09] and they are masochistic enough to deal with it [22:10:33] I'm not disputing the sucky parts of the current system. I just don't believe the idea that migrating to SMW is going to solve all our problems. [22:10:33] properly using things like SMW and SF make it manageable with a much smaller group of people [22:10:49] I don't believe anyone said anything about solving all of our problems [22:11:10] all our problems with sanely organizing MediaWiki documentation [22:11:24] so I was looking at the gazillion categories on en wp today [22:11:31] plus subcategories plus etc etc [22:11:38] fortunately no one uses them for anything [22:11:44] typical page access works like: [22:11:51] search in google. click link. done [22:12:49] StevenW: this isn't about mediawiki documentation [22:12:50] so we have a very nice reference for answering a specific question [22:12:54] it's about non-mediawiki documentation [22:13:04] project documentation and such [22:13:18] which end up being the same thing in a lot of cases [22:15:20] binasher: Is your script doing dewiki currently? [22:15:29] And/or wikidata [22:15:44] Reedy: nope, enwiki [22:15:52] the script I'm running for wikidata is complaining of 17536 lag... [22:15:54] I would like to see annotation and tags on current pages (though if we had it they wouldn' be searchable in any helpful way by us, only via google) [22:16:04] Reedy: running in pmtpa? [22:16:10] Reedy: that would be the pages logging [22:16:15] xml [22:16:23] it will be done in [22:16:24] Aha [22:16:40] 7 hours or something [22:16:48] but it's just the one db [22:16:56] Hmm. 
Do I leave it running, or cancel and restart it tomorrow.. [22:17:01] StevenW: what specifically is the same? [22:17:04] if you can get your script to find another slave.... [22:17:14] I can't think of any real examples there. [22:17:25] mediawiki extension documentation -> mediawiki.org [22:17:37] project planning for it -> wikitech [22:17:46] Ugh [22:17:46] apergos: It's using mediawikis wfwaitforslaves(), so it'd have to be moved to eqiad [22:17:47] so yet another wiki [22:17:54] It's not urgent by any extent [22:17:58] ok [22:18:01] sorry about that [22:18:06] Don't worry [22:18:15] StevenW: people should be writing infrastructure documentation for stuff in wikitech already [22:18:19] It's gonna take a very long time to run anyway [22:18:25] if they aren't there's something wrong [22:18:28] what are you running? [22:18:52] "Wikibase/repo/maintenance/rebuildTermsSearchKey.php" [22:19:14] it's kind of dickish to 3rd parties to stuff mediawiki.org full of wikimedia specific documentation anyway [22:20:56] I only need to find what is the first ID that is '', and restart it from there [22:29:10] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:35] anyone seen http://pastebin.com/idvJygKH before ? [22:29:45] haven't seen this error on sockpuppet before [22:32:28] RobH: ? [22:32:59] ottomata: it's easier to deal with labs if we try to centralize work to areas, rather than specific things [22:33:14] so, all analytics under analytics makes things easier [22:33:33] it's possible to manage sudo policy so that everyone doesn't have root on instances, if that's a worry [22:39:12] i have seen that. [22:39:15] i just dont recall when [22:39:25] I think when we added a new subnet to row C we had to update the puppet server to talk to it. 
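Reedy's resume plan above ("find what is the first ID that is '', and restart it from there") is simple to script once the ids and search keys are dumped somewhere. A sketch over a hypothetical tab-separated `row_id<TAB>search_key` export; the log doesn't show the real table layout or how you'd produce the dump, and it assumes every line has both fields.

```shell
# Print the first row id whose search key is empty, i.e. where the
# interrupted rebuildTermsSearchKey run should pick up again.
first_empty_id() {
    awk -F'\t' '$2 == "" { print $1; exit }' "$1"
}
```

The id it prints would then be fed back to the maintenance script's resume option, whatever that is for this script.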
[22:39:43] LeslieCarr: your rdns is probably wrong [22:40:03] ahha [22:40:05] that may have been it as well [22:40:06] rdns is wrong [22:40:10] cool [22:42:13] Ryan_Lane, cool, that's fine [22:42:19] i have no preference really, so we can do under analytics [22:42:20] danke [22:42:25] cool [22:42:33] ottomata: let me know if instances fail to create [22:42:39] you're very likely to hit quotas [22:42:57] mmk [22:43:09] yeah we have i dunno, 7ish instances already in that project [22:43:10] ? [22:43:11] maybe 5? [22:43:49] usually you'll hit a quota some time before 10 [22:43:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57430 [22:43:52] definitely at 10 [22:43:58] let me just increase it now [22:44:28] oh. that project has already been increased [22:44:55] I upped it a bit more [22:44:56] cool, whats the limit? [22:45:14] 20 instances, 60 cores, 51200MB RAM, [22:45:20] !log reedy synchronized wmf-config/InitialiseSettings.php [22:45:21] mmk, cool [22:45:25] should be fiiiiine [22:45:27] Logged the message, Master [22:45:29] yeah [22:45:29] i'm out for the eve, thanks Ryan! [22:45:33] yw [22:45:37] see a [22:45:39] &ya [22:45:44] ugh. typing [22:51:18] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:54:28] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [22:54:39] eep checking out lvs1001 [22:55:01] hrm, lvs1001 seems ok [22:55:28] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:02:17] New review: Krinkle; "*bump*. Please finish this or allow us to get wikibugs in #mediawiki-visualeditor by other means." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [23:03:18] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [23:06:34] PROBLEM - Puppet freshness on xenon is CRITICAL: No successful Puppet run in the last 10 hours [23:07:59] Which is humes counterpart in eqiad? [23:09:20] LeslieCarr: did you get puppet working? (your paste) [23:10:01] grrr, the rdns hasn't fallen out of cache yet [23:10:08] RobH: ^^ [23:10:25] Reedy: terbium? not sure [23:10:34] Reedy: terbium? not sure [23:10:38] Yeah, I got it [23:10:43] Then clicked part instead of copy [23:10:44] ;) [23:10:46] hehe [23:10:46] hah [23:11:50] reedy@terbium:~$ ls -al /usr/local/apache/common [23:11:50] lrwxrwxrwx 1 root root 12 Mar 12 19:50 /usr/local/apache/common -> common-local [23:11:53] All in red [23:12:04] Do we have any servers in eqiad that mwscript works on? :/ [23:12:23] there's an rt or bz or both on that [23:13:48] binasher: so; I'm going to run the centralnotice sql patch now unless you have any reason not to at this particular moment [23:15:11] Hello random apache [23:15:12] reedy@mw1001:~$ sudo -u apache php /usr/local/apache/common/multiversion/MWScript.php eval.php fiwiki [23:15:43] :D [23:15:52] mwalker: go for it [23:18:50] !log Updating CentralNotice schema on testwiki & metawiki with patch-centralnotice-2_3.sql [23:18:57] Logged the message, Master [23:34:39] What could be reasons for "401 Unauthorized" when trying to view any uploaded files in he.wikivoyage? [23:35:45] was reported as https://bugzilla.wikimedia.org/show_bug.cgi?id=46863 and I can reproduce [23:40:04] andre__: Swift sucking? [23:40:06] Fixed anyway [23:40:37] https://upload.wikimedia.org/wikivoyage/he/a/a5/Luxembourg_districts.jpg [23:40:39] WFM [23:40:53] Reedy, heh, now it also works for me again. [23:41:07] Reedy, so did you fix something? 
[23:41:10] confirmed, was broken earlier, fixed now [23:41:10] By say fixed anyway, I did something to try and fix it ;) [23:41:16] ah [23:41:19] thanks! [23:42:11] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:43:08] andre__: merged: bugzilla_report changes for urgent tickets and your realname on planet [23:43:15] cya [23:43:29] mutante, saw that. big thanks! [23:43:37] np