[00:09:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:14:45] New patchset: Tim Starling; "Send fatal errors to mxircecho on nfs1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19408
[00:15:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19408
[00:15:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.643 seconds
[00:15:51] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19408
[00:50:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:57:26] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[01:04:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds
[01:06:32] !log restarting apaches to get new wmerrors configuration
[01:06:40] Logged the message, Master
[01:11:14] PROBLEM - Apache HTTP on srv196 is CRITICAL: Connection refused
[01:36:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:40:31] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 216 seconds
[01:40:39] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 233 seconds
[01:47:06] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 612s
[01:47:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.012 seconds
[01:51:27] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[01:52:03] RECOVERY - Apache HTTP on srv196 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time
[01:55:03] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds
[01:58:39] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 17s
[01:59:15] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 18 seconds
[02:14:33] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[02:21:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:34:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.015 seconds
[02:44:36] any op around? need to send a big file
[02:46:59] TimStarling: can I send it to you?
[02:47:36] isn't there a procedure for that?
[02:47:47] don't know
[02:52:04] you want to upload it to commons?
[02:54:27] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[02:58:46] TimStarling: no, it is a bed file
[03:02:39] TimStarling: Most large file uploads involve a Bugzilla ticket and a shell user slurping the content from a public repo, I think.
[03:02:55] I don't think there's any formal procedure, though.
[03:04:10] what's a bed file?
[03:04:52] *deb
[03:05:18] how big is big?
[03:06:25] 25mb
[03:09:27] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[03:10:36] is this a file that you created? or did you download it from opensiddur.org?
[03:12:23] New patchset: Tim Starling; "Install LuaSandbox extension on MW apaches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19425
[03:13:05] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19425
[03:20:42] TimStarling: I built a package for the sources there
[03:20:55] https://bugzilla.wikimedia.org/show_bug.cgi?id=39324
[03:22:00] can you just send me the debian directory?
[03:22:45] sure
[03:23:06] any address/server?
[03:24:27] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[03:24:46] TimStarling: scp would be best
[03:25:05] it should only be a few KB, you can send it by email
[03:25:16] 8kb
[03:25:24] ts@wikimedia.org?
[03:26:35] tstarling@wikimedia.org
[03:28:25] sent
[03:31:58] you used Open-Siddur-Project-Hebrew-Font-Pack-1.8.zip ?
[03:33:39] yes
[03:34:33] I removed one non-free font, and updated the versions of some
[03:39:26] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[03:42:52] have you tested it? the directory layout looks unconventional
[03:44:28] also the "GNU FreeFont" fonts are duplicates of fonts that are already available on the system
[03:45:04] I installed it locally
[03:45:41] they aren't duplicates, they are updated versions (in case of using 10.04)
[03:53:45] so how would rsvg know which copy to use?
[03:55:38] TimStarling: re wikimedia-fonts pkg: wouldn't it be better to just get everything we need into debian/ubuntu and use our local repo until they get in upstream?
[03:55:51] or is wikimedia-fonts just a meta package?
[03:56:35] TimStarling: I prepared a package, you may use it or not. you are the sysadmin
[03:57:03] wikimedia-fonts was just in our local repo until recently, when it was removed because the relevant fonts were upstream
[03:57:49] so what would be the verdict?
[03:58:05] normal font packages seem to have an /etc/fonts/conf.d configuration file
[03:58:36] can you stick the debian dir in git somewhere?
[03:59:02] for wikimedia-fonts? or hebrew-fonts?
[03:59:02] * jeremyb can't really look at it today though. maybe in 36 hrs
[03:59:07] TimStarling: the new one
[03:59:19] I'm not sure if it's usable
[03:59:53] I can put it in git but I would need to create a project for it under operations/debs
[03:59:56] my point was that you're the only one that can see that right now ;P
[03:59:59] and that means knowing what the package will be called
[04:00:10] I can attach it to the bug
[04:00:14] anyway, bbl
[04:00:15] that works
[04:00:23] (-> bug)
[04:11:11] New review: Tim Starling; "The package seems to be broken on Lucid. Someone installed it on mw8, and it failed to configure, wi..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7349
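[Editor's note: the quickest way to vet a debian/ directory like the one attached to bug 39324 is a local build plus a contents listing; that would have surfaced both the unconventional layout and the duplicated GNU FreeFont files Tim points out above. A minimal sketch; the source-tree and package names are hypothetical, not taken from the bug:]

```
# Rebuild and inspect the font package locally before it goes near an apache.
# Directory and package names below are placeholders.
cd open-siddur-hebrew-fonts-1.8/          # unpacked upstream sources + debian/
debuild -us -uc                           # build an unsigned .deb
dpkg -c ../hebrew-fonts_1.8-1_all.deb     # list installed paths, check layout
lintian ../hebrew-fonts_1.8-1_all.deb     # flag common packaging problems
```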
[04:15:13] Who should I CC on tickets such as https://bugzilla.wikimedia.org/show_bug.cgi?id=39322 ?
[04:20:32] depends on whose fault it is
[04:23:30] It sounds IPv6 related, so I was guessing network people.
[04:23:47] I just couldn't remember who besides Mark was networking related.
[04:24:00] Then I wondered if there was a chart. Then I went back to my show.
[04:24:53] well, Leslie is also a network person, but it's not really clear if that is a network issue
[04:25:04] it would be useful if someone else could confirm it
[04:27:06] It isn't clear to me how well IPv6 is supposed to be supported right now on Wikimedia wikis, so I wasn't sure what priority it had. I recall it being mostly experimental.
[04:27:18] Though I think https support is also still allegedly experimental...
[04:27:38] "unsupported"
[04:39:38] Brooke: https is not unsupported
[04:39:41] secure is
[04:39:45] https is fully supported
[04:40:03] AFAIK ipv6 should also be fully supported
[04:40:19] I filed a bug.
[04:40:51] It's listed as "https services (unsupported)" at http://status.wikimedia.org
[04:40:54] are we using apache as our IPv6 frontend?
[04:40:55] that's secure
[04:40:58] i think watchmouse says that HTTPS is unsupported?
[04:41:09] in fact, I'd say we don't support secure at all
[04:41:23] hmmm, varnish I guess
[04:41:36] TimStarling: nginx is ipv6 frontend
[04:41:36] TimStarling: no, it's nginx
[04:41:36] Ryan_Lane: You mean secure.wikimedia.org?
[04:41:40] Brooke: yes
[04:41:50] secure can die in a fire as far as I'm concerned
[04:41:52] Via: 1.1 varnish
[04:41:56] Ryan_Lane: Right. I'd say that's the case as well. But I'm mostly concerned with status.wikimedia.org.
[04:42:04] TimStarling: well, nginx is an ssl termination proxy
[04:42:05] Which doesn't make the distinction, as far as I can see.
[04:42:07] maybe varnish behind nginx?
[04:42:08] it should be transparent
[04:42:28] Brooke: and here's the reason I never wanted to list that damn service on that page at all
[04:42:33] anyway I get apache errors if I try to use the IPv6 address as a browser proxy
[04:42:50] Squid converts them into proper requests with Host headers
[04:42:52] Ryan_Lane: Well, maybe it should be removed. :-) But listing it incorrectly is a pretty terrible option to land on.
[04:42:54] I guess varnish is not doing that
[04:43:04] Brooke: I don't disagree
[04:43:15] it should have been listed as secure (unsupported)
[04:43:19] And https is not unsupported. You're funny this evening. ;-)
[04:43:34] if there's a problem with https, I'll fix it
[04:43:53] if there's a problem with secure, I'm not going to put in a ton of effort
[04:43:57] Right. I think there's just a labeling problem with status.
[04:44:06] Anyway, I filed a bug. Someone will look at it at some point. It's very minor.
[04:44:16] which bug?
[04:44:29] oh, with status?
[04:44:36] https://bugzilla.wikimedia.org/39325
[04:44:38] Yes.
[04:45:06] That site (status.wikimedia.org) is only going to grow in prominence. It's important that it be corrected at some point.
[04:45:15] Ryan_Lane: have you got a browser with an IPv6 connection handy?
[04:45:18] We don't want more and more people to think https is unsupported.
[04:45:25] TimStarling: hm. maybe
[04:45:32] I guess I could install X on my VPS...
[04:45:43] * jeremyb runs away
[04:45:55] ugh. seems I don't
[04:45:59] damn US ISPs
[04:46:10] I thought comcast was supposed to enable it?
[04:46:29] just looking for someone to confirm bug 39322
[04:47:08] I tried proxying through an ssh tunnel, but like I say, I just get 404 errors, probably due to the lack of a Host header
[04:48:19] and maybe it wouldn't work anyway, maybe it is a problem with the browser's IPv6 support
[04:49:03] You could ask orzel if he's having the problem with any other sites. He's in #wikimedia-tech.
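[Editor's note: Tim's proxy experiment above fails in a way consistent with requests reaching the backend without a Host header, which is exactly what proxying through a tunnel strips. A hedged, browserless sketch for anyone trying to confirm bug 39322; the literal address uses the RFC 3849 documentation prefix as a placeholder, not the real service IP:]

```
# Force IPv6 by hostname (needs published AAAA records and a v6 route):
curl -6 -v 'http://en.wikipedia.org/wiki/Special:Watchlist'
# Or hit a literal v6 address while supplying the Host header yourself,
# which is the part the ssh-tunnel proxy test above was missing:
curl -6 -v -H 'Host: en.wikipedia.org' 'http://[2001:db8::1]/wiki/Main_Page'
```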
[05:08:32] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[05:08:32] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[05:38:18] mornin
[06:08:48] yes it is
[06:13:08] :)
[06:14:33] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[06:46:57] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:25:56] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[09:36:53] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[09:39:53] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[09:39:54] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[09:39:54] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[10:57:53] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[11:51:54] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[12:15:53] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[12:54:54] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[13:09:53] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[13:24:53] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[13:39:54] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[14:25:12] PROBLEM - Host virt1005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:30:45] RECOVERY - Host virt1005 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms
[14:34:39] PROBLEM - SSH on virt1005 is CRITICAL: Connection refused
[14:49:20] PROBLEM - Host virt1005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:04] morning mr. notpeter, you up and about?
[14:51:14] did you get to check out the analytics dell partman stuff at all?
[14:59:14] RECOVERY - SSH on virt1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[14:59:23] RECOVERY - Host virt1005 is UP: PING OK - Packet loss = 0%, RTA = 35.32 ms
[15:09:54] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[15:09:54] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[15:10:23] morning RobH!
[15:10:31] i'm on to figuring out how to install stat1001
[15:10:50] do you know how to connect to console or boot for install?
[15:10:54] i'm looking at this page right now: http://wikitech.wikimedia.org/view/Dell/PowerEdge_1950_2950#Connecting_to_serial_console
[15:13:52] i know that it is an R510
[15:13:59] but there isn't that much info on the R510 page
[15:16:54] all the same
[15:17:00] connect to root@stat1001.mgmt
[15:17:03] console com2
[15:17:08] but...
[15:17:16] let's first make sure to put it in the analytics subnet
[15:17:26] ok, well i've only done c2100s thus far
[15:17:29] and that is ipmi
[15:17:31] this is ssh, right?
[15:17:32] connect com2 the old ones, console com2 the new ones
[15:17:37] i just tried sshing from sockpuppet, and from my local
[15:17:57] what is stat1001 going to be used for?
[15:18:12] kinda like stat1, various analytics stuff, but also for hosting some sites
[15:18:19] the new reportcard will go there
[15:18:33] current stats.wikimedia.org will be moved there (I think it is currently on spence :p )
[15:19:29] so hm, this will have a public ip
[15:19:43] so forward dns in wikimedia.org?
[15:20:00] then you're already set
[15:20:02] it already has a public ip
[15:20:06] perfect
[15:20:06] great
[15:20:31] so networking should be ready to auto config
[15:20:33] PROBLEM - NTP on virt1005 is CRITICAL: NTP CRITICAL: Offset unknown
[15:20:41] need to make sure partman stuff is set up?
[15:21:11] or you do it manually
[15:21:53] oh that's fine
[15:21:54] PROBLEM - Host virt1008 is DOWN: PING CRITICAL - Packet loss = 100%
[15:21:57] i can do that during installer?
[15:22:04] since there is only one of these?
[15:22:22] yes
[15:22:24] k
[15:22:28] use LVM
[15:22:32] can do
[15:22:32] and you have lots of flexibility later
[15:22:34] for / too?
[15:22:38] just do whatever was done for stat1 really
[15:22:46] aye
[15:22:49] not necessarily, get 30 GB for / and you're good
[15:22:55] ok cool, yeah that's what i'd do
[15:22:58] cool
[15:23:06] PROBLEM - Host virt1007 is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:47] will do mirrored 30G raid on /, and maybe raid 10 for rest of /a with lvm
[15:24:10] according to the r510 wikitech page, this has 12 drives so ja
[15:24:12] um, so, ok
[15:24:20] yeah, how do I access the console, or tell it to boot installer?
[15:24:22] ssh?
[15:24:29] and always keep free space in LVM
[15:24:32] yeah
[15:24:36] at least 10%, but normally, just use what you need
[15:24:37] will keep like 20% free or something
[15:24:38] yeah
[15:24:39] increasing is easy, decreasing is not
[15:24:42] aye yeah
[15:24:49] yes, ssh
[15:24:54] from sockpuppet?
[15:24:54] RECOVERY - Host virt1008 is UP: PING OK - Packet loss = 0%, RTA = 35.81 ms
[15:24:58] from anywhere
[15:25:17] hm, oh host key changed on sock puppet, hang on
[15:26:29] ah cool in
[15:27:00] RECOVERY - Host virt1007 is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms
[15:27:48] hm, so how do I tell it to boot install?
[15:28:31] racadm serveraction powercycle
[15:28:36] then F12 in bios
[15:28:43] hmk
[15:28:44] there's also a command to have it pxe boot once, I forget what it is
[15:28:50] rob would know
[15:29:42] is that from the admin cli or from console after console com2?
[15:29:46] console
[15:30:17] hmmm, console seems to not really respond much
[15:30:23] it goes into console mode
[15:30:29] but doesn't respond
[15:36:10] hm, i would rather tell it to boot
[15:36:14] since my F buttons are all mapped to stuff
[15:36:19] not sure how to send a real F132
[15:36:22] F12
[15:36:23] but
[15:36:31] console doesn't respond for me at all
[15:36:57] oh, btw, I don't think I have a wikitech account
[15:37:02] should I?
[15:38:40] ottomata: use escape-shift-2
[15:38:48] yea
[15:39:06] k, i'm out
[15:39:07] ~.
[15:39:10] i think worked
[15:39:18] RECOVERY - NTP on virt1005 is OK: NTP OK: Offset -0.03916990757 secs
[15:42:48] hm
[15:42:51] yeah i just got a black screen
[15:42:54] couldn't type anything
[15:44:32] should I try again?
[15:46:38] nope
[15:46:52] whats happening?
[15:47:32] no
[15:47:36] i get to the admin cli
[15:47:49] and then do
[15:47:51] console com2
[15:48:13] I see 'connected to serial device...'
[15:48:15] for a second
[15:48:16] then it clears
[15:48:33] or, rather, a buncha blank lines
[15:48:39] but then doesn't accept input
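[Editor's note, before the boot saga continues below: mark's "increasing is easy, decreasing is not" advice above is why the ~20% of the volume group is left free. Growing into that reserve later is a two-command job; a minimal sketch with illustrative vg/lv names, not the real ones on stat1001:]

```
vgs vg0                         # check how much free space is left in the group
lvextend -L +50G /dev/vg0/a     # grow the logical volume into the free space
resize2fs /dev/vg0/a            # grow ext4 online, no unmount needed
```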
[15:48:58] racadm from admin cli?
[15:49:56] those are my boot commands, no?
[15:50:07] racadm config -g cfgServerInfo -o cfgServerBootOnce 1, etc
[15:50:20] cool
[15:51:13] ok cool it is booting
[15:51:23] are those commands in wikitech somewhere?
[15:51:52] ah yup!
[15:51:53] build a new server
[15:51:55] perfect
[15:51:55] thank you
[15:52:35] cool, sorry, should've referred to that doc when I started a bit ago, all this info is there
[15:52:47] i had seen it before, but had been using the c2100 specific pages for the analytics machines
[15:52:56] so was looking for dell r510 specific pages for this
[15:53:03] ottomata: the normal R series are a lot easier to deal with than the C series
[15:53:14] ayyye cool
[15:53:21] +1
[15:53:40] i mean, i dont hate them as much as say, our old supermicros
[15:53:42] great, and the remaining analytics are r310s
[15:53:48] ottomata: yep, all your misc
[15:53:57] 'misc' hosts, or whatever you guys called them
[15:58:44] hm, ok
[15:59:27] its hanging now
[15:59:57] https://gist.github.com/3350483
[16:00:06] been sitting at that last bit for 5 mins now
[16:05:03] it's not doing anything
[16:05:12] "no boot device available"
[16:05:16] so you didn't select pxe boot
[16:08:18] ottomata: the boot you set with
[16:08:22] racadm config -g cfgServerInfo -o cfgServerBootOnce 1
[16:08:22] racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE
[16:08:26] but that ONLY does the next boot
[16:08:35] if you reboot twice, the second reboot goes back to the defaults (disk first)
[16:09:10] hard to read the output since it writes over itself
[16:09:40] i tend to ctrl+k to clear backlog a lot in serial console
[16:09:57] since it doesnt scroll the terminal, just writes over itself
[16:15:53] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[16:19:44] New patchset: Pyoungmeister; "fixing transitional nagiso breakage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19497
[16:20:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19497
[16:21:04] RobH: apple-r works for that
[16:21:25] mark: during the actual serial console output, your screen scrolls down?
[16:21:37] i know after i exit serial console apple+r sends a reset to terminal to work right
[16:21:50] but i will try during post next time =]
[16:21:52] i press it multiple times during boot
[16:21:58] especially when the colors fuck up
[16:22:02] but i believe it also works for scrolling
[16:22:49] coolness, will have to give it a shot, would be nice to keep a scrolling backlog of the serial redirection
[16:27:36] sorry, was grabbing a sammy
[16:27:44] um, i think I did only tell it to pxe boot once
[16:27:46] will try again
[16:27:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19497
[16:38:54] hm so
[16:39:00] i was getting more dots and spinning slashes!
[16:39:01] DHCP.....\
[16:39:07] but it has now stopped moving
[16:39:25] mark,RobH,cmjohnson1 ^
[16:39:48] is this wrong then?
[16:39:49] Nopboothdevice1available.nteltCorporation v6.0.11
[16:39:49] CurrentmbootemodeeisXset.tovBIOS
[16:39:52] so it never loads up with a ip address ?
[16:39:57] i guess not
[16:40:04] what machine?
[16:40:11] i can check out the tubes and stuff
[16:40:20] leslietubechecker on duty!
[16:40:48] oh, shit i did promise i would only work on fundraising stuff today
[16:40:53] let's pretend it's a fundraising machine
[16:40:56] ;)
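[Editor's note: for reference, the DRAC incantations scattered through the exchange above, collected into one sequence. Every command here is quoted from the log itself; only the comments are added:]

```
# From a host that can reach the mgmt network:
ssh root@stat1001.mgmt                   # DRAC admin cli
racadm config -g cfgServerInfo -o cfgServerBootOnce 1
racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE   # next boot ONLY
racadm serveraction powercycle
console com2                             # attach the serial console to watch PXE/d-i
# to detach: escape-shift-2 leaves console com2; "~." drops the ssh session
```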
[16:42:00] ahha, is it 78:2b:cb:35:c4:f7 ?
[16:42:44] so, if you want to know what's up, i went to brewster (the installation server) and checked out /var/log/syslog
[16:43:30] when you see a line like "dhcpd: DHCPDISCOVER from 78:2b:cb:35:c4:f7 via 208.80.154.131: network 208.80.154.128/26: no free leases" then it usually means that machine's mac address isn't configured in one of the "linux-host-entries" files (ottomata ^^)
[16:43:38] hmmm
[16:43:55] there is an entry
[16:43:56] but it is
[16:44:00] host stat1001 {
[16:44:01] hardware ethernet 78:2b:cb:1e:79:dd;
[16:44:27] ok, so how do I check MACs on the r510?
[16:44:41] on the management you can check out i think it's "racadm getsysinfo"
[16:44:55] k
[16:44:58] (and was the mobo changed on this machine to fix it?)
[16:46:16] weird, I don't see 79:dd in the sysinfo at all
[16:46:19] dunno where that came from
[16:46:29] i do see c4:f7
[16:46:32] will change in host entries
[16:47:53] it could have been a motherboard swap ?
[16:48:00] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[16:48:07] New patchset: Ottomata; "Fixing MAC addy for stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19499
[16:48:14] aye oh yeahhh
[16:48:19] they did have hw issues with this machine
[16:48:20] cool
[16:48:24] think they did a swap
[16:48:40] can you approve that one?
[16:48:44] https://gerrit.wikimedia.org/r/19499
[16:48:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19499
[16:49:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19499
[16:49:18] approved :)
[16:49:23] danke!
[16:49:29] are you going to merge on sockpuppet then update brewster ?
[16:52:47] done already
[16:52:58] booting now
[16:53:23] i hope this works!
[16:53:30] i've been trying to install OSes on servers for a week now!
[16:53:34] haven't yet had a success! :p
[16:55:44] looking better!
[17:00:23] should I see more than just sda when I am doing manual partitioning during boot?
[17:00:48] ottomata: so that system is a raided system
[17:00:52] it has 12 disks, but in a raid10
[17:01:08] if you all dont need raid10 performance, you can get quite a bit more space by going raid6
[17:01:08] already?
[17:01:10] oh hw raid!
[17:01:12] ohhhhhhhh
[17:01:30] 6TB as is
[17:01:32] that should be fine
[17:01:35] so you may wanna change the raid to raid6 if you dont need the high performance writes for raid10
[17:01:35] cool even easier
[17:01:41] but otherwise raid10 is fastest
[17:02:45] New patchset: Pyoungmeister; "adding real password lookups for vumi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19500
[17:03:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19500
[17:06:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19500
[17:09:30] mark, i see on stat1, root is LVM
[17:09:36] is that ok?
[17:09:43] should I do that for stat1001
[17:09:44] ?
[17:09:46] New patchset: Ryan Lane; "Rework openstack manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[17:09:48] paravoid: ^^
[17:09:54] I'm going to −2 this for now
[17:10:21] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19501
[17:10:27] New review: Ryan Lane; "Glance's configuration is wrong between diablo and glance. This needs to be fixed before being pushe..." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/19501
[17:10:49] New patchset: Pyoungmeister; "silver and zhen: wikidev group needed to have a user." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19502
[17:11:25] holy crap
[17:11:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19502
[17:11:30] heh
[17:11:36] have I told you how I hate mega-patches? :)
[17:11:38] I told you it was large ;)
[17:11:49] would you rather 15 smaller patches?
[17:11:54] definitely yes
[17:11:58] bleh
[17:12:09] that's the normal git workflow too
[17:12:10] that would take forever and a day to code review
[17:12:56] that's because gerrit's broken, let's not go over this again :)
[17:13:18] Why would it take more forever than reviewing the one giant patch?
[17:13:27] I guess if it's a series of coupled patches...
[17:13:37] because I'd need to push in a change, wait for a review and push in another
[17:13:40] andrewbogott: because no one is able to review this, so we'll just +2 it? :P
[17:13:45] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19502
[17:13:48] er, what?
[17:13:48] <^demon> Gerrit doesn't force you to use mega-patches, don't blame it :)
[17:13:50] if I push in 10, then each one needs to be reviewed individually
[17:13:55] Well, yeah, obviously reviewing is slower than not reviewing :)
[17:14:23] In the short-term at least :/
[17:14:34] if there's something wrong with an individual change, I'd need to push in a follow up
[17:14:40] then *it* would need to be reviewed
[17:14:46] and each step of the way I'd be waiting
[17:15:00] You can push a long, dependent patchset to gerrit.
[17:15:09] this isn't any less reviewable than 10 separate patches
[17:15:09] As long as reviewers review them in order it's not much of a problem.
[17:15:21] (If reviewers review the last patches before the first, then it gets messy.)
[17:15:55] <^demon> Ryan_Lane: How about 18 identical patches? https://gerrit.wikimedia.org/r/#/q/status:open+operations+owner:Demon+branch:master,n,z :)
[17:16:05] heh
[17:16:12] ^demon: I see what you did there
[17:16:12] <^demon> (need approval, hint hint)
[17:16:23] <^demon> :D
[17:16:35] Change merged: Ryan Lane; [operations/dns] (master) - https://gerrit.wikimedia.org/r/18679
[17:16:49] Change merged: Ryan Lane; [operations/debs/wikimedia-search-qa] (master) - https://gerrit.wikimedia.org/r/18676
[17:16:50] ^demon: and what would happen if we changed gerrit's SSH port?
[17:17:00] <^demon> Lots and lots of updates :)
[17:17:00] another 18 commits?
[17:17:01] Change merged: Ryan Lane; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/18675
[17:17:14] Change merged: Ryan Lane; [operations/debs/wikimedia-ldap-tools] (master) - https://gerrit.wikimedia.org/r/18674
[17:17:22] <^demon> Alternatively, fix git-review to be a little less braindead.
[17:17:25] Change merged: Ryan Lane; [operations/debs/wikimedia-keyring] (master) - https://gerrit.wikimedia.org/r/18673
[17:17:37] Change merged: Ryan Lane; [operations/debs/wikimedia-job-runner] (master) - https://gerrit.wikimedia.org/r/18672
[17:17:40] sigh
[17:17:46] Change merged: Ryan Lane; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/18671
[17:17:56] <^demon> I don't use git-review personally, but people seem to like it.
[17:17:59] Change merged: Ryan Lane; [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/18670
[17:18:01] * andrewbogott back soon
[17:18:04] <^demon> I'd love to kill the .gitreview file entirely.
[17:18:11] Change merged: Ryan Lane; [operations/debs/udp2log-log4j-java] (master) - https://gerrit.wikimedia.org/r/18669
[17:18:21] Change merged: Ryan Lane; [operations/debs/testswarm] (master) - https://gerrit.wikimedia.org/r/18668
[17:18:25] yeah, I completely disagree with this useless replication of info across our repos
[17:18:29] Change merged: Ryan Lane; [operations/debs/search-qa] (master) - https://gerrit.wikimedia.org/r/18667
[17:18:29] but I guess it's too late now.
[17:18:36] replication of info?
[17:18:42] I guess some of it is
[17:18:55] Change merged: Ryan Lane; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/18666
[17:18:56] <^demon> Ideally I want to hook into repo creation.
[17:19:01] <^demon> So this is done at repo & branch creation time.
[17:19:05] Change merged: Ryan Lane; [operations/debs/puppet] (master) - https://gerrit.wikimedia.org/r/18665
[17:19:13] <^demon> Or better, fix git-review to not need it at all ;-)
[17:19:13] paravoid: you're way too much of a purist ;)
[17:19:16] <^demon> Which is probably easier.
[17:19:31] I also find hardcoding the "central" server in the repo to be very unDVCS-like
[17:19:34] Change merged: Ryan Lane; [operations/debs/nodejs] (master) - https://gerrit.wikimedia.org/r/18664
[17:19:45] it's only hardcoded in for git review
[17:19:53] and you can override that on the commandline
[17:19:54] <^demon> paravoid: Again, fix git-review to operate off of real remotes rather than needing a silly file.
[17:20:06] Change merged: Ryan Lane; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/18663
[17:20:07] how about not using git-review if it's broken?
[17:20:08] <^demon> This doesn't hurt anyone who's not using git-review anyway
[17:20:20] ^demon: I think git-review may actually have real remote support now, but if you're using a github mirror you need that file
[17:20:21] Change merged: Ryan Lane; [operations/debs/ircd-ratbox] (master) - https://gerrit.wikimedia.org/r/18662
[17:20:36] That's why the .gitreview file exists, because git-review was designed for the OpenStack use case where the repo is mirrored to github
[17:20:38] how about the people that like it fix it rather than me? :)
[17:20:50] Change merged: Ryan Lane; [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/18661
[17:21:01] Change merged: Ryan Lane; [operations/debs] (master) - https://gerrit.wikimedia.org/r/18660
[17:21:08] ^demon: done
[17:22:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19503
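[Editor's note: since git-review comes up so often above, it is, at heart, sugar over a push to Gerrit's magic ref, which is why ^demon argues it could work from real remotes without a .gitreview file. A sketch in plain git; the remote name and the standard Gerrit SSH port 29418 are assumptions, not pulled from the cluster config:]

```
# One-time setup: add Gerrit itself as an ordinary remote
git remote add gerrit ssh://USER@gerrit.wikimedia.org:29418/operations/puppet.git
# Submit the current branch for review against the production branch
git push gerrit HEAD:refs/for/production
```

Pushing a branch of N commits this way creates N stacked changes, one per commit, which is the "long, dependent patchset" Roan describes above.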
[17:22:23] as for being a purist, I'm sorry but I just can't review a huge megapatch that has a comment "rework manifests"
[17:22:28] call me stupid if you want
[17:22:59] * Ryan_Lane shrugs
[17:23:16] I tried. it's completely unreadable.
[17:23:33] I don't understand what half of the changes do, because there's no commit message to explain them
[17:26:48] Ryan_Lane: Stacked changes in Gerrit aren't that terrible (as long as you know git well enough to rebase things right), and I don't buy your argument for how they slow down the review process
[17:27:11] this change wouldn't make *sense* as a bunch of changes
[17:27:19] New review: Demon; ".gitreview already handled in I23c4218d, no need for it here." [operations/deployment] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/8732
[17:27:28] unless people want to review obviously broken manifests
[17:27:37] and then, what's the fucking point?
[17:27:43] so that it's in smaller chunks?
[17:27:58] it's the same amount of code that needs to be reviewed
[17:28:50] maybe I could have put the ldap changes into a different change, but it would have been pretty difficult
[17:28:55] hmmm, mark, uh oh
[17:28:56] Unable to install GRUB in /dev/sda
[17:28:56] Executing 'grub-install /dev/sda' failed.
[17:29:07] i made / LVM ext4 and set it as root
[17:29:24] this is the problem with spaghetti code
[17:29:29] True
[17:29:48] the only reason my OSM change was huge the other day is because I changed out the API
[17:30:00] RECOVERY - Puppet freshness on spence is OK: puppet ran at Tue Aug 14 17:29:51 UTC 2012
[17:30:05] difficult to change the API without changing a bunch of the rest of the system
[17:30:34] if I would have kept EC2 api support I could have done it iteratively
[17:30:48] * RoanKattouw thinks back to that change and shudders
[17:30:52] That was hours of review
[17:30:52] heh
[17:30:57] yeah. it was a large change
[17:31:01] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[17:31:04] don't expect another one like that
[17:31:36] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19501
[17:31:50] I won't be changing out the api again, so most changes will be small
[17:31:59] paravoid: is the commit message better now?
[17:31:59] Hmm
[17:32:00] root@ms7 # du -sh /root/badownershipfiles
[17:32:02] 174M /root/badownershipfiles
[17:32:04] root@ms7 # wc -l /root/badownershipfiles
[17:32:05] 1858813 /root/badownershipfiles
[17:32:12] That's a lot of files with bad perms
[17:33:13] !log Yesterday's find on ms7 is done, found 1.8 million files with bad ownership. Will run a batch chown on them soon
[17:33:23] Logged the message, Mr. Obvious
[17:33:27] ok, one thing that makes this change crappy is the renames
[17:33:28] AaronSchulz: ---^^
[17:35:08] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[17:35:37] RoanKattouw: find . -uid .. | xargs chown .. should be faster than find . -uid .. -exec chmod ...
[17:35:47] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19501
[17:35:52] New review: Dereckson; "-shellpolicy ; stewards comments led to know there is no active community and change is trivial." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/13427
[17:35:56] mutante: I'm doing that, indirectly
[17:36:09] I did a find and dumped it into a file. The find took 6-8 hours to run
[17:36:23] Next, I'll pipe the file into xargs chown or similar
[17:36:32] RECOVERY - Host ms10 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[17:36:32] RECOVERY - Host ms10 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[17:37:40] pooper scooopers
[17:38:17] RoanKattouw: yep yep, just saying -exec would just do one file at a time
[17:38:22] RobH, have you had trouble with installing GRUB when partitioning manually?
[17:38:31] notpeter: I'm still getting remote host identification errors for srv194 and srv281, FYI
[17:38:39] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[17:38:43] ottomata: nope, not in a hardware raid like that
[17:38:52] mutante: Yeah, I'll have to find some compromise there. xargs-ing in 1.8 million arguments seems excessive
[17:39:04] THE failing step is: Install the GRUB boot loader on a hard disk
[17:39:08] hm
[17:39:15] RoanKattouw: yeah. working on fixin' it. would you rather I remove those from the dsh group until they're properly up?
[17:39:17] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19501
[17:39:22] i thought maayybe it didn't like / as LVM, so I put it on a physical partition
[17:39:24] but same deal
[17:39:40] RoanKattouw: "How long can the argument list to xargs be? It depends on the system, but xargs --show-limits will tell us."
[17:39:43] notpeter: I don't mind the noise, just making sure you're aware
[17:39:53] RoanKattouw: cool, thanks :)
[17:40:20] mutante: haha, 2M characters
[17:41:57] RoanKattouw: heh, i dunno, but 'Size of command buffer we are actually using: 131072'
[17:42:22] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[17:43:00] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19501
[17:44:02] cmjohnson1, have you seen that before?
[17:44:08] fail installing GRUB on hw raid?
[17:46:24] I really need to get a local lint checker
[17:46:38] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[17:47:17] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19501
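[Editor's note: the "local lint checker" Ryan wishes for above is cheap to approximate before pushing. A sketch; what gerrit2's gate actually runs is not stated in the log, so treat these two tools, the stock parser check and the community puppet-lint gem, as stand-ins:]

```
# Catch syntax errors locally before gerrit2 does (puppet 2.7-era command):
puppet parser validate manifests/site.pp
# Style/lint pass, assuming the puppet-lint gem is installed (gem install puppet-lint):
puppet-lint manifests/role/statistics.pp
```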
[18:01:25] why not? easy to change size of swap then [18:01:31] because maybe it's worth doing that, you can [18:01:39] are you planning on doing a lot of swapping? [18:01:39] hm, i'm at the installing grub step, what just start over? [18:01:52] no i dunno, just a habit, and stat1 is set up that way too [18:01:56] actually stat1 has / on LVM too [18:02:21] well you can keep the partition table next time through, just make sure you set all the file and usage types and mount points and the boot flag again [18:02:38] *shrug* might be worth a try [18:02:46] hm, ok [18:06:05] New patchset: Dereckson; "(bug 38036) Set si.wikipedia and si.wiktionary favicons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/13427 [18:06:45] New review: Dereckson; "Patch rebased to master." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/13427 [18:14:15] !log stopping puppet on brewster for partman experimentation [18:14:24] Logged the message, notpeter [18:38:46] yay apergos! good advice! [18:38:48] it all worked this time [18:38:52] sweeeeet [18:38:58] now i've got a prompt on the console [18:39:02] do I need a root pw? [18:39:13] puppet hasn't put my users key there yet, I'm sure, so how do I log in? [18:39:19] um [18:39:27] do ou have root on the cluster? [18:39:48] do you have root on the cluster? [18:40:43] ordinarily one sshes over to sockpuppet with key forwarding, and does puppet-related things including first access, which uses an ssh key available from there [18:41:01] * RoanKattouw fumes [18:41:07] but if you do not have root on the cluster this is not going to work for you, and someone else would I guess have to take over from there [18:41:08] Why is Solaris such a piece of garbage [18:41:11] hahaha [18:41:19] Ariel, you've felt my pain [18:41:29] no, you have felt mine [18:41:30] Is there something like top on ms7? [18:41:31] finally :-D [18:41:35] That is true [18:41:39] er [18:41:51] oh how I remember *nothing* about solaris now... [18:42:26] ottomata: stat1001? you want initial puppet? [18:42:40] can't you just ps ef and look for whatever ? :-P :-D [18:42:55] ohhhhhhhh [18:42:57] right of course [18:43:07] hm [18:43:18] Oh, ps ef looks sort of useful [18:43:20] didn't realize my ssh key would be authorized on a new install [18:44:03] ottomata: it needs requesting and signing certificates between puppet and puppetmaster, i can do that for you [18:44:11] so build a new server talks about the puppet piece [18:44:14] yeah [18:44:20] naw i can do it i thikn [18:44:22] i want to learn [18:44:23] go read that little paragraph and you'll see what we're talking about [18:44:30] i've set up puppet before [18:44:30] so ja [18:44:36] buuuut, i need to log into the machine first [18:44:39] that's the part i'm stuck on [18:44:45] uh huh, it describes how that happens [18:44:45] trying to ssh root@stat1001.wikimedia.org [18:44:48] Build_a_new_server#Get_puppet_running [18:44:56] ottomata: <-- wikitech page [18:45:02] OHH [18:45:04] ok okok [18:45:22] rtfm ok ok ok ok [18:45:25] reading :) [18:45:26] ottomata: [sdb] Attached SCSI disk [18:45:26] if you don't have the creds to do that, someone will have to step in for that bit. [18:45:42] are the disks in the analytics boxes scsi? I'm confuseled [18:45:48] ottomata: also involves repeating stuff that fails on first attempt but is normal :) [18:45:51] Aaaaalright [18:45:53] Here goes [18:45:57] nice! [18:45:58] RoanKattouw: prstat? [18:46:11] notpeter, i unno! 
[18:46:17] mutante: ZOMG thanks man
[18:46:18] mutante, apergos, thanks! i should have creds
[18:46:26] ottomata: kk
[18:46:40] is this the chmod?
[18:46:44] ( roan)
[18:49:17] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19509
[18:49:34] apergos: Yeah about to run that. 1.8M files
[18:49:43] it should be just fine
[18:50:59] Running them through chown in batches of 1000
[18:51:10] New patchset: Ottomata; "Adding stat1001 to site.pp. Reorganizing role::statistics." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19512
[18:51:34] I'm not actually quite sure whether the overhead of running a chown process for each file is worse than the overhead of doing the string concatenation in bash, but whatever
[18:51:51] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19512
[18:51:59] !log Running /root/fixownership < /root/badownershipfiles in a screen on ms7
[18:52:08] Logged the message, Mr. Obvious
[18:52:24] neither of them should have an impact
[18:53:08] 10 pm and I did not really eat. uh oh
[18:53:57] New patchset: Ottomata; "Adding stat1001 to site.pp. Reorganizing role::statistics." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19512
[18:54:20] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100%
[18:54:21] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100%
[18:54:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19512
[18:56:35] Maybe someone wants to merge https://gerrit.wikimedia.org/r/#/c/19399/ please, I want to get that finished ;)
[18:57:19] mutante, could you approve https://gerrit.wikimedia.org/r/19512
[18:57:19] for me?
[18:57:23] i'll run puppet after that
[18:57:30] (i can merge on sockpuppet)
[18:58:08] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19399
[18:58:41] thanks, RoanKattouw :)
[18:59:53] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms
[18:59:54] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms
[19:00:56] ottomata: eh, i think it should be the other way around. move stuff out of site.pp into role classes, that was the point
[19:02:06] yeah, but those things weren't relevant for both stat1001 and stat1
[19:02:12] if they are i'll put them back into the role
[19:03:17] do you want 2 separate roles ?
[19:03:27] if they are not all the same
[19:04:20] i dunno, shouldn't a role only be for something that will have more than one place?
[19:04:57] ezachte only needs one mediawiki checkout to compute code string translations stats, etc.
[19:04:59] i think roles are also preferred over site.pp if there is just a single server that uses it
[19:05:07] and there might always be more in the future?
[19:05:15] even if there is only a single server?
[19:05:34] i unno, seems really redundant to me, no? that's kinda like
[19:05:36] node serverA { include role::serverA }
[19:05:47] i could imagine it doesn't stay like that
[19:06:02] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[19:06:03] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
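[Editor's note: Roan's batches-of-1000 approach above can be had for free from xargs, which also settles his process-overhead worry: one chown per thousand paths, no per-file fork and no giant string in bash. A sketch, assuming one path per line with no whitespace in the names; the target owner is a placeholder, since the log never states it:]

```
# apache:apache is a stand-in for whatever owner the files should have
xargs -n 1000 chown apache:apache < /root/badownershipfiles
# the command-buffer ceiling mutante quoted earlier comes from:
xargs --show-limits < /dev/null
```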
[19:06:25] so will stat1 stay different from stat1001 forever?
[19:06:44] any of those things that do end up being on both, i'd put back in role::statistics
[19:06:47] but in general, yes
[19:06:54] stat1001 will be more of a public facing web site host for stats stuff
[19:07:01] stat1 will remain general number crunching machine
[19:07:12] i think you want another one then, called statistics-cruncher or whatever fits
[19:07:40] so i want:
[19:07:57] role::statistics, role::statistics-cruncher, role::statistics-www (or whatever)
[19:07:57] then
[19:08:02] yeah, put all things they have in common in statistics and additional stuff in these
[19:08:25] stat1 { include role::statistics, role::cruncher } stat1001 { include role::statistics, role::cruncher }
[19:08:35] or maybe role::statistics::cruncher and role::statistics::www
[19:08:40] sorry last bit :www
[19:08:49] aye FIIIIIIIIIIINe :p
[19:08:57] can I put them both in roles/statistics.pp
[19:08:57] ?
[19:09:03] yea
[19:09:23] just did not want to add back multiple classes in site.pp
[19:09:42] how about this!
[19:09:45] role::statistics::cruncher inherits role::statistics
[19:10:05] yea, that sounds good
[19:15:18] New patchset: Ottomata; "Adding stat1001 to site.pp. Reorganizing role::statistics into ::cruncher and ::www." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19512
[19:15:35] mutante^
[19:15:57] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19512
[19:21:08] bwah
[19:21:11] PROBLEM - NTP on ms-be1003 is CRITICAL: NTP CRITICAL: No response from NTP server
[19:21:12] PROBLEM - NTP on ms-be1003 is CRITICAL: NTP CRITICAL: No response from NTP server
[19:22:58] New patchset: Ottomata; "Adding stat1001 to site.pp. Reorganizing role::statistics into ::cruncher and ::www." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19512
[19:23:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19512
[19:24:00] ok, mutante^
[19:26:36] Logged the message, Master
[19:28:59] New patchset: Pyoungmeister; "vumi: correcting password lookup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19520
[19:29:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19520
[19:29:46] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19520
[19:30:11] PROBLEM - Lucene disk space on search23 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:30:12] PROBLEM - Lucene disk space on search23 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
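[Editor's note: the role shape agreed on above, written out the way it might land in the roles/statistics.pp file the discussion names. A sketch only, with placeholder class bodies; misc::statistics::base is the shared class the log mentions a bit further down:]

```
# Sketch of the reorganized role file (class bodies are illustrative):
cat > roles/statistics.pp <<'EOF'
class role::statistics {
    include misc::statistics::base    # everything stat1 and stat1001 share
}
class role::statistics::cruncher inherits role::statistics {
    # stat1: general number-crunching extras
}
class role::statistics::www inherits role::statistics {
    # stat1001: public-facing stats sites (reportcard, stats.wikimedia.org)
}
EOF
# site.pp then shrinks to one include per node, e.g.:
#   node "stat1.wikimedia.org" { include role::statistics::cruncher }
```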
[19:30:38] PROBLEM - SSH on search23 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:30:39] PROBLEM - SSH on search23 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:36:02] PROBLEM - Host search23 is DOWN: PING CRITICAL - Packet loss = 100%
[19:36:03] PROBLEM - Host search23 is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:50] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[19:37:51] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[19:37:59] RECOVERY - SSH on search23 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[19:38:00] RECOVERY - SSH on search23 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[19:38:08] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[19:38:09] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[19:39:03] RECOVERY - Lucene disk space on search23 is OK: DISK OK
[19:39:03] RECOVERY - Lucene disk space on search23 is OK: DISK OK
[19:40:50] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[19:40:50] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[19:40:50] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[19:40:51] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[19:40:51] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[19:40:51] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[19:41:24] New patchset: Pyoungmeister; "vumi: include password class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19590
[19:42:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19590
[19:43:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19590
[19:44:10] It looks like we're going to overshoot our deployment window, but no one else is scheduled until 3. Anyone have a problem with that?
[19:45:32] notpeter, can you approve this for me?
[19:45:33] https://gerrit.wikimedia.org/r/#/c/19512/
[19:46:12] ja
[19:46:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19512
[19:46:25] done
[19:47:55] danke
[19:53:15] woooo puppet running on stat1001
[19:58:48] ottomata: re, i was out for food. i see its merged already. alright, nice
[19:59:11] yup! got one more coming in, gotta remove something not common to both from statistics::base
[19:59:53] New patchset: Ottomata; "Removing dataset2 mount from misc::statistics::base" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19640
[20:00:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19640
[20:03:43] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19640
[20:04:09] danke
[20:04:22] de rien
[20:06:12] yeehaawwwwwwwwww
[20:06:14] it works!
[20:06:18] my first successful install!
[20:08:14] !log installed precise and puppetized stat1001
[20:08:22] Logged the message, Master
[20:09:20] ok, those are in the ES cluster
[20:09:28] so they cannot share a rack with anything more than 1 other ES server
[20:09:40] no more than two ES servers in a rack, so lemme see
[20:10:07] yea, that looks fine to me, if you dont have room in sdtpa
[20:10:38] ahh, all the ones with space in sdtpa seem to have ES servers already ;]
[20:10:48] some people are reporting loading issues, especially with watchlists in wikimedia-tech
[20:11:13] I'm going to hold off on finishing my deployment until someone can confirm nothing is broken right now on en.wiki
[20:11:19] I haven't actually deployed to en.wiki yet
[20:11:33] ironically, my deployment affects watchlists
[20:11:37] cmjohnson1: So yea, c2-pmtpa looks good
[20:11:45] since there are already 2u servers in it and all
[20:12:57] looks like something caused a spike: https://gdash.wikimedia.org/dashboards/article/
[20:16:32] watchmouse is complaining about irc.wm.o fwiw
[20:17:43] I can't see any problems myself, guess I'll go ahead and proceed...
[20:20:37] Logged the message, Master
[20:27:58] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[20:28:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501
[20:31:59] RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Tue Aug 14 20:31:37 UTC 2012
[20:31:59] RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Tue Aug 14 20:31:37 UTC 2012
[20:34:23] RECOVERY - swift-object-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[20:34:23] RECOVERY - swift-object-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[20:34:41] RECOVERY - swift-container-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[20:34:41] RECOVERY - swift-container-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[20:34:50] RECOVERY - swift-object-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[20:34:50] RECOVERY - swift-object-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[20:34:59] RECOVERY - swift-object-auditor on ms-be1003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[20:35:00] RECOVERY - swift-object-auditor on ms-be1003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[20:35:08] RECOVERY - swift-container-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:35:08] RECOVERY - swift-account-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[20:35:08] RECOVERY - swift-container-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:35:08] RECOVERY - swift-account-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[20:35:17] RECOVERY - swift-container-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[20:35:17] RECOVERY - swift-object-server on ms-be1003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[20:35:17] RECOVERY - swift-container-server on ms-be1003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[20:35:17] RECOVERY - swift-container-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[20:35:17] RECOVERY - swift-object-server on ms-be1003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[20:35:17] RECOVERY - swift-container-server on ms-be1003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[20:35:26] RECOVERY - swift-account-reaper on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[20:35:26] RECOVERY - swift-account-server on ms-be1003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[20:35:26] RECOVERY - swift-account-reaper on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[20:35:26] RECOVERY - swift-account-server on ms-be1003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[20:35:29] o.0 is it just me or is there 2 nagios bots?
[20:35:38] s/is there/are there/
[20:36:03] Damianz: There are for some time now
[20:36:17] Hmm
[20:36:18] maybe the backup one would shut up, if devoiced
[20:36:24] * Damianz stabs nagios-wm_ with her own underscore
[20:36:54] just kick it
[20:37:22] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused
[20:37:22] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused
[20:37:49] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused
[20:37:49] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused
[20:38:12] !log killed duplicate nagios-wm
[20:38:20] Logged the message, Master
[20:38:28] :D
[20:38:29] thats cause puppet runs on spence were temp. broken and now work again and it was started manually before
[20:38:43] RECOVERY - swift-account-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[20:38:53] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.003 second response time on port 389
[20:39:04] mark: around?
[20:39:10] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.001 second response time on port 636
[20:39:10] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 35.28 ms
[20:40:49] time for scap
[20:46:31] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[20:47:16] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[20:47:25] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 35.31 ms
[20:51:46] RECOVERY - NTP on ms-be1003 is OK: NTP OK: Offset -0.01493740082 secs
[20:59:25] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[21:09:37] PROBLEM - NTP on ms-be1005 is CRITICAL: NTP CRITICAL: Offset unknown
[21:11:08] RECOVERY - NTP on ms-be1005 is OK: NTP OK: Offset -0.01434135437 secs
[21:11:08] AaronSchulz: FYI I started running those chowns around noon, it's a large number of files (1.85 million) so it'll probably finish around midnight
[21:11:54] New patchset: Dzahn; "recommissioning zirconium (RT-3401)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19645
[21:12:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19645
[21:14:07] k :)
[21:14:52] RECOVERY - swift-account-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[21:15:01] RECOVERY - swift-account-reaper on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[21:15:10] RECOVERY - swift-object-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[21:15:19] RECOVERY - swift-container-auditor on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[21:15:28] RECOVERY - swift-container-server on ms-be1005 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[21:15:29] RECOVERY - swift-container-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[21:15:29] RECOVERY - swift-account-auditor on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[21:15:37] RECOVERY - swift-object-updater on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[21:15:46] RECOVERY - swift-object-auditor on ms-be1005 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[21:15:47] RECOVERY - swift-object-server on ms-be1005 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[21:15:47] RECOVERY - swift-account-server on ms-be1005 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[21:15:55] RECOVERY - swift-container-updater on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[21:52:27] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[21:52:28] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[22:16:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[22:16:28] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[22:19:56] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[22:20:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501
[22:23:21] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:21] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:21] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:21] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:21] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:22] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:22] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:22] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:22] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:22] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:22] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:22] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:23] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:23] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:23] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:24] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100%
[22:24:24] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 35.49 ms
[22:24:24] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 35.48 ms
[22:24:24] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 35.60 ms
[22:24:24] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 35.56 ms
[22:24:24] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 35.32 ms
[22:24:25] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms
[22:24:25] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 35.44 ms
[22:24:25] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 35.49 ms
[22:24:25] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 35.48 ms
[22:24:25] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 35.60 ms
[22:24:25] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 35.56 ms
[22:24:25] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 35.32 ms
[22:24:26] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms
[22:24:26] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 35.44 ms
[22:24:33] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 35.35 ms
[22:24:34] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 35.35 ms
[22:51:58] New patchset: Bhartshorne; "adding swift-drive-audit cronjob to monitor failed disks on swift storage nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19653
[22:52:38] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19653
[22:52:41] anybody want to give me a syntax / puppet check? paravoid maybe? ^^^
[22:52:53] omph. didn't pass lint cehc.
[22:52:58] check. that's a good place to start.
[22:54:12] New patchset: Bhartshorne; "adding swift-drive-audit cronjob to monitor failed disks on swift storage nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19653
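(The change being iterated on here wraps swift-drive-audit, a stock Swift utility that scans kernel logs for failing drives and unmounts them, in a cron job. A minimal sketch of such a cron resource follows; the schedule and config path are assumptions, not the contents of change 19653.)

    # Hypothetical sketch, not the merged change 19653; interval and paths assumed.
    cron { 'swift-drive-audit':
      ensure  => present,
      user    => 'root',
      minute  => '*/30',
      command => '/usr/bin/swift-drive-audit /etc/swift/drive-audit.conf',
    }

Running the audit from cron rather than as a daemon keeps the failure mode simple: if one run dies, the next tick retries with no state to recover.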
[22:54:50] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19653
[22:56:58] New patchset: Bhartshorne; "adding swift-drive-audit cronjob to monitor failed disks on swift storage nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19653
[22:57:35] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/19653
[22:58:54] New patchset: Bhartshorne; "adding swift-drive-audit cronjob to monitor failed disks on swift storage nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19653
[22:59:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19653
[23:00:28] New patchset: Bhartshorne; "adding swift-drive-audit cronjob to monitor failed disks on swift storage nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19653
[23:01:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19653
[23:01:17] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19653
[23:03:41] New patchset: Bhartshorne; "removing dependency that's already filled by the dependency on the storage service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19656
[23:04:22] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19656
[23:07:50] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[23:08:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501
[23:11:16] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[23:11:17] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[23:12:18] New patchset: Bhartshorne; "updating the swift-drive-audit version with local enhancements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19686
[23:12:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19686
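(Carrying a locally enhanced copy of an upstream script, as change 19686 describes, is typically done by shipping the file from the puppet repo over or alongside the packaged one. A sketch under assumed paths; neither the source location nor the install path is taken from the actual change.)

    # Hypothetical sketch; source and destination paths are assumptions.
    file { '/usr/local/bin/swift-drive-audit':
      ensure => file,
      owner  => 'root',
      group  => 'root',
      mode   => '0755',
      source => 'puppet:///files/swift/swift-drive-audit',
    }

Installing under /usr/local/bin leaves the distribution's copy intact, so the local patch can be dropped once the enhancements land upstream; a cron job like the one sketched earlier would then point its command at this copy.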
[23:13:06] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19686
[23:25:20] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[23:25:21] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[23:37:29] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100%
[23:37:29] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100%
[23:38:50] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[23:38:50] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[23:40:29] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[23:40:29] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[23:51:40] New patchset: Catrope; "Point VE to the Parsoid on wtp1 rather than cadmium" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19689
[23:52:26] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19689