[00:03:10] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.36 ms [00:06:19] PROBLEM - SSH on virt1001 is CRITICAL: Connection refused [00:27:37] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:22:41] PROBLEM - Host mw1006 is DOWN: PING CRITICAL - Packet loss = 100% [01:48:11] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [01:48:47] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 613s [01:51:02] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [01:51:56] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:54:47] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 13s [01:55:05] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 1 seconds [02:48:00] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.43 ms [02:56:51] RECOVERY - Puppet freshness on srv193 is OK: puppet ran at Wed Jul 18 02:56:41 UTC 2012 [03:12:45] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [05:09:43] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [05:18:43] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [05:27:54] good morning [05:42:23] morning [05:42:34] you're on line early [06:18:08] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [07:31:37] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [07:58:40] !log gallium: added Firefox 14 to Testswarm, disabled Firefox 13. [07:58:49] Logged the message, Master [08:00:09] what's testswarm? [08:02:39] paravoid: a PHP application to distribute Javascript unit tests to a swarm of web browsers [08:03:00] paravoid: it gives us reports such as http://integration.mediawiki.org/testswarm/user/MediaWiki/ [08:03:22] ah, cool [08:04:35] the next version is on labs at http://integration.wmflabs.org/testswarm/ [08:05:45] anyway that is fully managed by Timo and I :-D [08:06:05] paravoid: will you be available today to review the changes I made in puppet for beta ? [08:07:30] yes [08:07:37] I am now too if you want [08:11:28] paravoid: so there is https://gerrit.wikimedia.org/r/#/c/15545/ [08:11:33] a bit long change [08:12:03] the idea is to get rid of the deployment-prep NFS instance which just export some Glusterfs filesystem [08:12:19] so the idea is to directly mount /data/project/somepath on the instance [08:12:55] has this been tested? [08:13:05] also, I'm not sure if gluster works well enough [08:13:08] hate gluster hate [08:13:49] yeah a recurring rant :-] [08:14:04] has this been tested with ::self? 
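
For context on the change discussed above (Gerrit 15545): the goal is to drop the deployment-prep NFS instance that only re-exports a GlusterFS volume and instead mount the volume directly on each instance, which is also why the later patchsets amend the fstype from nfs to glusterfs. A minimal sketch of the difference, with a hypothetical server and volume name standing in for the real ones in the puppet change:

    # old indirection: a dedicated NFS instance re-exports the Gluster volume
    mount -t nfs nfs-instance:/export/project /data/project

    # direct mount with the GlusterFS native (FUSE) client; note fstype glusterfs, not nfs
    mount -t glusterfs gluster-server:/project /data/project
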
[08:14:11] not at all [08:14:14] maybe I should [08:14:38] anyway late at night yesterday I figured out I was mounting the gluster FS using type=nfs [08:14:43] so I need to amend the change I guess [08:15:10] however you prefer [08:15:19] it doesn't affect production, so I don't mind merging it as it is [08:15:28] but if it breaks, you get to keep the pieces :-) [08:15:35] well it will break [08:15:43] need to amend the type=nfs to the Gluster FS type [08:16:34] also it might break production since I have factored out common code from nfs::upload to the new nfs::common-upload [08:16:54] then the production class nfs::upload include the nfs::common-upload [08:23:27] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [08:24:07] New review: Hashar; "Patchset 3 change the wrong NFS fstype to glisters." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15545 [08:24:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545 [08:24:18] paravoid: ^ patchset 3 uses glusterfs [08:25:28] I need a new instance [08:39:29] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [08:58:16] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [08:58:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545 [08:58:55] New review: Hashar; "Patchset 4 change /usr/local/apache to use GlusterFS instead of the wrong NFS." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15545 [09:07:23] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [09:16:07] New patchset: Faidon; "stafford: add puppetmaster config for modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15869 [09:16:36] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:16:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15869 [09:16:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15777 [09:17:11] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15869 [09:20:30] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15778 [09:20:39] PROBLEM - Host mw1107 is DOWN: PING CRITICAL - Packet loss = 100% [09:22:47] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [09:23:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545 [09:25:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15779 [09:26:46] fuck you gerrit. [09:26:53] three commits, three merges [09:31:54] I'm very happy we're at least talking about ditching it [09:32:20] isn't it the default workflow ? [09:32:36] not that if you ever rebased your commit, the sha1 changed do that indeed trigger a new patchset [09:48:11] paravoid: so look like I can't mount a subdirectory in a GlusterFS volume :/ [09:48:47] paravoid: Can you tell me why we are using gerrit? [09:49:13] I'm not saying that we shouldn't.. 
I'm just wanting to learn. [09:49:45] r0csteady: cause openstack uses it [09:49:52] and fit the pre commit workflow we wanted to adapt [09:50:19] our previous system was svn + a homegrown code review tool which only allowed post commit review [09:50:28] that did not scale for Ops and MediaWiki core review [09:50:51] so we happily switched to Gerrit (used with success to develop the Android kernel/API and the OpenStack infra) [09:51:09] then people got confused and want to ditch it out for greener pastures [09:52:15] ahh I see [09:52:44] Does this mean that we are researching other options? Or no? [09:53:08] That makes sense. [09:53:22] paravoid: apparently we can't mount a sub directory of a volume. So I could use a bind mount instead to bind /data/project/subdir to some place I want. Does it make any sense ? [09:54:03] r0csteady: we had a post this night in wikitech-l http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/62461 [09:54:15] r0csteady: and see http://www.mediawiki.org/wiki/Git/Gerrit_evaluation [09:54:55] hashar: ewwww. [09:56:27] paravoid: would be something like: mount -t bind /data/project/upload6 /mnt/upload6 [09:58:23] why a symlink wouldn't work? [10:00:07] I find them being ugly [10:00:33] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [10:00:37] but that might end up being the pragmatic solution [10:01:41] uglier than bind mounts? :) [10:01:54] well ok hmm [10:02:00] lets do sym links [10:08:31] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [10:09:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545 [10:11:08] ideally you wouldn't have hardcoded paths :) [10:11:34] well I am not going to rewrite the whole puppet class :-] [10:13:31] ARGHHH [10:13:37] /mnt/upload6/upload6/ [10:20:43] New review: Hashar; "Patchset 6 : finally use symbolic links since Gluster does not let mount a volume subdirectory. Tes..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15545 [10:20:57] paravoid: so patchset 6 is using symlinks https://gerrit.wikimedia.org/r/#/c/15545/ [10:20:59] looks fine [10:22:46] I am out for lunch. [10:38:14] mark: around? [10:38:36] mark: I'm going to work on moving stuff to modules; do you want you or someone else to review them pre-merge? [10:38:43] or should I just merge them as they are? [10:38:57] note that I intend to reorganize/clean them up a bit too [11:22:26] New patchset: Faidon; "ssh: move to a module & reorganize/cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15874 [11:23:07] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15874 [11:47:21] PROBLEM - Host mw1083 is DOWN: PING CRITICAL - Packet loss = 100% [11:49:27] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [11:52:27] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [11:55:48] paravoid: i'm in the data center today [11:55:53] so I don't have time for reviews today [11:56:10] hi [11:56:15] it was more of a generic question [11:56:15] the ntp one is fine i think, I looked at it really briefly [11:56:19] if you do others I would like to review [11:56:27] do you want to review every piece beforehand or not [11:56:36] I did ssh too, see gerrit 15874 [11:56:37] well [11:56:54] I didn't merge that, I was waiting for a reply on that question :) [11:56:54] yeah, I think, since then we can brainstorm about how to handle it [11:57:02] i'm sure it's fine what you did, but it would be nice anyway [11:57:06] sure, no problem [11:57:52] I also sent a mail to ops@ to inform the rest of the team [12:00:19] it was more of a "do you want to be bothered" question, I always prefer to work with others :) [12:33:17] New patchset: Hashar; "clean role::cache::upload from labs stuff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15882 [12:33:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15882 [12:53:04] Logged the message, Master [13:55:47] New patchset: Andrew Bogott; "Another attempt to get partman to partition different drives in different ways." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15887 [13:56:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15887 [13:57:20] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15887 [14:25:06] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.45 ms [14:50:15] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:38] New patchset: Andrew Bogott; "Yet another" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15894 [15:02:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15894 [15:02:27] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15894 [15:11:15] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [15:13:12] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [15:14:06] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [15:20:06] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [15:23:15] andrewbogott: just glanced at your partman config - in case you're worried that the priorities are having an effect, they don't actually have to be between the min and max sizes. I started setting them all to 100 since the types of configs we build don't generally allow for variations in size. [15:24:22] maplebed: Thanks for looking. The main thing I'm confused about is what exactly sda1, sda2, sda3 refer to. [15:24:31] Is it clear what I'm trying to do? [15:24:37] sadly no. [15:24:38] :P [15:24:57] I want two of the drives in a single raid, the other drives sliced vertically into two more raids. [15:25:04] (does 'vertically' make sense here?) 
[15:26:13] I haven't tried setting up multiple raids, but I do have a suggestion. [15:26:37] is it ok to have all the actual disks partitioned the same way? [15:26:55] Not really. I think I know how to do that. [15:27:06] Getting partman to /not/ do that is exactly what I don't understand. [15:27:52] My goal is to have a volume that is explicitly segregated onto separate drives from the main storage volume. [15:28:14] have you tried repeating lines 14-25 (once for the two separated disks, and the second time for the rest of the disks)? [15:28:38] sorry, 14-28 [15:28:52] oh. yeah, that was your previous attempt. [15:28:53] hmm. [15:28:53] https://gerrit.wikimedia.org/r/#/c/15887/1/files/autoinstall/partman/virt-raid10-cisco-ceph.cfg [15:29:09] Well, that attempt resulted in a non-bootable system. So it might have been the right track but hard to tell. [15:29:21] ok, next suggestion: name them differently. [15:29:51] when I"ve interrupted the install sequence and looked at the actual files partman downloaded, one was named 'expert_recipe' [15:30:09] it might be that it's tripping up with using the same name for the two different recipes. [15:30:13] Ah, so, you mean revert back to my two-section approach but give each section a new name? [15:30:35] yeah - though it's also possible that recipe and expert_recipe actually have a specific meaning. [15:30:38] it's a shot in the dark. [15:30:45] so partman-auto/expert_recipe <- the second part is arbitrary? [15:30:47] but one thing you can do to get a little more insight into what's happening: [15:30:50] Oh, ok, I see. Worth trying! [15:30:57] let it do the install thing (and probably fail) [15:31:08] take the host out of netboot.cfg [15:31:17] and then reinstall - it'll drop you into the partitioning menu [15:31:26] back out of that and launch the shell, then you can look at the actual partition layout [15:31:37] and see what it did up close (and maybe see what it did right or wrong) [15:33:26] ok, that sounds reasonable. [15:33:40] DEbugging partman is such a slooooow process [15:33:46] yes. [15:34:02] how big are these disks? [15:34:14] 128 I think [15:36:37] k. your partition sizes look ok. [15:40:29] maplebed: I think the recipe name must be non-arbitrary. Because right now we have "d-i partman-auto/disk" and "d-i partman-auto/expert_recipe" right in a row [15:40:40] and the only thing distinguishing them is the /last_section [15:41:13] yeah, that fits. [15:41:17] ah well. [15:42:09] mark, paravoid, partman thoughts? [15:44:14] maplebed: I bet I need two different partman scripts, and that our current netboot.cfg only supports one [15:44:26] lol [15:45:00] I still say run the previous config you had and see what it did (the one with two copies of the disk list and partition recipe) [15:45:14] 'k [15:47:04] New patchset: Andrew Bogott; "Revert "Yet another"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15897 [15:47:40] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15897 [15:54:32] andrewbogott: regarding? [15:55:28] oh, another random idea - duplicate the disk list and recipe, but only have one raid section. [15:55:39] (now I feel like I'm just in the peanut gallery) [15:55:58] paravoid: on how to get multiple raid configs and multiple disk sets to apply to a single system. [15:56:14] what do you mean? [15:56:21] wait, I think I should read the scrollback :) [15:56:50] paravoid: I want disks a-f divided into two raid-10s. And disks g and h used in a separate, single raid-1. 
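
The layout described above, written out as the mdadm commands one would run by hand; whether partman can be coaxed into producing it automatically is the open question in the rest of this exchange. The device letters follow the description, while the partition numbering is an assumption:

    # two RAID-10 arrays striped across sda..sdf, one per partition
    mdadm --create /dev/md0 --level=10 --raid-devices=6 /dev/sd[a-f]1
    mdadm --create /dev/md1 --level=10 --raid-devices=6 /dev/sd[a-f]2

    # a separate RAID-1 on the remaining pair, kept off the main storage volume
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdg1 /dev/sdh1

This is effectively the "add that last raid by hand" fallback mentioned later in the afternoon: let partman build only what it can express and create the remaining array post-install.
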
[15:57:42] maplebed: Turns out partman won't even run my script as written… I must've missed that message last time. [15:58:09] paravoid: for example, https://gerrit.wikimedia.org/r/#/c/15858/ [15:58:43] maplebed: Hm, I guess it would help if I did the merge on sockpuppet. [15:58:51] +1 [16:12:59] hm, same results this time. [16:13:20] paravoid: any thoughts? Is it at least clear what I'm trying to do? [16:13:58] well, not clear enough :) [16:14:03] you want multiple md devices? [16:14:17] i.e. an md0 with sd[abcd]1 and an md1 with sd[abcd]2? [16:16:43] Um… that's what the existing partman does, I think? [16:17:06] I want three drives total. [16:17:37] sd[abcdef]1 and sd[abcdef]2 and sd[gh] [16:19:12] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [16:22:29] * andrewbogott will be back shortly [16:41:10] New patchset: Jalexander; "Add Mexico, Puerto Rico, US Virgin Islands and outlying islands" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15900 [16:58:36] paravoid: Sorry, I skipped out mid-conversation. Was I making sense, just then? [17:25:57] Logged the message, Master [17:32:18] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [17:34:57] New patchset: Lcarr; "temporarily blocking smtp out so that I can fix icinga without paging everyone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15907 [17:35:09] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [17:35:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15907 [17:37:32] Logged the message, Master [17:44:28] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [17:47:59] ok, I'll boot it and pray to some kinda flying spaghetti monster [17:48:03] thanks! [17:48:22] paravoid: You went to bed while I was offline, didn't you? Damn. [17:48:37] nope [17:48:44] I was on a call [17:50:52] hey Ryan_Lane [17:51:03] morning [17:51:10] awesome on the beginning of the modules, btw [17:52:27] cmjohnson1: any eta on search35? [17:52:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15907 [17:54:53] Ryan_Lane: so, I won't stay for too long, it's getting late [17:55:02] no worries [17:55:03] Ryan_Lane: but do you want to talk about the migrations a bit? [17:55:06] sure [17:55:08] so… [17:55:16] I did a rsync of the _base directory across [17:55:21] to virt6-8 [17:56:15] to migrate vms, we need to turn them off, rsync their files, update the database, then reboot them via nova [17:57:01] :/ [17:57:09] OK, so, once again… Ryan_Lane, paravoid, do either of you have reason to think that it's possible, with our current partman scheme, to partition different sets of disks in different ways? [17:57:26] andrewbogott: should be [17:57:30] paravoid: yeah, it sucks [17:57:36] Ryan_Lane: Any examples? [17:57:40] I hope that block migration is fixed in precisw [17:57:41] andrewbogott: oops, sorry :/ [17:57:42] *precise [17:57:54] andrewbogott: no examples, sorry :( [17:58:28] Ryan_Lane: It seems like you'd need two different partman scripts to apply two different partitionings.... [17:58:43] paravoid: No worries. [17:58:48] andrewbogott: I have no idea tbh [17:58:56] Just, frustrating partman is frustrating [17:59:00] yeah it is [17:59:01] really? 
it should be possible to do it via one file, I'd think [17:59:04] partman sucks [17:59:38] last time I was messing with it I was thinking of setting up a kvm locally as to be able to do tests without waiting for the Ciscos to boot [18:00:16] paravoid: so, the ceph people are saying to separate the journal from the data [18:00:19] which is a sane recommendation [18:00:20] I know [18:00:25] which is why we were looking at splitting the sets [18:00:26] and to put journal to ssds too [18:00:36] yeah [18:00:37] * Aaron|home heard "ceph" [18:00:40] also sane [18:00:52] Ryan_Lane: If you look at the history of virt-raid10-cisco-ceph.cfg you can see the different things I've tried. [18:01:46] Ryan_Lane: so, re: migrations, do you have a plan? [18:01:58] do them one at a time [18:01:59] :) [18:02:04] do it per project? or random VMs? [18:02:04] actually, that's a lie [18:02:11] I plan on doing it 3 at a time, if possible [18:02:25] likely best to start with less needed vms [18:02:39] also, did anyone look at the bug that was reported this morning? [18:02:47] which one? [18:03:36] instances can't be created [18:03:40] or rebooted [18:04:14] https://bugzilla.wikimedia.org/show_bug.cgi?id=38473 [18:05:13] bridge looks fine [18:05:19] devices look good [18:05:26] (this instance is on virt7) [18:05:30] hm, I wonder why I didn't get that mail [18:05:40] you may not be on the default cc list [18:05:59] dnsmasq is running [18:06:24] oh. was this one of the fucked up instances? [18:06:29] or a new one? [18:07:15] funny. cloud-init ran [18:07:38] the instance pings [18:07:45] it seems just the metadata service was having issues [18:08:32] based on the logs, the api service is returning data fine [18:08:55] at least as of: 2012-07-18 15:22:22,084 [18:09:28] New patchset: Andrew Bogott; "Now I'll settle for a complete run, even." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15943 [18:09:47] 169.254.169.254 is only bound on virt2 [18:09:52] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15943 [18:10:24] it isn't in the iptables rules on virt7, which is good [18:10:41] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15943 [18:10:50] I'm going to reboot 0000034b to see if it is working magically now [18:11:04] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [18:11:19] nope [18:11:21] wtf [18:12:06] I restarted the API service [18:12:18] timed out changed to reset by peer [18:12:23] so the instance is talking to the service [18:12:53] seems the api is timing out to the db [18:15:43] RECOVERY - Host search35 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [18:15:52] RECOVERY - Puppet freshness on search35 is OK: puppet ran at Wed Jul 18 18:15:41 UTC 2012 [18:16:21] it's funny. the metadata service seems to be returning 200's [18:16:27] no clue why the instance is saying it's timing out [18:17:23] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:17:46] hi Ryan_Lane :) [18:17:51] howdy [18:18:00] got a lab issue this afternoon [18:18:09] you might already be aware about it https://bugzilla.wikimedia.org/show_bug.cgi?id=38473 [18:18:23] yeah [18:18:23] cloud-init running: Wed, 18 Jul 2012 13:47:07 +0000. 
up 5.13 seconds waiting for metadata service at http://169.254.169.254/2009-04-04/meta-data/instance-id [18:18:26] already looking at it [18:18:49] that address i a bit odd, IIRC it is a local link address used when you don't have any IP configuration [18:19:09] that's handled magically [18:19:19] the odd thing is that the api is returning [18:19:25] but it's taking 9 seconds to do so [18:20:21] also the list of instances on lab console have been veryyyy slow all the day [18:20:29] both might have a common root cause [18:21:18] yeah. the api service is really slow [18:22:09] have you updated something recently? [18:22:12] no [18:22:55] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [18:23:35] New patchset: Lcarr; "telling iptables what smtp is" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15946 [18:24:12] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15946 [18:28:45] New patchset: Lcarr; "smtp is tcp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15948 [18:29:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15948 [18:35:51] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15948 [18:40:55] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [18:42:04] New patchset: Andrew Bogott; "..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15951 [18:42:40] New patchset: Lcarr; "reactivating icinga files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15952 [18:43:15] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15951 [18:43:16] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15952 [18:44:17] andrewbogott: good for me to merge that ? [18:44:31] LeslieCarr: sure, thanks. [18:45:35] LeslieCarr: Even better if you fix it :) [18:45:46] haha [18:46:14] :( [18:47:49] RECOVERY - Puppet freshness on neon is OK: puppet ran at Wed Jul 18 18:47:46 UTC 2012 [18:52:09] ignore the alerts [18:52:10] sorry [18:52:24] yay! I was just about to ask. [18:52:46] maplebed: ignore [18:52:54] k. tnx. [18:53:03] i did a iptables blocking port 25 but obviously it's too sneaky [18:53:08] lol [18:53:20] those are the best. [18:53:23] ok, bbl. [18:58:22] kk [19:00:07] PROBLEM - Host search35 is DOWN: PING CRITICAL - Packet loss = 100% [19:02:58] New patchset: Lcarr; "prevents icinga from overwriting config files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15955 [19:03:10] cmjohnson1: appears to be booting. thanks! [19:03:33] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15955 [19:03:39] yep yep [19:03:52] RECOVERY - Host search35 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [19:09:54] Logged the message, Master [19:12:48] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:09] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:27] RECOVERY - SSH on virt1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:17:36] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.49 ms [19:17:45] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:25:10] New patchset: Andrew Bogott; "..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/15957 [19:25:46] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15957 [19:30:24] 1.4.3 and swauth. [19:30:24] maplebed: ^^ [19:30:26] heh [19:30:33] soon to be 1.5.0 and swauth. [19:30:36] (but not yet) [19:30:49] binasher: ok, the enwiki flaggedimages & flaggedtemplates tables can be emptied out now [19:31:36] Aaron|home: excellent, thanks [19:32:45] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:14] j^: by 'vm' do you mean on labs or on some virtual enviornment on my laptop? [19:33:37] I've never set it up on a single host; only a cluster (of vms or real hosts). [19:34:05] nope, sorry. [19:34:24] I can add you to the project on labs if you want to poke at it though. [19:35:55] k. one sec. what's your labs username? [19:36:12] ok. [19:37:54] ok, you're in teh project. it's called 'swift'. one of the front end hosts (entry into the cluster from bastion) is swift-fe1. [19:38:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:18] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.41 ms [19:38:20] it was set up following http://wikitech.wikimedia.org/view/Swift/Setup_New_Swift_Cluster_%28labs%29 [19:38:20] dunno if you all have the background, but j^ is Jan Gerber, who is here in SF right now working on getting Timed Media Handler finished once and for all [19:38:29] tnx robla [19:39:13] good to know. [19:39:28] maplebed == Ben Hartshorne, who did all of the ops work (and a fair bit of dev work) on the Swift deployment [19:39:42] (that's for j^'s benefit :) [19:40:27] part of what j^ is doing is following up on my byte range paranoia [19:40:53] which, as it turns out, it looks like Squid (and probably Varnish) are likely to be a bigger problem than Swift [19:40:57] j^: feel free to ping me with qusetions anytime. sorry I"m not in sf to chat in person... [19:41:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.976 seconds [19:41:09] I think I'll be back before you leave though. [19:41:27] PROBLEM - SSH on virt1001 is CRITICAL: Connection refused [19:41:40] j^: for originals, yes. for thumbnails, there's one more step before it'll work (I think) [19:42:06] I'm working on that step with aaron atm. [19:43:28] ryan_lane: I don't see any evidence (either from my experience, or from existing examples, or from google) that it is possible to do separate things to separate drives via partman. So I'm not sure it's useful for me to keep making random changes in the syntax and hope that it will happen. Is there anyone in the whole world who understands partman well enough to tell me if this is possible? Otherwise I can just setup partman t [19:43:29] leave two drives untouched and then add that last raid by hand, maybe? [19:45:24] Logged the message, Master [19:52:58] andrewbogott: the approach I've taken with swift is that partman builds the OS drives (first two raid1) and doesn't do anything to the rest. puppet then formats and mounts all the rest. [19:54:10] maplebed: That sounds reasonable. Can you point me to the manifest that does the formatting? 
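
The approach maplebed describes (partman builds only the OS RAID-1 and leaves the rest alone, puppet formats and mounts everything else) comes down to something like the following per leftover drive. This is a sketch of the idea rather than the actual swift manifest; the filesystem, mount options and mount point are assumptions:

    # partition, format and mount a drive that partman deliberately left untouched
    parted -s /dev/sdc mklabel gpt mkpart primary 0% 100%
    mkfs.xfs -f /dev/sdc1
    mkdir -p /srv/swift-storage/sdc1
    mount -o noatime,nodiratime /dev/sdc1 /srv/swift-storage/sdc1

In puppet terms that is typically an exec guarded by an onlyif/unless check plus a mount resource, which is presumably the manifest being asked for here.
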
[20:01:51] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [20:14:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [20:29:36] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [20:30:51] PROBLEM - NTP on srv278 is CRITICAL: NTP CRITICAL: Offset unknown [20:33:15] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [20:34:36] RECOVERY - NTP on srv278 is OK: NTP OK: Offset -0.03993880749 secs [20:38:41] New review: Platonides; "A .svg won't work very well. Try using http://upload.wikimedia.org/wikipedia/commons/thumb/d/d1/Wiki..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/15807 [20:41:07] New patchset: Andrew Bogott; "Give up for now and just leave two drives untouched." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15961 [20:41:43] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15961 [20:43:36] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [20:53:45] New patchset: Aaron Schulz; "Added multiwrite backend config - not yet used." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15963 [20:55:38] New patchset: Andrew Bogott; "May as well call this 'srv' instead of 'stuff' now that I'm not debugging anymore." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15964 [20:56:14] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15964 [20:57:49] New patchset: Aaron Schulz; "Added multiwrite backend config - not yet used." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15963 [20:58:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:02:30] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:08:03] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [21:08:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.498 seconds [21:13:28] New patchset: Cmjohnson; "adjusting netboot.cfg for db63-db77" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15966 [21:14:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15966 [21:14:50] sure [21:15:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15966 [21:15:16] wow, lots of db machines now :) [21:21:38] hi guys, [21:21:49] i'm building new .debs for nodejs and npm [21:21:57] johnduhart has already done the hard work here [21:21:57] http://svn.mediawiki.org/viewvc/mediawiki/trunk/debs/nodejs/ [21:22:30] I need to commit a couple of small things (changelog, control) [21:22:31] well awesome now I can't even make partman do the simple 6-disk raid. 8 yes, 6 no, wtf [21:22:53] I don't have commit permissions to that part of the svn repo [21:22:55] New patchset: Alex Monk; "Fix a logo screw up I made in change 15730." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15807 [21:23:05] should I move nodejs and npm over to git/gerrit? 
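
On the nodejs/npm packaging work mentioned just above: once the changelog and control fixes are committed, the build-and-publish cycle is the usual Debian one. A rough sketch, assuming the target suite is the precise-wikimedia one that shows up in the !log a few minutes later; the repository base directory is a placeholder, not taken from the log:

    # build unsigned binary packages from the packaging tree
    cd nodejs/
    dpkg-buildpackage -us -uc -b

    # publish the result into the apt repository with reprepro
    reprepro -b /srv/wikimedia include precise-wikimedia ../nodejs_*_amd64.changes
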
[21:28:39] !log adding php5 packages to precise-wikimedia repo [21:28:46] Logged the message, notpeter [21:31:18] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:51] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 36.01 ms [21:39:37] !log removing mw1 from apaches pool to do precise test install [21:39:44] Logged the message, notpeter [21:41:12] PROBLEM - SSH on virt1002 is CRITICAL: Connection refused [21:41:34] !log scratch that. removing srv289 from apaches pool for precise testing, not mw1 [21:41:42] Logged the message, notpeter [21:41:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:21] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [21:52:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.284 seconds [21:53:21] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [22:02:27] !log ok, changed my mind one more time. removing srv194 from apaches pool for precise testing [22:02:35] Logged the message, notpeter [22:06:02] :D [22:08:56] New patchset: Pyoungmeister; "giving srv194 an apache role class for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15975 [22:09:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15975 [22:25:36] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:26:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:31:09] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 36.33 ms [22:38:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [22:42:52] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused [22:44:59] New patchset: Pyoungmeister; "give srv194 the mw.cfg for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15978 [22:45:35] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15975 [22:45:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15978 [22:45:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15978 [22:51:07] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:56:40] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 35.47 ms [23:05:52] New patchset: Lcarr; "enabling https on ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15979 [23:06:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15979 [23:06:44] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15979 [23:10:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:31] PROBLEM - Host mw1097 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:40] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.037 seconds [23:21:12] HTTP OK HTTP/1.1 400 Bad Request [23:22:21] Yeah I've asked about that before [23:22:28] Apparently a 400 is legitimately the correct response for that check [23:26:13] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 35.52 ms [23:34:49] New patchset: Lcarr; "fixed ganglia https" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15983 [23:35:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15983 [23:35:32] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15983 [23:44:49] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:50:23] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [23:54:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:56:32] New patchset: Asher; ".gitreview" [operations/debs/mysqlatfacebook] (master) - https://gerrit.wikimedia.org/r/15986 [23:57:20] Change merged: Asher; [operations/debs/mysqlatfacebook] (master) - https://gerrit.wikimedia.org/r/15986
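
A closing note on the recurring "HTTP OK HTTP/1.1 400 Bad Request" output discussed above: the icinga check is evidently configured to treat any HTTP response from the puppetmaster's HTTPS port as success, and a bare request without a client certificate is presumably not something the puppetmaster will serve, so a 400 still means the service is up. A quick way to reproduce what the check sees; the hostname and port are assumptions based on a standard puppetmaster setup, not taken from the log:

    # print just the status code of an anonymous request to the puppetmaster frontend
    curl -sk -o /dev/null -w '%{http_code}\n' https://stafford.pmtpa.wmnet:8140/
    # per the check output above, a healthy puppetmaster answers this with 400
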