[00:03:10] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.36 ms [00:06:19] PROBLEM - SSH on virt1001 is CRITICAL: Connection refused [00:27:37] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:22:41] PROBLEM - Host mw1006 is DOWN: PING CRITICAL - Packet loss = 100% [01:48:11] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [01:48:47] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 613s [01:51:02] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [01:51:56] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:54:47] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 13s [01:55:05] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 1 seconds [02:48:00] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.43 ms [02:56:51] RECOVERY - Puppet freshness on srv193 is OK: puppet ran at Wed Jul 18 02:56:41 UTC 2012 [03:12:45] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [05:09:43] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [05:18:43] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [05:27:54] good morning [05:42:23] morning [05:42:34] you're on line early [06:18:08] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [07:31:37] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [07:58:40] !log gallium: added Firefox 14 to Testswarm, disabled Firefox 13. [07:58:49] Logged the message, Master [08:00:09] what's testswarm? [08:02:39] paravoid: a PHP application to distribute Javascript unit tests to a swarm of web browsers [08:03:00] paravoid: it gives us reports such as http://integration.mediawiki.org/testswarm/user/MediaWiki/ [08:03:22] ah, cool [08:04:35] the next version is on labs at http://integration.wmflabs.org/testswarm/ [08:05:45] anyway that is fully managed by Timo and I :-D [08:06:05] paravoid: will you be available today to review the changes I made in puppet for beta ? [08:07:30] yes [08:07:37] I am now too if you want [08:11:28] paravoid: so there is https://gerrit.wikimedia.org/r/#/c/15545/ [08:11:33] a bit long change [08:12:03] the idea is to get rid of the deployment-prep NFS instance which just export some Glusterfs filesystem [08:12:19] so the idea is to directly mount /data/project/somepath on the instance [08:12:55] has this been tested? [08:13:05] also, I'm not sure if gluster works well enough [08:13:08] hate gluster hate [08:13:49] yeah a recurring rant :-] [08:14:04] has this been tested with ::self? 
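
For context on the change discussed above (Gerrit 15545): the goal is to drop the deployment-prep NFS instance that only re-exports a GlusterFS volume and instead mount the volume directly on each instance, which is also why the later patchsets amend the fstype from nfs to glusterfs. A minimal sketch of the difference, with a hypothetical server and volume name standing in for the real ones in the puppet change:

    # old indirection: a dedicated NFS instance re-exports the Gluster volume
    mount -t nfs nfs-instance:/export/project /data/project

    # direct mount with the GlusterFS native (FUSE) client; note fstype glusterfs, not nfs
    mount -t glusterfs gluster-server:/project /data/project
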
[08:14:11] not at all [08:14:14] maybe I should [08:14:38] anyway late at night yesterday I figured out I was mounting the gluster FS using type=nfs [08:14:43] so I need to amend the change I guess [08:15:10] however you prefer [08:15:19] it doesn't affect production, so I don't mind merging it as it is [08:15:28] but if it breaks, you get to keep the pieces :-) [08:15:35] well it will break [08:15:43] need to amend the type=nfs to the Gluster FS type [08:16:34] also it might break production since I have factored out common code from nfs::upload to the new nfs::common-upload [08:16:54] then the production class nfs::upload include the nfs::common-upload [08:23:27] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [08:24:07] New review: Hashar; "Patchset 3 change the wrong NFS fstype to glisters." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15545 [08:24:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545 [08:24:18] paravoid: ^ patchset 3 uses glusterfs [08:25:28] I need a new instance [08:39:29] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [08:58:16] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [08:58:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545 [08:58:55] New review: Hashar; "Patchset 4 change /usr/local/apache to use GlusterFS instead of the wrong NFS." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15545 [09:07:23] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [09:16:07] New patchset: Faidon; "stafford: add puppetmaster config for modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15869 [09:16:36] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:16:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15869 [09:16:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15777 [09:17:11] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15869 [09:20:30] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15778 [09:20:39] PROBLEM - Host mw1107 is DOWN: PING CRITICAL - Packet loss = 100% [09:22:47] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [09:23:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545 [09:25:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15779 [09:26:46] fuck you gerrit. [09:26:53] three commits, three merges [09:31:54] I'm very happy we're at least talking about ditching it [09:32:20] isn't it the default workflow ? [09:32:36] not that if you ever rebased your commit, the sha1 changed do that indeed trigger a new patchset [09:48:11] paravoid: so look like I can't mount a subdirectory in a GlusterFS volume :/ [09:48:47] paravoid: Can you tell me why we are using gerrit? [09:49:13] I'm not saying that we shouldn't.. 
I'm just wanting to learn. [09:49:45] r0csteady: cause openstack uses it [09:49:52] and fit the pre commit workflow we wanted to adapt [09:50:19] our previous system was svn + a homegrown code review tool which only allowed post commit review [09:50:28] that did not scale for Ops and MediaWiki core review [09:50:51] so we happily switched to Gerrit (used with success to develop the Android kernel/API and the OpenStack infra) [09:51:09] then people got confused and want to ditch it out for greener pastures [09:52:15] ahh I see [09:52:44] Does this mean that we are researching other options? Or no? [09:53:08] That makes sense. [09:53:22] paravoid: apparently we can't mount a sub directory of a volume. So I could use a bind mount instead to bind /data/project/subdir to some place I want. Does it make any sense ? [09:54:03] r0csteady: we had a post this night in wikitech-l http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/62461 [09:54:15] r0csteady: and see http://www.mediawiki.org/wiki/Git/Gerrit_evaluation [09:54:55] hashar: ewwww. [09:56:27] paravoid: would be something like: mount -t bind /data/project/upload6 /mnt/upload6 [09:58:23] why a symlink wouldn't work? [10:00:07] I find them being ugly [10:00:33] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [10:00:37] but that might end up being the pragmatic solution [10:01:41] uglier than bind mounts? :) [10:01:54] well ok hmm [10:02:00] lets do sym links [10:08:31] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [10:09:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545 [10:11:08] ideally you wouldn't have hardcoded paths :) [10:11:34] well I am not going to rewrite the whole puppet class :-] [10:13:31] ARGHHH [10:13:37] /mnt/upload6/upload6/ [10:20:43] New review: Hashar; "Patchset 6 : finally use symbolic links since Gluster does not let mount a volume subdirectory. Tes..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15545 [10:20:57] paravoid: so patchset 6 is using symlinks https://gerrit.wikimedia.org/r/#/c/15545/ [10:20:59] looks fine [10:22:46] I am out for lunch. [10:38:14] mark: around? [10:38:36] mark: I'm going to work on moving stuff to modules; do you want you or someone else to review them pre-merge? [10:38:43] or should I just merge them as they are? [10:38:57] note that I intend to reorganize/clean them up a bit too [11:22:26] New patchset: Faidon; "ssh: move to a module & reorganize/cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15874 [11:23:07] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15874 [11:47:21] PROBLEM - Host mw1083 is DOWN: PING CRITICAL - Packet loss = 100% [11:49:27] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [11:52:27] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [11:55:48] paravoid: i'm in the data center today [11:55:53] so I don't have time for reviews today [11:56:10] hi [11:56:15] it was more of a generic question [11:56:15] the ntp one is fine i think, I looked at it really briefly [11:56:19] if you do others I would like to review [11:56:27] do you want to review every piece beforehand or not [11:56:36] I did ssh too, see gerrit 15874 [11:56:37] well [11:56:54] I didn't merge that, I was waiting for a reply on that question :) [11:56:54] yeah, I think, since then we can brainstorm about how to handle it [11:57:02] i'm sure it's fine what you did, but it would be nice anyway [11:57:06] sure, no problem [11:57:52] I also sent a mail to ops@ to inform the rest of the team [12:00:19] it was more of a "do you want to be bothered" question, I always prefer to work with others :) [12:33:17] New patchset: Hashar; "clean role::cache::upload from labs stuff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15882 [12:33:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15882 [12:53:04] Logged the message, Master [13:55:47] New patchset: Andrew Bogott; "Another attempt to get partman to partition different drives in different ways." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15887 [13:56:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15887 [13:57:20] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15887 [14:25:06] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.45 ms [14:50:15] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:38] New patchset: Andrew Bogott; "Yet another" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15894 [15:02:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15894 [15:02:27] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15894 [15:11:15] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [15:13:12] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [15:14:06] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [15:20:06] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [15:23:15] andrewbogott: just glanced at your partman config - in case you're worried that the priorities are having an effect, they don't actually have to be between the min and max sizes. I started setting them all to 100 since the types of configs we build don't generally allow for variations in size. [15:24:22] maplebed: Thanks for looking. The main thing I'm confused about is what exactly sda1, sda2, sda3 refer to. [15:24:31] Is it clear what I'm trying to do? [15:24:37] sadly no. [15:24:38] :P [15:24:57] I want two of the drives in a single raid, the other drives sliced vertically into two more raids. [15:25:04] (does 'vertically' make sense here?) 
[15:26:13] I haven't tried setting up multiple raids, but I do have a suggestion. [15:26:37] is it ok to have all the actual disks partitioned the same way? [15:26:55] Not really. I think I know how to do that. [15:27:06] Getting partman to /not/ do that is exactly what I don't understand. [15:27:52] My goal is to have a volume that is explicitly segregated onto separate drives from the main storage volume. [15:28:14] have you tried repeating lines 14-25 (once for the two separated disks, and the second time for the rest of the disks)? [15:28:38] sorry, 14-28 [15:28:52] oh. yeah, that was your previous attempt. [15:28:53] hmm. [15:28:53] https://gerrit.wikimedia.org/r/#/c/15887/1/files/autoinstall/partman/virt-raid10-cisco-ceph.cfg [15:29:09] Well, that attempt resulted in a non-bootable system. So it might have been the right track but hard to tell. [15:29:21] ok, next suggestion: name them differently. [15:29:51] when I"ve interrupted the install sequence and looked at the actual files partman downloaded, one was named 'expert_recipe' [15:30:09] it might be that it's tripping up with using the same name for the two different recipes. [15:30:13] Ah, so, you mean revert back to my two-section approach but give each section a new name? [15:30:35] yeah - though it's also possible that recipe and expert_recipe actually have a specific meaning. [15:30:38] it's a shot in the dark. [15:30:45] so partman-auto/expert_recipe <- the second part is arbitrary? [15:30:47] but one thing you can do to get a little more insight into what's happening: [15:30:50] Oh, ok, I see. Worth trying! [15:30:57] let it do the install thing (and probably fail) [15:31:08] take the host out of netboot.cfg [15:31:17] and then reinstall - it'll drop you into the partitioning menu [15:31:26] back out of that and launch the shell, then you can look at the actual partition layout [15:31:37] and see what it did up close (and maybe see what it did right or wrong) [15:33:26] ok, that sounds reasonable. [15:33:40] DEbugging partman is such a slooooow process [15:33:46] yes. [15:34:02] how big are these disks? [15:34:14] 128 I think [15:36:37] k. your partition sizes look ok. [15:40:29] maplebed: I think the recipe name must be non-arbitrary. Because right now we have "d-i partman-auto/disk" and "d-i partman-auto/expert_recipe" right in a row [15:40:40] and the only thing distinguishing them is the /last_section [15:41:13] yeah, that fits. [15:41:17] ah well. [15:42:09] mark, paravoid, partman thoughts? [15:44:14] maplebed: I bet I need two different partman scripts, and that our current netboot.cfg only supports one [15:44:26] lol [15:45:00] I still say run the previous config you had and see what it did (the one with two copies of the disk list and partition recipe) [15:45:14] 'k [15:47:04] New patchset: Andrew Bogott; "Revert "Yet another"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15897 [15:47:40] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15897 [15:54:32] andrewbogott: regarding? [15:55:28] oh, another random idea - duplicate the disk list and recipe, but only have one raid section. [15:55:39] (now I feel like I'm just in the peanut gallery) [15:55:58] paravoid: on how to get multiple raid configs and multiple disk sets to apply to a single system. [15:56:14] what do you mean? [15:56:21] wait, I think I should read the scrollback :) [15:56:50] paravoid: I want disks a-f divided into two raid-10s. And disks g and h used in a separate, single raid-1. 
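
The layout described above, written out as the mdadm commands one would run by hand; whether partman can be coaxed into producing it automatically is the open question in the rest of this exchange. The device letters follow the description, while the partition numbering is an assumption:

    # two RAID-10 arrays striped across sda..sdf, one per partition
    mdadm --create /dev/md0 --level=10 --raid-devices=6 /dev/sd[a-f]1
    mdadm --create /dev/md1 --level=10 --raid-devices=6 /dev/sd[a-f]2

    # a separate RAID-1 on the remaining pair, kept off the main storage volume
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdg1 /dev/sdh1

This is effectively the "add that last raid by hand" fallback mentioned later in the afternoon: let partman build only what it can express and create the remaining array post-install.
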
[15:57:42] maplebed: Turns out partman won't even run my script as written… I must've missed that message last time. [15:58:09] paravoid: for example, https://gerrit.wikimedia.org/r/#/c/15858/ [15:58:43] maplebed: Hm, I guess it would help if I did the merge on sockpuppet. [15:58:51] +1 [16:12:59] hm, same results this time. [16:13:20] paravoid: any thoughts? Is it at least clear what I'm trying to do? [16:13:58] well, not clear enough :) [16:14:03] you want multiple md devices? [16:14:17] i.e. an md0 with sd[abcd]1 and an md1 with sd[abcd]2? [16:16:43] Um… that's what the existing partman does, I think? [16:17:06] I want three drives total. [16:17:37] sd[abcdef]1 and sd[abcdef]2 and sd[gh] [16:19:12] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [16:22:29] * andrewbogott will be back shortly [16:41:10] New patchset: Jalexander; "Add Mexico, Puerto Rico, US Virgin Islands and outlying islands" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15900 [16:58:36] paravoid: Sorry, I skipped out mid-conversation. Was I making sense, just then? [17:25:57] Logged the message, Master [17:32:18] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [17:34:57] New patchset: Lcarr; "temporarily blocking smtp out so that I can fix icinga without paging everyone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15907 [17:35:09] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [17:35:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15907 [17:37:32] Logged the message, Master [17:44:28] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [17:47:59] ok, I'll boot it and pray to some kinda flying spaghetti monster [17:48:03] thanks! [17:48:22] paravoid: You went to bed while I was offline, didn't you? Damn. [17:48:37] nope [17:48:44] I was on a call [17:50:52] hey Ryan_Lane [17:51:03] morning [17:51:10] awesome on the beginning of the modules, btw [17:52:27] cmjohnson1: any eta on search35? [17:52:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15907 [17:54:53] Ryan_Lane: so, I won't stay for too long, it's getting late [17:55:02] no worries [17:55:03] Ryan_Lane: but do you want to talk about the migrations a bit? [17:55:06] sure [17:55:08] so… [17:55:16] I did a rsync of the _base directory across [17:55:21] to virt6-8 [17:56:15] to migrate vms, we need to turn them off, rsync their files, update the database, then reboot them via nova [17:57:01] :/ [17:57:09] OK, so, once again… Ryan_Lane, paravoid, do either of you have reason to think that it's possible, with our current partman scheme, to partition different sets of disks in different ways? [17:57:26] andrewbogott: should be [17:57:30] paravoid: yeah, it sucks [17:57:36] Ryan_Lane: Any examples? [17:57:40] I hope that block migration is fixed in precisw [17:57:41] andrewbogott: oops, sorry :/ [17:57:42] *precise [17:57:54] andrewbogott: no examples, sorry :( [17:58:28] Ryan_Lane: It seems like you'd need two different partman scripts to apply two different partitionings.... [17:58:43] paravoid: No worries. [17:58:48] andrewbogott: I have no idea tbh [17:58:56] Just, frustrating partman is frustrating [17:59:00] yeah it is [17:59:01] really? 
it should be possible to do it via one file, I'd think [17:59:04] partman sucks [17:59:38] last time I was messing with it I was thinking of setting up a kvm locally as to be able to do tests without waiting for the Ciscos to boot [18:00:16] paravoid: so, the ceph people are saying to separate the journal from the data [18:00:19] which is a sane recommendation [18:00:20] I know [18:00:25] which is why we were looking at splitting the sets [18:00:26] and to put journal to ssds too [18:00:36] yeah [18:00:37] * Aaron|home heard "ceph" [18:00:40] also sane [18:00:52] Ryan_Lane: If you look at the history of virt-raid10-cisco-ceph.cfg you can see the different things I've tried. [18:01:46] Ryan_Lane: so, re: migrations, do you have a plan? [18:01:58] do them one at a time [18:01:59] :) [18:02:04] do it per project? or random VMs? [18:02:04] actually, that's a lie [18:02:11] I plan on doing it 3 at a time, if possible [18:02:25] likely best to start with less needed vms [18:02:39] also, did anyone look at the bug that was reported this morning? [18:02:47] which one? [18:03:36] instances can't be created [18:03:40] or rebooted [18:04:14] https://bugzilla.wikimedia.org/show_bug.cgi?id=38473 [18:05:13] bridge looks fine [18:05:19] devices look good [18:05:26] (this instance is on virt7) [18:05:30] hm, I wonder why I didn't get that mail [18:05:40] you may not be on the default cc list [18:05:59] dnsmasq is running [18:06:24] oh. was this one of the fucked up instances? [18:06:29] or a new one? [18:07:15] funny. cloud-init ran [18:07:38] the instance pings [18:07:45] it seems just the metadata service was having issues [18:08:32] based on the logs, the api service is returning data fine [18:08:55] at least as of: 2012-07-18 15:22:22,084 [18:09:28] New patchset: Andrew Bogott; "Now I'll settle for a complete run, even." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15943 [18:09:47] 169.254.169.254 is only bound on virt2 [18:09:52] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15943 [18:10:24] it isn't in the iptables rules on virt7, which is good [18:10:41] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15943 [18:10:50] I'm going to reboot 0000034b to see if it is working magically now [18:11:04] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [18:11:19] nope [18:11:21] wtf [18:12:06] I restarted the API service [18:12:18] timed out changed to reset by peer [18:12:23] so the instance is talking to the service [18:12:53] seems the api is timing out to the db [18:15:43] RECOVERY - Host search35 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [18:15:52] RECOVERY - Puppet freshness on search35 is OK: puppet ran at Wed Jul 18 18:15:41 UTC 2012 [18:16:21] it's funny. the metadata service seems to be returning 200's [18:16:27] no clue why the instance is saying it's timing out [18:17:23] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:17:46] hi Ryan_Lane :) [18:17:51] howdy [18:18:00] got a lab issue this afternoon [18:18:09] you might already be aware about it https://bugzilla.wikimedia.org/show_bug.cgi?id=38473 [18:18:23] yeah [18:18:23] cloud-init running: Wed, 18 Jul 2012 13:47:07 +0000. 
up 5.13 seconds waiting for metadata service at http://169.254.169.254/2009-04-04/meta-data/instance-id [18:18:26] already looking at it [18:18:49] that address i a bit odd, IIRC it is a local link address used when you don't have any IP configuration [18:19:09] that's handled magically [18:19:19] the odd thing is that the api is returning [18:19:25] but it's taking 9 seconds to do so [18:20:21] also the list of instances on lab console have been veryyyy slow all the day [18:20:29] both might have a common root cause [18:21:18] yeah. the api service is really slow [18:22:09] have you updated something recently? [18:22:12] no [18:22:55] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [18:23:35] New patchset: Lcarr; "telling iptables what smtp is" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15946 [18:24:12] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15946 [18:28:45] New patchset: Lcarr; "smtp is tcp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15948 [18:29:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15948 [18:35:51] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15948 [18:40:55] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [18:42:04] New patchset: Andrew Bogott; "..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15951 [18:42:40] New patchset: Lcarr; "reactivating icinga files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15952 [18:43:15] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15951 [18:43:16] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15952 [18:44:17] andrewbogott: good for me to merge that ? [18:44:31] LeslieCarr: sure, thanks. [18:45:35] LeslieCarr: Even better if you fix it :) [18:45:46] haha [18:46:14] :( [18:47:49] RECOVERY - Puppet freshness on neon is OK: puppet ran at Wed Jul 18 18:47:46 UTC 2012 [18:52:09] ignore the alerts [18:52:10] sorry [18:52:24] yay! I was just about to ask. [18:52:46] maplebed: ignore [18:52:54] k. tnx. [18:53:03] i did a iptables blocking port 25 but obviously it's too sneaky [18:53:08] lol [18:53:20] those are the best. [18:53:23] ok, bbl. [18:58:22] kk [19:00:07] PROBLEM - Host search35 is DOWN: PING CRITICAL - Packet loss = 100% [19:02:58] New patchset: Lcarr; "prevents icinga from overwriting config files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15955 [19:03:10] cmjohnson1: appears to be booting. thanks! [19:03:33] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15955 [19:03:39] yep yep [19:03:52] RECOVERY - Host search35 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [19:09:54] Logged the message, Master [19:12:48] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:09] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:27] RECOVERY - SSH on virt1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:17:36] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.49 ms [19:17:45] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:25:10] New patchset: Andrew Bogott; "..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/15957 [19:25:46] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15957 [19:30:24] 1.4.3 and swauth. [19:30:24] maplebed: ^^ [19:30:26] heh [19:30:33] soon to be 1.5.0 and swauth. [19:30:36] (but not yet) [19:30:49] binasher: ok, the enwiki flaggedimages & flaggedtemplates tables can be emptied out now [19:31:36] Aaron|home: excellent, thanks [19:32:45] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:14] j^: by 'vm' do you mean on labs or on some virtual enviornment on my laptop? [19:33:37] I've never set it up on a single host; only a cluster (of vms or real hosts). [19:34:05] nope, sorry. [19:34:24] I can add you to the project on labs if you want to poke at it though. [19:35:55] k. one sec. what's your labs username? [19:36:12] ok. [19:37:54] ok, you're in teh project. it's called 'swift'. one of the front end hosts (entry into the cluster from bastion) is swift-fe1. [19:38:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:18] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.41 ms [19:38:20] it was set up following http://wikitech.wikimedia.org/view/Swift/Setup_New_Swift_Cluster_%28labs%29 [19:38:20] dunno if you all have the background, but j^ is Jan Gerber, who is here in SF right now working on getting Timed Media Handler finished once and for all [19:38:29] tnx robla [19:39:13] good to know. [19:39:28] maplebed == Ben Hartshorne, who did all of the ops work (and a fair bit of dev work) on the Swift deployment [19:39:42] (that's for j^'s benefit :) [19:40:27] part of what j^ is doing is following up on my byte range paranoia [19:40:53] which, as it turns out, it looks like Squid (and probably Varnish) are likely to be a bigger problem than Swift [19:40:57] j^: feel free to ping me with qusetions anytime. sorry I"m not in sf to chat in person... [19:41:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.976 seconds [19:41:09] I think I'll be back before you leave though. [19:41:27] PROBLEM - SSH on virt1001 is CRITICAL: Connection refused [19:41:40] j^: for originals, yes. for thumbnails, there's one more step before it'll work (I think) [19:42:06] I'm working on that step with aaron atm. [19:43:28] ryan_lane: I don't see any evidence (either from my experience, or from existing examples, or from google) that it is possible to do separate things to separate drives via partman. So I'm not sure it's useful for me to keep making random changes in the syntax and hope that it will happen. Is there anyone in the whole world who understands partman well enough to tell me if this is possible? Otherwise I can just setup partman t [19:43:29] leave two drives untouched and then add that last raid by hand, maybe? [19:45:24] Logged the message, Master [19:52:58] andrewbogott: the approach I've taken with swift is that partman builds the OS drives (first two raid1) and doesn't do anything to the rest. puppet then formats and mounts all the rest. [19:54:10] maplebed: That sounds reasonable. Can you point me to the manifest that does the formatting? 
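
The approach maplebed describes (partman builds only the OS RAID-1 and leaves the rest alone, puppet formats and mounts everything else) comes down to something like the following per leftover drive. This is a sketch of the idea rather than the actual swift manifest; the filesystem, mount options and mount point are assumptions:

    # partition, format and mount a drive that partman deliberately left untouched
    parted -s /dev/sdc mklabel gpt mkpart primary 0% 100%
    mkfs.xfs -f /dev/sdc1
    mkdir -p /srv/swift-storage/sdc1
    mount -o noatime,nodiratime /dev/sdc1 /srv/swift-storage/sdc1

In puppet terms that is typically an exec guarded by an onlyif/unless check plus a mount resource, which is presumably the manifest being asked for here.
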
[20:01:51] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [20:14:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [20:29:36] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [20:30:51] PROBLEM - NTP on srv278 is CRITICAL: NTP CRITICAL: Offset unknown [20:33:15] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [20:34:36] RECOVERY - NTP on srv278 is OK: NTP OK: Offset -0.03993880749 secs [20:38:41] New review: Platonides; "A .svg won't work very well. Try using http://upload.wikimedia.org/wikipedia/commons/thumb/d/d1/Wiki..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/15807 [20:41:07] New patchset: Andrew Bogott; "Give up for now and just leave two drives untouched." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15961 [20:41:43] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15961 [20:43:36] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [20:53:45] New patchset: Aaron Schulz; "Added multiwrite backend config - not yet used." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15963 [20:55:38] New patchset: Andrew Bogott; "May as well call this 'srv' instead of 'stuff' now that I'm not debugging anymore." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15964 [20:56:14] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15964 [20:57:49] New patchset: Aaron Schulz; "Added multiwrite backend config - not yet used." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15963 [20:58:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:02:30] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:08:03] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [21:08:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.498 seconds [21:13:28] New patchset: Cmjohnson; "adjusting netboot.cfg for db63-db77" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15966 [21:14:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15966 [21:14:50] sure [21:15:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15966 [21:15:16] wow, lots of db machines now :) [21:21:38] hi guys, [21:21:49] i'm building new .debs for nodejs and npm [21:21:57] johnduhart has already done the hard work here [21:21:57] http://svn.mediawiki.org/viewvc/mediawiki/trunk/debs/nodejs/ [21:22:30] I need to commit a couple of small things (changelog, control) [21:22:31] well awesome now I can't even make partman do the simple 6-disk raid. 8 yes, 6 no, wtf [21:22:53] I don't have commit permissions to that part of the svn repo [21:22:55] New patchset: Alex Monk; "Fix a logo screw up I made in change 15730." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15807 [21:23:05] should I move nodejs and npm over to git/gerrit? 
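
On the nodejs/npm packaging work mentioned just above: once the changelog and control fixes are committed, the build-and-publish cycle is the usual Debian one. A rough sketch, assuming the target suite is the precise-wikimedia one that shows up in the !log a few minutes later; the repository base directory is a placeholder, not taken from the log:

    # build unsigned binary packages from the packaging tree
    cd nodejs/
    dpkg-buildpackage -us -uc -b

    # publish the result into the apt repository with reprepro
    reprepro -b /srv/wikimedia include precise-wikimedia ../nodejs_*_amd64.changes
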
[21:28:39] !log adding php5 packages to precise-wikimedia repo [21:28:46] Logged the message, notpeter [21:31:18] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:51] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 36.01 ms [21:39:37] !log removing mw1 from apaches pool to do precise test install [21:39:44] Logged the message, notpeter [21:41:12] PROBLEM - SSH on virt1002 is CRITICAL: Connection refused [21:41:34] !log scratch that. removing srv289 from apaches pool for precise testing, not mw1 [21:41:42] Logged the message, notpeter [21:41:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:21] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [21:52:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.284 seconds [21:53:21] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [22:02:27] !log ok, changed my mind one more time. removing srv194 from apaches pool for precise testing [22:02:35] Logged the message, notpeter [22:06:02] :D [22:08:56] New patchset: Pyoungmeister; "giving srv194 an apache role class for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15975 [22:09:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15975 [22:25:36] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:26:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:31:09] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 36.33 ms [22:38:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [22:42:52] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused [22:44:59] New patchset: Pyoungmeister; "give srv194 the mw.cfg for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15978 [22:45:35] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15975 [22:45:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15978 [22:45:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15978 [22:51:07] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:56:40] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 35.47 ms [23:05:52] New patchset: Lcarr; "enabling https on ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15979 [23:06:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15979 [23:06:44] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15979 [23:10:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:31] PROBLEM - Host mw1097 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:40] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.037 seconds [23:21:12] HTTP OK HTTP/1.1 400 Bad Request [23:22:21] Yeah I've asked about that before [23:22:28] Apparently a 400 is legitimately the correct response for that check [23:26:13] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 35.52 ms [23:34:49] New patchset: Lcarr; "fixed ganglia https" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15983 [23:35:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15983 [23:35:32] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15983 [23:44:49] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:50:23] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [23:54:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:56:32] New patchset: Asher; ".gitreview" [operations/debs/mysqlatfacebook] (master) - https://gerrit.wikimedia.org/r/15986 [23:57:20] Change merged: Asher; [operations/debs/mysqlatfacebook] (master) - https://gerrit.wikimedia.org/r/15986
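
A closing note on the recurring "HTTP OK HTTP/1.1 400 Bad Request" output discussed above: the icinga check is evidently configured to treat any HTTP response from the puppetmaster's HTTPS port as success, and a bare request without a client certificate is presumably not something the puppetmaster will serve, so a 400 still means the service is up. A quick way to reproduce what the check sees; the hostname and port are assumptions based on a standard puppetmaster setup, not taken from the log:

    # print just the status code of an anonymous request to the puppetmaster frontend
    curl -sk -o /dev/null -w '%{http_code}\n' https://stafford.pmtpa.wmnet:8140/
    # per the check output above, a healthy puppetmaster answers this with 400
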