[00:00:11] btw, https://gerrit.wikimedia.org/r/15671 should be merged and deployed [00:00:12] if they are broken then I can remove them without breaking existing URLs [00:00:26] TimStarling: anything I did with shop was ages ago, and yes [00:00:29] unless they worked once, yes [00:00:44] I wasn't even aware of that shop [00:00:50] are there new changes or something? [00:00:56] I may have briefly seen it years ago [00:01:22] I am introducing new changes but it seems it was broken already [00:02:55] I see I did something 5 months ago [00:03:10] RT 2488 [00:03:20] there's a diff [00:03:33] good night [00:03:55] night, thanks for looking at it Platonides [00:04:08] it was just a quick look :) [00:05:13] TimStarling: I have to head out too, if you RT2488 doesn't make sense of my edits, shoot me an email and I'll work on it in the AM [00:05:27] err if you find that RT... [00:05:28] I will make it work as specified in RT 2488 [00:05:37] ok. thanks [00:12:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:14:51] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [00:19:17] New patchset: Tim Starling; "Fix shop and donate redirects" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15760 [00:21:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [00:28:17] !log testing new redirects.conf on mw1 [00:28:25] Logged the message, Master [00:36:00] New review: Tim Starling; "Tested on a single server, now merging for full-scale deployment." 
[operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/15698 [00:36:31] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15698 [00:40:37] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15760 [00:44:55] !log deploying new redirects.conf [00:45:06] Logged the message, Master [00:53:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:03:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [01:11:03] New patchset: Tim Starling; "Remove obsolete country portals" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15763 [01:27:45] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [01:30:50] TimStarling: that ended up being quite an overhaul with redirects.conf [01:31:07] yes [01:31:56] I've been tempted to rip out a bunch of the aliases for donate but they go back to the beginning of svn history, and I was afraid I'd break something [01:32:19] I did remove a lot of those aliases [01:32:41] mostly the ones that were broken already [01:32:58] I considered replacing the whole of redirects.conf with a PHP script [01:33:39] eventually I decided that the improvement in the readability of the configuration would not really justify the work done [01:34:05] it's huge from my perspective [01:34:11] the PHP script would have to generate the ServerAlias directive anyway [01:34:15] true [01:34:41] but I'm still considering sweeping changes to main.conf and remnant.conf [01:34:53] i need a much more complete list of URLs for my conf tester [01:35:30] i keep adding the stuff I'm asked to adjust, but i should really just grab a big collection from logs [01:35:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:35:47] I'm mostly responsible 
for the current structure of these files, I introduced it in about 2004 [01:35:48] changes to main/remnant ++ :-) [01:36:00] it's had a lot of chefs in the meantime though [01:36:30] I'm not sure if the fact that the files have only grown and not really changed in 8 years is a compliment or an insult [01:36:41] ha [01:36:53] maybe it was good enough, or maybe it's so horrendous that nobody else can bear to look at it [01:37:12] it was indeed terrifying the first time I made edits [01:37:34] prompted me to take the time to write a test :-) [01:37:49] but, I'm not sure what else can be done other than to handle a whole lot more in software [01:38:07] well, we can be smarter about how the files are constructed [01:38:21] VirtualHost sections can inherit settings from the server [01:38:26] true [01:38:28] even rewrite rules [01:38:45] so it would be possible to factor out some of the duplication [01:38:51] we still have the whole http/https overhaul to do as well [01:39:09] what overhaul is that? [01:39:19] oh, you mean changing redirects? 
[01:39:38] fixing redirects that bounce you from https back to http [01:39:58] maybe that can be fixed in the proxy [01:40:21] it'd probably be easier to fix it there than in 1000 places in the apache conf [01:40:39] there are some cases where redirects take you to http:// regardless of the protocol header [01:40:51] oops, sorry, IRC client got stuck there [01:41:31] it may not be that bad even within apache conf, but there were a few cases anyway [01:41:33] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 272 seconds [01:42:03] I've been thinking about merging document roots, I'm not sure if it's wise to merge all of them [01:42:12] there are a few special cases [01:42:34] so sometimes you get a choice between splitting document roots or using RewriteCond %{HTTP_HOST} [01:42:45] i see [01:43:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.187 seconds [01:43:49] alrighty, I'm off to bed. thanks for whatever you do in cleaning up *conf! 
[01:44:08] thanks for helping, good night [01:44:51] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 291 seconds [01:50:33] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:51:00] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 658s [01:57:00] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 26 seconds [01:57:00] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 28s [02:17:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:25:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.286 seconds [02:45:48] [20:44] !log deploying new redirects.conf [02:47:08] mzmcbride@gonzo:~$ curl -Is "http://shop.wikimediafoundation.org" | grep Location [02:47:11] Location: http://wikimediafoundation.org/ [02:47:16] Is that the correct behavior? [02:47:37] shop.wikipedia.org behaves differently: [02:47:42] mzmcbride@gonzo:~$ curl -Is "http://shop.wikipedia.org" | grep Location Location: http://shop.wikimedia.org/ [02:50:01] Hmm, and https://mediawiki.org is still broken. Hrmph. [02:56:45] I hadn't restarted the apaches properly [02:57:09] Ah. [02:57:46] also it's cached [02:58:21] !log graceful restart of all apaches [02:58:30] Logged the message, Master [02:59:18] max-age is set to 1 month [03:02:00] Yes. [03:04:48] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [03:13:48] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [03:15:57] TimStarling: max-age won't matter to a test with curl, though, right? [03:16:23] Squid will use it if there's no s-maxage [03:16:28] I'm not sure if the shop strangeness has a bug. 
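Editorial sketch, not part of the log: the [03:16:23] remark that Squid falls back to max-age when there is no s-maxage matches the general HTTP rule that a shared cache prefers s-maxage over max-age. A minimal Python illustration of that precedence follows. The function name shared_cache_ttl is hypothetical, the 30-day figure is only an assumed reading of "1 month", and none of this is taken from the Wikimedia configuration.

```python
def shared_cache_ttl(cache_control):
    """Return the lifetime in seconds a shared cache would apply, or None.

    Hypothetical helper: s-maxage wins when present, otherwise max-age.
    """
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')
    # Shared caches look at s-maxage first and only then at max-age.
    for key in ("s-maxage", "max-age"):
        value = directives.get(key, "")
        if value.isdigit():
            return int(value)
    return None


# Assumed example: "1 month" taken as 30 days = 2592000 seconds.
print(shared_cache_ttl("max-age=2592000"))
```

With only max-age present, the proxy keeps serving the cached redirect for the full period, which would explain why a curl test through the caches can still show the old Location header after the Apache restart.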
[03:22:42] RECOVERY - Puppet freshness on spence is OK: puppet ran at Tue Jul 17 03:22:11 UTC 2012 [03:43:58] I just learned that noc.wikimedia.org had useful content before "nocnocnocnocnoc". [03:57:01] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [05:36:39] morning [05:46:18] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [05:49:18] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [06:11:28] New patchset: Tim Starling; "Move the Include directives to a separate file" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15766 [06:12:29] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15766 [06:28:16] hello paravoid [06:28:22] hi Tim [06:28:39] do you know anything about wikimedia-task-appserver? [06:29:28] I vaguely remember porting it to precise [06:29:44] is it in git? [06:30:33] there's no operations/debs/wikimedia-task-appserver [06:30:54] doesn't seem so [06:31:14] it's in svn [06:31:35] don't see any commits from myself, it was probably a matter of copying it [06:31:53] sorry, I'm not very helpful I suppose [06:32:16] it's ok [06:32:30] I updated it lots of times while it was in subversion, so if it's still there I can probably deal with it [06:33:55] we have a script to check to see whether NTP is working [06:34:12] apparently it fails if you run it on too many servers at once, it's causing some interesting issues [06:34:31] script that runs on fenari I presume? [06:34:39] no, it runs on every apache [06:34:51] when you start apache from certain other scripts [06:34:53] why would NTP not be working? 
[06:35:38] it's actually stopped working a few times [06:35:54] but the best example was November 2005 [06:36:06] hahaha [06:36:07] do tell [06:36:13] * apergos lurks [06:36:19] I know when it was because I put the date in here: https://wikitech.wikimedia.org/view/NTP [06:37:28] well, I see an error there :) [06:37:32] I'll see if I can remember all the details [06:38:07] we were using the NTP service on a router [06:38:27] and because of a firewall rule change, the router stopped getting NTP packets from its upstream servers [06:38:51] it was sending NTP broadcasts to the apaches, and it continued doing so after this happened [06:39:13] but since it was out of sync, it indicated in its NTP packets that it wasn't to be trusted [06:39:27] so none of the servers used it for synchronisation, they all just used their local clocks [06:39:50] makes sense [06:40:00] that's good, as the router may have been drifting a lot [06:40:10] then once the servers had drifted apart by a few seconds, we started seeing some nasty caching issues [06:40:19] even DB consistency issues IIRC [06:40:23] ouch :) [06:40:50] so, why not use nagios ntp checks? 
[06:40:52] we have a cache in memcached indicating how much the slave servers are lagged [06:41:01] instead of doing things in a script [06:41:10] I don't think nagios existed back then [06:41:23] I think we do monitor it with nagios now [06:41:46] yep, it seems that we're using it now [06:41:46] good [06:41:55] so we have a cache in memcached which just caches things for a few seconds [06:42:31] and there's a mechanism which allows it to be updated randomly just before it expires, which relies on accurate clocks on the apache servers [06:43:09] IIRC that caused a DB overload one time when all servers decided that the cache was always about to expire [06:43:16] ouuch [06:43:32] well, I suppose half of them thought it was about to expire [06:44:32] I guess nagios will keep us safe now [06:45:02] fingers crossed [06:45:04] what I did at the time was to set up the local clock of a single server as a stratum 10 NTP server [06:45:20] so that even if we stopped receiving updates from the outside world, the clocks would all stay synchronised [06:45:34] no doubt that technique was lost a while ago [06:46:22] yeah, I see neither dobson nor linne keep time independently [06:47:11] I could just remove the time check, but I'd have to update wikimedia-task-appserver [06:48:10] the apache-sanity-check script is basically a chronicle of war stories like the one I just told [06:48:30] it checks for things that broke the site in the past [06:49:16] hehehe [06:49:26] well, that's very good [06:49:45] I've seen sites where issues are being fixed and no one takes care to prevent them from happening again [06:54:20] did you figure out what's happening on srv193 btw? [06:54:35] sorry for leaving the updated php packages held there [06:54:40] and thanks for cleaning up [06:55:53] the segfaulting or the puppet failure? 
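Editorial sketch, not part of the log: the messages from [06:41:55] to [06:43:32] describe a short-lived memcached entry that is refreshed at random shortly before it expires, and how skewed clocks made servers believe the entry was always about to expire. The Python below is only a generic illustration of that kind of randomised early refresh. The function should_refresh, its parameters, and the linear probability ramp are all assumptions made for this example, not the actual MediaWiki mechanism.

```python
import random


def should_refresh(now, stored_at, ttl, window, rng=random.random):
    """Decide whether this server should regenerate a cached value.

    Hypothetical sketch: outside the last `window` seconds of the entry's
    life nothing happens; inside it, the chance of refreshing rises
    linearly until the entry has expired.
    """
    remaining = ttl - (now - stored_at)
    if remaining <= 0:
        return True   # already expired according to the local clock
    if remaining >= window:
        return False  # still fresh, leave the database alone
    return rng() < (window - remaining) / window


# The entry was written at t=99.0 on a correct clock, with a 5 second lifetime.
accurate_clock = 100.0
fast_clock = accurate_clock + 10.0  # a server whose clock has drifted ahead
print(should_refresh(accurate_clock, 99.0, ttl=5, window=1))  # False
print(should_refresh(fast_clock, 99.0, ttl=5, window=1))      # True
```

A server whose clock runs ahead sees every entry as expired or nearly expired and regenerates it on each request, which is the sort of database load the log describes.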
[06:56:05] present tense, so the segfaulting I guess [06:56:09] I haven't looked at it yet [06:56:34] segfaulting, yes [06:59:44] does it segfault reliably, or just occasionally? [07:00:04] no idea :) [07:00:11] I wasn't involved at all at that ticket [07:00:24] people were just asking me if I was messing with srv193 again [07:00:28] and I just said I wasn't [07:00:38] (I just came back from 2 weeks "vacation" yesterday) [07:00:57] ah, it comes in bursts [07:02:04] probably APC [07:02:22] i.e. corrupted APC cache, not to say APC is at fault [07:03:57] we had a lot more of that behaviour before I fixed wmerrors to kill the process on timeout [07:06:24] New patchset: Faidon; "Redirect secure.wikimedia.org URLs to proper HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13429 [07:06:54] Reedy: around? [07:07:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13429 [07:07:21] Reedy: when you are, want to take a look at ^^^ (gerrit 13429)? [07:09:55] !log restarted apache on srv193 [07:10:00] never know [07:10:05] Logged the message, Master [07:10:14] Reedy: I derived that list from your old HTTPS everywhere commits, but then didn't include "office"; not sure how can I get a complete, authoritative list [07:10:28] in any case, if I missed something it just won't redirect [07:10:40] but continue to work [07:22:51] I'll put some permanent core dump support in envvars [07:30:36] !log testing envvars/apache2.conf change on srv193 [07:30:44] Logged the message, Master [07:58:27] New patchset: Hashar; "beta: enable DNS blacklist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15768 [07:59:02] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15768 [08:03:52] apergos: do we still have Solaris boxes around? 
[08:04:44] grrr [08:04:48] way to ruin a guy's day :-P [08:04:50] yes we do [08:05:03] lol [08:05:04] sorry :) [08:05:05] but we are going to get rid of them [08:05:14] i.e. install linux on them real soon now [08:05:30] oh? [08:05:33] sparc or x86? [08:05:40] thumpers iirc [08:05:50] thumpers? [08:06:17] http://wikitech.wikimedia.org/view/Sun_Fire_X4500_and_X4540 [08:06:51] we have linux on most of them, I think there are only two left now [08:07:04] ah, x86 [08:07:15] do they actually run puppet? [08:07:26] I'm looking at puppet and there's some solaris stuff there [08:09:11] some do [08:09:22] well the solaris ones.. meh, it's pretty sketchy [08:09:44] in a few months though we'll have images out of swift and those will be regular linux boxes, puppetized like everything else [08:12:45] nope, no solaris in puppet [08:13:34] just checked servermon :-) [08:14:06] it's so not worth it to fool with those [08:14:20] just nag ben and aaron to hurry up :-D [08:14:36] I don't want to fool with those [08:14:55] I'm just wondering if those "if $operatingsystem is Solaris" statements are doing anything [08:14:58] or are just remnants [08:15:30] New patchset: Hashar; "enhance rakefile for easy validation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15568 [08:16:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15568 [08:16:55] if they are doing anything, we don't care about it [08:17:17] Change abandoned: Hashar; "too old / nobody care anyway" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/8863 [08:17:22] our approach to solaris is "if it ain't broke don't fix it" [08:17:24] Change abandoned: Hashar; "too old / nobody care anyway" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/8866 [08:18:42] New patchset: Hashar; "redirect (302) /w/ to /w/index.php" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7772 [08:20:36] Change abandoned: Hashar; "dupe of https://gerrit.wikimedia.org/r/#/c/7772/" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/11606 [08:21:22] Change abandoned: Hashar; "unneeded on beta AFAIK" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11873 [08:21:52] Change abandoned: Hashar; "I have a local script to clean them out, so just abandoning that change." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/12151 [08:23:04] New patchset: Hashar; "(bug 37076) `lint` tool require php5-lint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13048 [08:23:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13048 [08:40:11] PROBLEM - Puppet freshness on srv194 is CRITICAL: Puppet has not run in the last 10 hours [08:51:17] PROBLEM - Host srv266 is DOWN: PING CRITICAL - Packet loss = 100% [08:52:07] hashar: heya [08:52:13] I'm back and here [08:56:25] paravoid: have you enjoyed the Debian conferences ? 
:-] [08:56:33] just one conference :) [08:56:36] but yes, enjoyed it a lot [09:06:59] paravoid: Ryan had some issues roughly 10 days ago when migrating the labs instances to the new hardware [09:07:05] some of the beta instances got corrupted [09:07:16] thanks to puppet that was easy to recover :-] [09:10:36] yeah, I read about it [09:14:45] paravoid: and whenever you have caught up with your backlog, I have a change for you to get rid of the 'beta' NFS instance and use /data/project instead \O/ [09:16:11] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [09:17:31] !log deployment-prep sync upload6 dirs again. root@deployment-nfs-memc:$ rsync -a --progress --inplace /mnt/export/upload6 /data/project/upload6 [09:17:39] Logged the message, Master [09:19:55] wasn't that supposed to go on #-labs? :) [09:20:11] agrgmmg [09:53:33] hashar: was finishing something, going to lunch out (have some visitors in town), I'll be back in 2h or so and I'll take care of both your requests [09:53:52] paravoid: I am meeting my accountant in 2 hours :-D [09:54:13] but yeah, this afternoon will be great [09:54:17] have a good lunch paravoid ! [09:57:43] New patchset: Faidon; "puppetmaster::self: initial support for modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15777 [09:58:20] New patchset: Faidon; "ntp: move to a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15778 [09:58:54] New patchset: Faidon; "ntp: cleanup, parameterize, enforce style" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15779 [09:58:56] aaand there ;-) [09:59:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15777 [09:59:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15778 [09:59:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15779 [09:59:36] ahhh modules [09:59:37] :) [10:01:58] yeep :) [10:01:59] ttyl [10:02:01] nap time + accountant see you in 2hours and a half ) [10:15:32] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [10:36:47] !log Stopped PyBal on amslvs2, failing over traffic to amslvs4 [10:36:55] Logged the message, Master [10:41:02] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: host 91.198.174.247, sessions up: 4, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [10:44:49] New patchset: Mark Bergsma; "Move amslvs2 configuration to new-style, but IPv6 disabled, for after reinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15783 [10:45:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15783 [10:47:54] !log Reinstalling amslvs2 with Ubuntu Precise [10:48:01] Logged the message, Master [10:52:12] New patchset: Mark Bergsma; "Use the lvs partman recipe for amslvs* as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15784 [10:52:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15783 [10:52:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15784 [10:52:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15784 [10:53:11] PROBLEM - SSH on amslvs2 is CRITICAL: Connection refused [11:02:55] New patchset: Mark Bergsma; "Revert "Use the lvs partman recipe for amslvs* as well" - these don't have hardware RAID." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/15785 [11:03:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15785 [11:04:44] PROBLEM - Host amslvs2 is DOWN: PING CRITICAL - Packet loss = 100% [11:10:26] RECOVERY - Host amslvs2 is UP: PING OK - Packet loss = 0%, RTA = 118.01 ms [11:11:40] Wikipedia unreachable in Italy [11:13:37] completely? [11:17:29] RECOVERY - SSH on amslvs2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:30:01] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [11:31:49] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [11:32:39] !log amslvs2 is back up and serving traffic [11:32:46] Logged the message, Master [11:40:38] !log Stopped PyBal on amslvs1, failing over traffic to amslvs3 [11:40:45] Logged the message, Master [11:42:25] New review: Reedy; "Yaay. Die singer, die!" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13429 [11:44:56] !log Reinstalling amslvs1 with Ubuntu Precise [11:45:05] Logged the message, Master [11:47:07] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: host 91.198.174.247, sessions up: 4, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [11:47:39] New patchset: Tim Starling; "Don't sync apache2.conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15786 [11:48:15] New patchset: Tim Starling; "Use all.conf per I859911db" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15787 [11:48:49] New patchset: Tim Starling; "Permanent core dump support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15788 [11:49:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15786 [11:49:23] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15787 [11:49:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15788 [11:49:34] mark: is it ok if I merge those? or is it not a good time? [11:49:48] apache config changes? [11:49:52] yeah [11:49:56] now is as good a time as any... unless you want reviews [11:50:47] i'm doing esams lvs reinstalls, but that's completely unrelated and shouldn't affect anything [11:50:53] New patchset: Mark Bergsma; "Make amslvs1 config new-style, IPv6 disabled, for after reinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15789 [11:51:19] PROBLEM - SSH on amslvs1 is CRITICAL: Connection refused [11:51:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15789 [11:51:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15789 [11:51:50] I don't need reviews [11:52:10] then go ahead :) [11:52:31] I already tested all the changes [11:53:06] the commit/review/approve/deploy cycle is a bit too long, you don't want to get to the end of it and find out you made a typo [11:54:19] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15786 [11:54:30] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15787 [11:54:41] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15788 [11:55:22] or in the case of that last one, discover some random feature in bash is not in dash [11:56:02] ` silly [11:58:49] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [12:09:46] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [12:10:22] !log amslvs1 is back up and serving traffic [12:10:29] Logged the message, Master 
[12:15:40] New patchset: Mark Bergsma; "Cleanup esams lvs configuration now all servers are Precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15791 [12:16:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15791 [12:16:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15791 [12:28:03] !log Stopped PyBal on lvs1003, failing over traffic to lvs1006 [12:28:11] Logged the message, Master [12:34:31] PROBLEM - BGP status on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, sessions up: 9, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [12:34:48] !log Reinstalling lvs1003 with Ubuntu Precise [12:34:55] Logged the message, Master [12:37:23] PROBLEM - SSH on lvs1003 is CRITICAL: Connection refused [12:38:08] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:47:52] New patchset: Mark Bergsma; "Enable IPv6 for lvs1003 and lvs1006" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15792 [12:48:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15792 [12:53:18] !log Fixed boot order on lvs1003 [12:53:25] Logged the message, Master [12:54:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15792 [12:57:02] RECOVERY - SSH on lvs1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [12:58:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15730 [12:58:38] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15674 [12:59:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15683 [13:00:06] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15570 [13:06:02] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [13:07:42] !log changed order on ganglia swift view graphs to group by metric rather than host [13:07:49] Logged the message, Master [13:08:27] !log lvs1003 is back up and serving traffic [13:08:35] Logged the message, Master [13:09:02] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 10, down: 0, shutdown: 0 [13:09:53] !log changed auth URL for swift to use load balancer rather than round robin DNS [13:10:01] Logged the message, Master [13:15:02] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [13:15:06] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304 [13:15:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304 [13:33:09] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15763 [13:57:58] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [13:58:27] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304 [13:59:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304 [14:03:40] New review: Hashar; "Patchset 10 adds the $top_domain and $bits_domain which are interpreted in the bits.inc.vlc.erb temp..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304 [14:06:42] New review: Hashar; "I have no idea what you mean by adding configuration. My change was mostly to fix the syntax error w..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15445 [14:08:24] New patchset: Mark Bergsma; "Make lvs1001 configuration new-style, with IPv6, for after reinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15797 [14:08:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15797 [14:11:07] !log Stopped PyBal on lvs1001, failing over traffic to lvs1004 [14:11:14] Logged the message, Master [14:14:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15797 [14:14:53] new-style! [14:15:03] yeah i'm gonna do them all [14:15:18] i hate loose ends ;) [14:15:21] :) [14:15:23] me too! 
[14:15:34] speaking of loose ends, we still have static routes to the tunnels :) [14:15:43] i knew you were gonna say that :) [14:15:49] but that still needs a tiny bit of pybal development [14:15:55] currently pybal doesn't retract routes ever [14:16:34] PROBLEM - BGP status on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, sessions up: 9, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [14:16:34] we can always use quagga... [14:16:41] I like quagga :) [14:16:48] (I like pybal too :P) [14:18:07] i don't like quagga [14:18:12] after having to debug it several times :P [14:19:43] PROBLEM - SSH on lvs1001 is CRITICAL: Connection refused [14:20:53] New patchset: Mark Bergsma; "Cleanup eqiad LVS configuration, use a selector" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15799 [14:21:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15799 [14:21:42] New patchset: Mark Bergsma; "Cleanup eqiad LVS configuration, use a selector" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15799 [14:21:43] why are puppet git pushes so slow lately [14:22:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15799 [14:22:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15799 [14:24:55] how did you handle the puppet key signing in esams after reinstall? copy them from sockpuppet? [14:29:01] which one? 
[14:29:06] oh [14:29:08] hm, I think so [14:29:13] don't remember much [14:31:53] are you sure that you want to handle all servers yourself :) [14:32:02] (questionmark) [14:32:10] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:34:02] you can do one or two if you want ;) [14:34:05] there are two left [14:34:08] I don't mind either way [14:34:12] just want to get it done now ;) [14:34:46] if there are just two the overhead of me catching up will be bigger me thinks. [14:35:24] I was thinking, you did the previous half, I'll handle this one ;) [14:36:18] yeah, I want to finish up with the modules stuff today too [14:36:33] did you see that I introduced the first puppet module? :) [14:36:44] haven't merged it yet, I have to make sure sockpuppet/stafford will handle them [14:36:54] yep [14:36:59] i was thinking of doing the same actually :) [14:37:04] recently [14:37:10] just converting some simple stuff into modules to get started [14:37:10] hate hate hate the two puppetmasters, I'll probably fix that soon now too [14:37:19] yeah, I started with ntp :) [14:37:19] cool [14:43:16] PROBLEM - Host lvs1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:43] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 10, down: 0, shutdown: 0 [14:44:11] !log lvs1001 is back up and serving traffic [14:44:19] Logged the message, Master [14:44:22] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [14:46:35] !log Stopped PyBal on lvs1002, failing over traffic to lvs1005 [14:46:42] Logged the message, Master [14:47:00] !log Reinstalling lvs1002 with Ubuntu Precise [14:47:08] Logged the message, Master [14:48:13] !log Stopped PyBal on lvs4, failing over traffic to lvs3 [14:48:21] Logged the message, Master [14:49:50] New patchset: Mark Bergsma; "Convert lvs1002 configuration to new-style, with IPv6, for after reinstallation." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/15803 [14:50:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15803 [14:51:13] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, sessions up: 7, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [14:51:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, sessions up: 9, down: 1, shutdown: 0BRPeering with AS64600 not established - BR [14:51:57] !log Reinstalling lvs4 with Ubuntu Precise [14:52:04] Logged the message, Master [14:54:29] New patchset: Mark Bergsma; "Remove Precise test for eqiad LVS servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15804 [14:55:07] New patchset: Mark Bergsma; "Enable IPv6 for lvs4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15805 [14:55:12] PROBLEM - SSH on lvs1002 is CRITICAL: Connection refused [14:55:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15803 [14:55:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15804 [14:55:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15804 [14:55:47] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15805 [14:55:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15805 [14:56:33] PROBLEM - SSH on lvs4 is CRITICAL: Connection refused [14:59:07] installs in pmtpa are clearly faster than in eqiad [14:59:45] yeah [14:59:48] a lot faster [14:59:53] tftp roundtrip probably [15:00:00] no, that's just during boot [15:00:13] there's tftp in eqiad [15:00:18] but the ubuntu mirror is in pmtpa only [15:00:25] 26ms away [15:01:21] RECOVERY - SSH on lvs4 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:06:45] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [15:06:51] !log lvs4 is back up and serving traffic [15:06:59] Logged the message, Master [15:08:21] New patchset: Alex Monk; "Fix a logo screw up I made in change 15730." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15807 [15:08:42] RECOVERY - SSH on lvs1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:13:42] New patchset: Hashar; "rsyslog should send logs to $::syslog_server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14090 [15:14:11] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/14090 [15:15:45] !log Fixed serial console redirection after boot to OFF on lvs1002 [15:15:53] Logged the message, Master [15:16:00] New patchset: Hashar; "rsyslog should send logs to $::syslog_server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14090 [15:16:35] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/14090 [15:16:48] PROBLEM - Host lvs1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:06] New review: Hashar; "Renamed $syslog_server to $syslog_remote_server" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/14090 [15:18:47] !log lvs1002 is back up and idling [15:18:55] Logged the message, Master [15:18:57] there [15:18:58] all done [15:19:21] RECOVERY - Host lvs1002 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [15:19:39] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 10, down: 0, shutdown: 0 [15:30:32] mark: do you have any plan for Lucid -> Precise upgrades? [15:47:06] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [15:47:34] New review: Reedy; "Wouldn't it make more sense for the php script to have the dependency on a php module?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13048 [15:49:57] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [16:12:04] checkin [16:32:26] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:02] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [16:35:07] New patchset: Cmjohnson; "Adding db63 -db77 to the dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15748 [16:35:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15748 [16:36:11] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [16:48:36] New review: RobH; "There are whitespace errors, you may want to enable whitespace notation in your editor." 
[operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/15748 [17:03:11] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [17:06:56] PROBLEM - Puppet freshness on srv193 is CRITICAL: Puppet has not run in the last 10 hours [17:12:54] New patchset: Cmjohnson; "Adding db63 -db77 to the dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15748 [17:13:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15748 [17:19:25] New patchset: Cmjohnson; "Adding db63 -db77 to the dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15748 [17:20:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15748 [17:24:07] New review: RobH; "whitespaces slain, all is good" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/15748 [17:28:27] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15748 [17:35:56] db63-77? [17:36:04] I don't think I'm the right person to ask about those [17:36:13] asher likely is [17:36:29] but, the ones created for labs should likely not use the es.cfg [17:36:40] New patchset: Pyoungmeister; "role/apache.pp making realm a qualified var and cleaning up some stuff that's no longer used" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15814 [17:37:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15814 [17:43:27] New patchset: Cmjohnson; "adding db63-77 to use the es.cfg in netboot.cfg file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15816 [17:44:01] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15816 [18:03:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15814 [18:41:23] PROBLEM - Puppet freshness on srv194 is CRITICAL: Puppet has not run in the last 10 hours [18:51:08] RECOVERY - Puppet freshness on srv194 is OK: puppet ran at Tue Jul 17 18:50:56 UTC 2012 [19:14:15] cmjohnson1: not for quite a while (month or so, I think). The need isn't particularly high since containers and accounts are now 100% on the new ones. [19:14:53] I'm moving object storage over to the new ones more slowly, and don't want to put the drives in 1-5 until that's more even. [19:15:04] thanks for asking. I appreciate the followup. [19:16:09] I think it won't mount? [19:16:16] one sec. [19:17:23] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [19:19:05] cmjohnson1: in /var/log/dmesg I get "spinning up disk......not responding" and then later "[ 225.803392] sd 0:0:8:0: [sdk] READ CAPACITY(16) failed" [19:20:39] if you're swapping disks atm., I might have a few more for you. [19:22:08] k. gimme 2 minutes to check the other swift hosts. [19:23:28] yeah, there are 5 that show up in puppet as failing. [19:23:54] I'll add the full list to rt-3230 [19:33:08] cmjohnson1: maybe nevermind. the failure characteristics of the other drives I'm looking at now are different and may be a config thing instead of a failed drive. [19:33:35] I'm still poking at it, [19:33:49] and I'll add them to the ticket if they're also bad. [19:51:59] cmjohnson1: I made a new ticket instead of appending to the old one since its subject was specific to ms-be5. [19:52:10] RT-3282 [19:57:17] thanks! [20:17:02] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [20:23:10] paranoid, robh, did one of you fix virt1004? I'm sure it was non responsive when I tried to build it, but now it seems to be up and working... 
[20:23:30] it has a failed disk [20:23:36] it's still failed, but it will work with a failed disk [20:23:52] andrew - it is paravoid, not paranoid ;-P [20:23:53] i have not touched it, and since no one else is onsite, it's just doin its raid thing with a bad disk i assume =] [20:24:59] woosters: Dammit, I changed to a new irc client and it has been autocorrecting me all day. Must disable! [20:25:13] RobH: OK, fair enough. I will avoid for now. [20:25:42] though i suspect u did it on purpose :-) [20:26:59] Dammit, not only does it autocorrect, but it waits to do it until I look away from the screen. [20:27:02] Now who's paranoid, I wonder? [20:27:54] * Damianz thinks paravoidia is taking over andrewbogott [20:31:32] Hey ops people, could https://rt.wikimedia.org/Ticket/Display.html?id=3143 be dealt with please? It's a shell access request that's been sitting in RT for almost a month, with a Gerrit change and all the right approvals and stuff [20:31:40] Although I suppose it still needs woosters's approval [20:33:52] New review: Hashar; "Tim, that is the basic 'sync' script in apache config that just triggers sync-apache." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/15440 [20:40:41] Ryan_Lane: I just sent you a bunch of review requests on Gerrit, including the secure.wm.o redirects stuff. My puppetization of Parsoid is basically ready, I'll submit it tomorrow [20:43:51] Reedy: Could you peek at https://gerrit.wikimedia.org/r/#/c/14744/ and https://gerrit.wikimedia.org/r/#/c/14748/ ? minor patches for wmf-config, I'd like to get them off my dashboard [20:44:55] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14748 [21:00:26] New patchset: Platonides; "Enable Captcha in wmflabs. Much more useful than complaining about bots and how account creation should be disabled (bug 38391)." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15839 [21:04:31] New patchset: Cmjohnson; "Adding db63-77 to netboot.cfg to use es.cfg." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15841 [21:05:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15841 [21:07:00] New patchset: Andrew Bogott; "Add a ceph-specific partman recipe." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15842 [21:07:35] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15842 [21:12:24] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15841 [21:12:38] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15816 [21:15:26] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:20:59] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [21:25:20] PROBLEM - SSH on virt1001 is CRITICAL: Connection refused [21:30:44] PROBLEM - Puppet freshness on ms3 is CRITICAL: Puppet has not run in the last 10 hours [21:32:10] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545 [21:32:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545 [21:33:57] New review: Hashar; "patchset 2 fixes a wrong path (thumbs -> upload6)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15545 [21:38:11] New review: Hashar; "Awesome. I should have thought about using Apache directly :( Might want to add that to wikitech wik..." 
[operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15720 [21:40:44] New review: Hashar; "Hmm maybe the mounts from projectstorage.pmtpa.wmnet should be of type fuse.glusterfs :-D" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15545 [21:43:02] PROBLEM - NTP on virt1001 is CRITICAL: NTP CRITICAL: No response from NTP server [21:48:26] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:50:05] PROBLEM - Host virt1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:51:37] New patchset: Hashar; "(bug 38391) enable Captcha in wmflabs." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15839 [21:52:03] New review: Hashar; "rewrote commit message" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/15839 [21:52:21] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15839 [21:52:38] RECOVERY - Host virt1003 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [21:59:05] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms [22:02:41] PROBLEM - SSH on virt1002 is CRITICAL: Connection refused [22:07:29] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:08:48] tough month [22:10:38] RECOVERY - SSH on virt1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:10:42] oops wrong window [22:10:47] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.43 ms [22:18:45] RECOVERY - SSH on virt1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:24:16] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14744 [22:29:55] New patchset: Andrew Bogott; "2nd attempt at changing partman for ceph" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15851 [22:30:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15851 [22:30:47] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15851 [22:36:45] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:38:51] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [22:41:15] PROBLEM - NTP on virt1002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:41:58] Ryan_Lane: the above you ? [22:42:17] no. likely andrewbogott [22:42:18] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms [22:46:00] well as long as it's nothing to worry about [22:46:30] PROBLEM - SSH on virt1001 is CRITICAL: Connection refused [22:47:03] [23:06:09] RECOVERY - SSH on virt1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:06:54] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [23:15:54] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:24:46] !log on srv193: removed core dump files, disabled core dumping, restarted apache [23:24:53] Logged the message, Master [23:28:51] paravoid: If you are awake, can you look at a partman script for me? [23:29:02] virt-raid10-cisco-ceph.cfg in the puppet head [23:39:31] New patchset: Catrope; "Add a service class for Parsoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15856 [23:40:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15856 [23:40:30] RobH: So I filed that procurement ticket https://rt.wikimedia.org/Ticket/Display.html?id=3271 [23:40:40] Hoping you've still got a machine in Tampa [23:42:43] RoanKattouw: ohai:D [23:42:49] Hey hey! [23:42:55] * r0csteady is at oscon [23:43:00] Ah! [23:43:05] * RoanKattouw sadly is not [23:43:16] I spoke at OSCON last year though [23:43:20] noice! 
[23:43:53] Just got back from Wikimania (in DC), I managed to get sick immediately before and immediately after that trip but not during :S [23:44:35] awww [23:44:40] hope you feel better soon! [23:44:58] I'm back at work already, they were brief 12-hour things [23:45:28] Although... I hope that's what this one's gonna be, I don't quite feel 100% yet [23:45:56] hm… notpeter, partman help? [23:51:17] New patchset: Andrew Bogott; "Increasingly frantic attempts to get partman's attention" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15858 [23:51:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15858 [23:52:07] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15858 [23:55:30] r0csteady: Did you hear about the Puppet Labs party tonight after the conference? [23:57:37] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [23:59:25] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours