[00:08:31] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 00:08:28 UTC 2013
[00:09:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[00:09:41] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 00:09:37 UTC 2013
[00:10:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[00:10:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 00:10:40 UTC 2013
[00:11:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[00:11:41] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 00:11:37 UTC 2013
[00:12:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[00:12:31] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 00:12:27 UTC 2013
[00:13:08] TimStarling: ping
[00:13:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[00:13:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 00:13:41 UTC 2013
[00:13:54] TimStarling: I've got a HHVM question for you
[00:14:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[00:14:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 00:14:44 UTC 2013
[00:15:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[00:30:26] preilly: If you just ask the question, I'm sure TimStarling will respond when hes sees it
[00:43:42] RECOVERY - Puppet freshness on mc15 is OK: puppet ran at Fri May 17 00:43:36 UTC 2013
[00:44:52] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 00:44:43 UTC 2013
[00:45:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[00:49:47] drwxr-xr-x 2 aaron wikidev 4096 May 14 20:12 5a
[00:49:47] drwxr-xr-x 2 aaron wikidev 4096 May 14 20:14 67
[00:49:51] on 1.22wmf4
[00:50:37] !log reedy synchronized php-1.22wmf3/maintenance/checkUsernames.php
[00:50:46] Logged the message, Master
[00:52:01] Can someone add group write recursively to /a/common/php-1.22wmf4/.git/objects/5a and /a/common/php-1.22wmf4/.git/objects/67 on tin please?
[01:14:49] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 01:14:48 UTC 2013
[01:15:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[02:07:15] !log LocalisationUpdate completed (1.22wmf4) at Fri May 17 02:07:14 UTC 2013
[02:07:24] Logged the message, Master
[02:11:42] New patchset: Diederik; "Set cwd for git log when determining version info" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64252
[02:12:45] !log LocalisationUpdate completed (1.22wmf3) at Fri May 17 02:12:44 UTC 2013
[02:12:54] Logged the message, Master
[02:13:19] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours
[02:13:19] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[02:17:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:18:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[02:33:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 17 02:33:21 UTC 2013
[02:33:30] Logged the message, Master
[02:39:03] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours
[02:39:09] New review: MZMcBride; "Heh, I had a tingling feeling about 'wikipedia' =>, but I couldn't figure it out. I guess there's no..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64109
[02:39:22] New review: MZMcBride; "This is a follow-up to I2d571028c3fbf5a9f5a27c7a68e5048db49ab122." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64109
[02:40:31] New review: MZMcBride; "Follow-up changeset: Idd62b43535cf1b19127421c2f10d3caee4c38f79." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63877
[02:40:42] New review: MZMcBride; "... in Idd62b43535cf1b19127421c2f10d3caee4c38f79." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64110
[03:17:39] hi, could i push ext/zero out now? should fix the bug we are seeing in logs
[03:20:07] greg-g, ^
[03:22:54] TimStarling, ^?
[04:02:01] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[04:08:01] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 04:07:55 UTC 2013
[04:08:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[04:08:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 04:08:50 UTC 2013
[04:09:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[04:09:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 04:09:40 UTC 2013
[04:10:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[04:10:31] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 04:10:20 UTC 2013
[04:11:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[04:14:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 04:14:49 UTC 2013
[04:15:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[04:47:55] yurik: i think as long as no one here has said anything about a conflicting deploy then you can iff you can also stay for a bit (an hour at least I guess) to watch for new problems and clean up whatever you broke
[04:48:45] jeremyb, heh, sounds like a good plan :)
[04:49:44] but i really would like to get someone in ops to ok this
[04:50:10] yurik: heya, what jeremyb said is probaby good (but don't tell anyone that it is Friday where you live ;) )
[04:50:19] what do you need reviewed/
[04:50:20] ?
[04:50:32] (not that I can really review it)
[04:50:55] greg-g, nothing reviewed - i just cehcked, it seems the mw-config with my change has gone live already, so i'm all good to go - just my extension
[04:51:04] * greg-g nods
[04:51:16] ok, here i go
[04:51:18] yeah, pgehres|away did a deploy this afternoon
[04:51:30] good :)
[04:51:43] i was worried about config thing
[04:51:52] (added a new debug log entry "zero"
[04:52:03] for the file named zero
[04:52:12] sounds reasonable
[04:52:21] hope i don't have to do anything crazy like set up dir permissions for the new log
[04:52:29] * greg-g doesn't know
[04:52:39] i seriously doubt it :)
[04:56:02] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours
[05:06:06] !log yurik synchronized php-1.22wmf3/extensions/ZeroRatedMobileAccess/includes/PageRenderingHooks.php
[05:06:14] Logged the message, Master
[05:09:37] greg-g, seems like its up and running. One problem though - I tried to sync up wmf4, but I can't do git pull in that dir
[05:09:54] error: insufficient permission for adding an object to repository database .git/objects
[05:13:30] someone didn't set their umode before deploying
[05:13:42] you need a root i thinks
[05:14:20] (not sure exactly how it's laid out. and maybe there are other ways. but Reedy usually just hunts down a root)
[05:17:08] thx jeremyb ! will ask Reedy to reset perms on wmf4 dir or something
[05:17:24] not a big deal - wmf4 is not heavily live yet
[06:27:39] PROBLEM - Puppet freshness on colby is CRITICAL: No successful Puppet run in the last 10 hours
[06:29:29] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 06:29:24 UTC 2013
[06:30:19] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[06:30:49] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 06:30:48 UTC 2013
[06:31:19] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[06:31:19] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 06:31:18 UTC 2013
[06:32:19] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[08:02:10] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[08:02:10] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[08:02:10] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[08:07:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 08:07:46 UTC 2013
[08:08:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[08:08:31] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 08:08:27 UTC 2013
[08:09:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[08:15:01] PROBLEM - Puppet freshness on gallium is CRITICAL: No successful Puppet run in the last 10 hours
[08:16:01] PROBLEM - Puppet freshness on db1017 is CRITICAL: No successful Puppet run in the last 10 hours
[08:16:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 08:16:48 UTC 2013
[08:17:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[08:18:38] New patchset: ArielGlenn; "adapt redis template for labs use, update labs redis role settings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64267
[08:44:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 08:44:45 UTC 2013
[08:45:11] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[08:52:53] apergos: finally got an internet connection :-]
[08:59:13] New patchset: ArielGlenn; "adapt redis role and redis template for lab use" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64267
[09:00:10] hey there
[09:02:01] hmm overcast again, wonder if it will rain
[09:04:05] ah puppetization nice
[09:09:55] well we shall see if is nice :-D
[09:14:52] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 09:14:44 UTC 2013
[09:15:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[10:43:58] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[10:44:13] [06:14:20] (not sure exactly how it's laid out. and maybe there are other ways. but Reedy usually just hunts down a root)
[10:44:13] [06:17:08] thx jeremyb ! will ask Reedy to reset perms on wmf4 dir or something
[10:44:26] Presumably the same ones I was complaining about a few hours before
[10:52:07] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63866
[10:56:25] LordOfLight: why are you supporting stennis ?
[10:56:47] eh?
[10:57:09] what is stennis?
[10:57:33] oh, with a nick like that i was thinking of game of thrones
[11:01:45] no spoilers!
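The `.git/objects` permission errors above (the 00:52 request to add group write, and yurik's 05:09 `git pull` failure) come from a deployer committing with a restrictive umask in the shared clone. The sketch below reproduces the failure mode in a scratch directory; `/tmp/perm-demo` is illustrative, not the real layout on tin.

```shell
# Reproducing the failure: with a restrictive umask, files written under
# .git/objects come out group-unwritable, so the next deployer's pull dies
# with "insufficient permission for adding an object to repository database".
rm -rf /tmp/perm-demo && mkdir -p /tmp/perm-demo && cd /tmp/perm-demo

umask 077 && touch obj-restrictive    # what a 077 umask produces (mode 600)
umask 002 && touch obj-shared         # what a group-friendly umask produces (664)
stat -c '%A %n' obj-restrictive obj-shared

# The one-off repair (roughly what was done on tin):
chmod -R g+w /tmp/perm-demo

# A durable fix for a shared clone: mark the repo group-shared so git
# chmods new repository files group-writable regardless of umask.
git init -q /tmp/perm-demo
git -C /tmp/perm-demo config core.sharedRepository group
```

With `core.sharedRepository group` set, individual deployers' umasks stop mattering for files git itself writes, which would make the recurring "hunt down a root" step unnecessary.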
[11:03:21] haha, Reedy
[11:04:02] apparently you added the milionth article to the Spanish Wikipedia
[11:05:35] New patchset: Hashar; "tweak jenkins slave authorization key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64272
[11:24:35] New patchset: Hashar; "tweak jenkins slave definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64272
[11:42:17] LeslieCarr: hi! can you help with some bad output from wikidata.org being stuck in squid/varnish?
[11:42:27] ...and perhaps in the process enlighten my about cache control headers?
[11:42:52] paravoid: or you, maybe?
[11:44:29] <^demon> DanielK_WMDE already asked me, but I'm a bit at a loss tbh :)
[11:44:37] PROBLEM - DPKG on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:45:27] RECOVERY - DPKG on mc15 is OK: All packages OK
[11:47:34] can I help?
[11:49:19] mark: oh hey!
[11:49:45] possibly. basically, https://www.wikidata.org/wiki/Special:EntityData/Q60.rdf and https://www.wikidata.org/wiki/Special:EntityData/Q60 have bad output stuck in squid/varnish
[11:50:06] interestingly, get get the correct result in firefox. but with wget, the result is empty
[11:50:13] adding ?foo to the url gets me the correct reult in wget too
[11:50:47] adding maxage=0 gets the right response, but doesn't seem to purge the bad entry from the cache
[11:50:51] or at least not from all the caches
[11:51:06] of course not, it's a different url :)
[11:51:21] ugh :/
[11:51:32] so, what's the best way to purge special page output?
[11:51:55] note that that special page sets the cache control header. currently, to $wgSquidMaxAge == 31 days.
[11:52:05] (i'm abotu to add a config variable there)
[11:52:10] and it also sets a vary header on Accept-Encoding
[11:52:16] which is probably why firefox works, wget doesn't
[11:52:18] (gzip encoding)
[11:52:38] mediawiki may set that, the special page doesn't
[11:52:41] but yea, makes sense
[11:53:09] did you try ?action=purge ?
[11:53:14] i don't think it works for special pages
[11:53:16] but I'm not sure
[11:53:28] hm... no, didn't try. didn't think it would work :)
[11:53:51] looks like it didn't work
[11:53:52] nope
[11:53:56] still broken
[11:54:57] how about new Title("Special:EntityData/Q60")->purgeSquid()?
[11:55:05] i could patch that in
[11:55:17] actually, i could patch that in as a reaction of action=purge being passed to the special page
[11:55:35] sounds hacky
[11:55:40] why?
[11:56:16] hm, i'm curious what Title::getSquidUrls will return for a special page with subpage syntax :)
[11:57:45] mark: eventually, we'd want to purge these thigns from squid whenever the data changes - but that means purging .../Q123, .../Q123.json, .../Q123.xml, .../Q123.rdf, .../Q123.n3, etc etc...
[11:58:06] but that's for later- for now, i'd be already happy if i could purge one particular rendering
[11:58:13] then you better come up with a really solid and scaleable solution for that instead of hoping it will magically fix it self like ULS :)
[11:58:41] you should do that first, not last
[11:59:01] mark: in the case of ULS, we supposed that it's a foundation project, and they were Doing It Right.
[11:59:17] there are far fewer serialization formats than there are languages, so i don't think it's that much of a problem
[11:59:18] anyway
[12:00:13] mark: the purging when the data is updated isn't critical. that's just an aside. but now we have broken output stuck there. and it's stuck for far longer than I though it would be ($wgSquidMaxAge == 31 days)
[12:00:19] ULS magically fix self?
[12:00:29] Nemo_bis: nope
[12:00:35] http://www.wikidata.org/wiki/Special:EntityData/Q60.rdf
[12:00:39] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64272
[12:00:43] Date: Thu, 16 May 2013 19:26:34 GMT
[12:00:48] this was served by mediawiki/apache yesterday
[12:00:53] the empty response
[12:01:19] yes
[12:01:26] the code was broken (premature flush)
[12:01:30] we fixed that
[12:01:40] now we want the broken responses gone from the cache
[12:02:01] it'S not totally critical, it's not really used yet. it just randomly interferes with testing
[12:02:28] and i just don't know how many broken responses are cached. i just know of two versions of Q60. there may be many more.
[12:02:55] mark: to me it's realyl a generaly question - if I have a URL, can I purge it? Or rather, can you? who can?
[12:03:11] I believe I just purged the Q60 url
[12:04:04] there's a maintenance script for purging
[12:04:06] mark: looks like it! awesome! can you do that again without the .rdf at the end?
[12:04:08] purgeList.php
[12:04:11] just did
[12:04:15] thanks
[12:04:48] ok, so, as long as we know the exact urls to purge, that's doable. but finding all urls that need purging is going to be hard
[12:06:10] yup
[12:06:37] mark: right. thanks again!
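The hard part mark and DanielK_WMDE land on above is not the purge itself (purgeList.php handles that) but enumerating every cached rendering of an entity. A small helper along these lines could do the expansion; `entity_purge_urls` is a hypothetical function, and the format list is taken from the variants mentioned in the conversation (.json, .xml, .rdf, .n3 plus the extensionless page).

```shell
# Hypothetical helper: expand one entity ID into the URL of every cached
# rendering of Special:EntityData that would need purging.
entity_purge_urls() {
    local base='https://www.wikidata.org/wiki/Special:EntityData' id="$1"
    printf '%s/%s\n' "$base" "$id"
    for fmt in json xml rdf n3; do
        printf '%s/%s.%s\n' "$base" "$id" "$fmt"
    done
}

entity_purge_urls Q60

# On a deploy host, the list could then feed the maintenance script mark
# used, e.g. (sketch, not verified against the 2013 tooling):
#   entity_purge_urls Q60 | mwscript purgeList.php --wiki=wikidatawiki
```

This only covers the URL-variant dimension; the `Vary: Accept-Encoding` issue discussed earlier is handled by the caches themselves, since an HTTP purge drops all stored variants of a URL.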
[12:06:49] <^demon> Yay :)
[12:06:54] * DanielK_WMDE gallops off to make this more flexible
[12:07:55] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 12:07:53 UTC 2013
[12:08:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[12:08:55] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 12:08:50 UTC 2013
[12:09:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[12:09:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 12:09:40 UTC 2013
[12:10:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[12:10:26] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 12:10:24 UTC 2013
[12:11:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[12:13:45] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[12:13:45] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours
[12:15:05] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 12:15:01 UTC 2013
[12:15:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[12:24:55] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Fri May 17 12:24:51 UTC 2013
[12:25:10] Can someone add group write recursively to /a/common/php-1.22wmf4/.git/objects/5a and /a/common/php-1.22wmf4/.git/objects/67 on tin please?
[12:25:19] i'll have a look
[12:25:38] AaronSchulz seems to have a bad umask
[12:26:09] done
[12:26:29] Thanks
[12:26:30] Reedy: perhaps some git hook could check for that? ;)
[12:26:37] heh
[12:26:44] ^demon: It's getting worse...
[12:26:46] Fetching submodule extensions/ZeroRatedMobileAccess
[12:26:46] error: The requested URL returned error: 403 while accessing https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ZeroRatedMobileAccess.git/info/refs
[12:26:46] fatal: HTTP request failed
[12:26:57] Meh, worked 2nd time
[12:27:14] <^demon> I still say it's not gerrit's fault.
[12:27:53] Hmm, no. Zero is doing it on demand
[12:28:09] Why is it only a couple of repos that seem to be borked?
[12:28:46] <^demon> Nothing's wrong with those repos :\
[12:28:56] !log reedy synchronized php-1.22wmf4/maintenance/checkUsernames.php
[12:29:04] Logged the message, Master
[12:30:40] New patchset: Hashar; "contint: fix openjdk packages names" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64280
[12:31:24] reedy@terbium:~$ foreachwiki
[12:31:24] /usr/local/bin/foreachwikiindblist: line 4: /a/common/all.dblist: No such file or directory
[12:31:38] It's only in /a/common on tin...
[12:37:24] Reedy: can you merge https://gerrit.wikimedia.org/r/#/c/64221/ so that I fix what I broke?
[12:39:52] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours
[12:43:06] New patchset: Hashar; "contint: jenkins slave user had no home" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64282
[12:45:02] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 12:45:00 UTC 2013
[12:45:12] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[12:48:59] New patchset: Petrb; "motd is now project wide" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64285
[12:53:27] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64221
[12:55:00] ^demon: Won't let me update ProofreadPage now..
[12:55:15] !log reedy synchronized wmf-config/InitialiseSettings.php
[12:55:23] Logged the message, Master
[12:56:08] <^demon> I'm wondering if we could do all the submodules as ssh.
[12:56:21] <^demon> Wouldn't hurt, and I know ssh tin -> manganese works.
[12:57:09] wget gerrit.wikimedia.org works fine ;)
[12:57:14] <^demon> Yeah ;-)
[12:57:34] New review: coren; "profile.d can mess with noninteractive sessions if it outputs stuff; this would be better done in up..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/64285
[12:58:00] <^demon> Adjusting make-wmf-branch to create the submodules with ssh would be trivial.
[12:58:07] <^demon> Adjusting the current clone would be a bit more annoying.
[12:58:36] move it
[12:58:38] reclone
[12:58:41] move l10n back in
[12:58:45] <^demon> Or that.
[12:58:47] push for consistency
[12:59:12] <^demon> I was thinking something like edit .gitmodules then `git submodule foreach git remote set-url origin ` `git remote update --init`
[12:59:22] <^demon> But your way is easier.
[12:59:46] won't take too long to do either :D
[13:18:57] New patchset: Petrb; "motd is now project wide" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64285
[13:20:40] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64285
[13:27:13] Reedy, hi, i couldn't push out wmf4 yesterday because of an issue with the git repo. I didn't want to break it, so left it as is. git pull was giving perm error
[13:27:21] I saw
[13:27:31] do you know what caused it?
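^demon's in-place option (edit `.gitmodules`, then update each submodule's remote) can be sketched as below against a scratch copy of one `.gitmodules` entry. The ssh URL form with port 29418 is gerrit's usual ssh daemon but is an assumption here, not taken from the log; in a real checkout, access would need verifying first.

```shell
# Rewrite submodule remotes from anonymous https to authenticated ssh.
# Runs in a throwaway directory against a sample .gitmodules entry.
cd "$(mktemp -d)"
cat > .gitmodules <<'EOF'
[submodule "extensions/ZeroRatedMobileAccess"]
    path = extensions/ZeroRatedMobileAccess
    url = https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ZeroRatedMobileAccess.git
EOF

sed -i 's|https://gerrit.wikimedia.org/r/p/|ssh://gerrit.wikimedia.org:29418/|' .gitmodules
grep 'url =' .gitmodules

# In the real checkout, the edit would then be propagated with:
#   git submodule sync            # copy the new URLs into .git/config
#   git submodule update --init
```

`git submodule sync` does the same job as the `git submodule foreach git remote set-url origin ...` loop ^demon sketched, reading the new URLs straight out of `.gitmodules`.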
[13:27:37] bad umask by AaronSchulz
[13:27:42] I reported the same problem in here a few hours before you
[13:27:50] mark fixed it an hour or so ago
[13:27:57] But you can't currently update Zero either
[13:29:01] that's ok - wmf4 has not been pushed to wikipedias yet
[13:30:01] Reedy, i am more worried about the 76 Warning: Recursion detected in RequestContext::getLanguage in /usr/local/apache/common-local/php-1.22wmf3/includes/context/RequestContext.php on lin
[13:30:02] e 281
[13:30:09] Why?
[13:30:18] It's been happening for a while
[13:30:19] !log Zuul: applying project templates {{gerrit|63674}}
[13:30:28] Logged the message, Master
[13:30:36] i'm wondering if that's my call that's causing it
[13:30:57] doubtful, but since i don't see the callstack...
[13:32:38] We should probably try and set PHP up to log the callstack somewhere
[13:33:56] yes, that's what grownup devs have always wanted for their birthdays :)
[13:34:45] Reedy, one question though - wmf4 is scheduled to go live on monday. Will it pick up the latest wmf4, including submodules, or will they simply deploy whatever is in wmf4 dir on tin?
[13:34:59] I usually make sure things are all upto date
[13:35:59] ok, pls make sure the zero submodule is updated - i have already comited the wmf4 ver. Thx!
[13:36:17] PROBLEM - RAID on analytics1016 is CRITICAL: Timeout while attempting connection
[13:37:47] PROBLEM - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100%
[13:37:54] As I said, it's currently broken
[13:38:27] reedy@tin:/a/common/php-1.22wmf4$ git submodule update extensions/ZeroRatedMobileAccess
[13:38:27] error: The requested URL returned error: 403 while accessing https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ZeroRatedMobileAccess.git/info/refs
[13:38:28] fatal: HTTP request failed
[13:38:28] Unable to fetch in submodule path 'extensions/ZeroRatedMobileAccess'
[13:38:28] reedy@tin:/a/common/php-1.22wmf4$
[13:38:37] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:39:27] PROBLEM - RAID on analytics1017 is CRITICAL: Timeout while attempting connection
[13:39:36] Reedy, i understand, but i thought you were fixing perms there?
[13:39:52] It's not permissions
[13:39:55] Read the error?
[13:40:02] And I said, the permission issue is already fixed
[13:40:57] PROBLEM - Host analytics1017 is DOWN: PING CRITICAL - Packet loss = 100%
[13:41:57] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[13:43:30] ohh. oops, sorry, didn't read it... and Reedy, i 'm getting http 406 when i click git link (i wonder if that's authentication). Regardless, I am not sure what the cause of the error is - git is down again?
[13:45:35] yurik: a change was merged ~10m ago
[13:48:23] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64280
[13:49:28] p858snake|l, thanks!
[13:49:51] my concern is that i don't know what to do in case i see such errors during deployment
[13:50:19] best: scream loudly
[13:51:42] thx mark! I know how to do that well! :)
[13:56:51] get a mat out and start sending the smoke signals towards Reedy
[13:58:10] yurik: then open up a tab account at the closest stroopwafel provider near Reedy
[13:58:40] * yurik googles stroopwafel
[13:58:51] oooo!
[13:58:52] yam
[13:59:56] I suspect WMNL will have sourced many for next week
[14:01:58] New patchset: Hashar; "contint: jenkins slave user had no home" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64282
[14:02:47] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[14:10:45] Change abandoned: Andrew Bogott; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64105
[14:11:17] New review: Milimetric; "This will fix the problem we had getting the version to show in prod." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/64252
[14:14:14] preilly: I have made jenkins to block sartoris changes whenever pyflakes fail
[14:14:39] Reedy, i will git update the zero submodule on tin, but won't push out anything
[14:15:02] will let you do the honors on monday :)
[14:22:06] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64282
[14:22:59] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64252
[14:45:57] New patchset: Faidon; "Varnish: remove send_timeout=30, rely on the default (600s)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64314
[14:47:04] New patchset: Faidon; "Varnish: remove send_timeout=30, rely on the default (600s)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64314
[14:48:04] New review: Faidon; "Troubleshooted with mark, implies at least a +1 :)" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/64314
[14:48:04] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64314
[14:56:17] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours
[14:56:19] New review: Anomie; "It seems the _SOURCE names won't work on terbium, for example. But it doesn't work now either, so I'..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55059
[14:58:14] New patchset: Ottomata; "Making misc::limn::instance more configurable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64316
[14:58:38] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64316
[14:59:24] !log setting Varnish send_timeout to 600 on upload, mobile, bits (eqiad), upload, bits (esams) for both frontend & backend
[14:59:29] did I miss anything?
[14:59:33] Logged the message, Master
[15:04:16] New patchset: Ottomata; "Fixing typo in variable name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64318
[15:04:29] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64318
[15:26:14] New patchset: Ottomata; "Piping stderr to limn log file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64319
[15:26:30] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64319
[15:41:34] New patchset: Ottomata; "Ensuring mod proxy_http is enabled for statistics apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64321
[15:41:42] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64321
[16:05:48] New patchset: Ottomata; "Not redefining $base_directory in limn::instance define" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64324
[16:06:48] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64324
[16:08:00] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 16:07:58 UTC 2013
[16:08:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[16:09:10] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 16:09:00 UTC 2013
[16:09:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[16:10:00] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 16:09:57 UTC 2013
[16:10:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[16:10:50] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 16:10:47 UTC 2013
[16:11:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[16:12:10] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 16:12:04 UTC 2013
[16:12:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[16:15:00] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 16:14:54 UTC 2013
[16:15:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours
[16:24:56] say you have a screen session on a server with an always on irssi client. you connect via ssh and `screen -R freenode` to enter the session, any suggestions for how to get ping's (highlights) forwarded into the desktop linux machine your usings notification system?
[16:26:32] it would seem it might need a few pieces, perhaps when you connect you forward a particular port from the server back to your machine. irssi could send the notification over that port and a server locally could receive it and generate the notification, i could write that up but was thinking theres probably a simpler solution
[16:26:46] s/server locally/daemon locally/
[16:27:35] I know on Windows and Mac there are tools for doing this
[16:28:00] PROBLEM - Puppet freshness on colby is CRITICAL: No successful Puppet run in the last 10 hours
[16:28:13] doh i just realized i'm asking in the wrong channel, meant to ask in #irssi :P
[16:28:14] ebernhardson: Did you try googling? http://jonathanbeluch.com/blog/2011/03/remote-notify-irssi-screen/
[16:28:25] but thanks i will check that out :)
[16:35:11] RECOVERY - Disk space on ms-be9 is OK: DISK OK
[16:36:07] sudo -u apache mwscript checkUsernames.php zhwiktionary | tee ~/checkUsernames.log
[16:36:25] sudo -u apache ./foreachwiki checkUsernames.php | tee ~/checkUsernames.log
[16:36:26] even
[16:36:50] Any idea why all that gets appended to the log is the output from the top foreachwiki script? Both appear in the log
[16:37:14] reedy@terbium:~$ sudo -u apache mwscript checkUsernames.php zhwiktionary | tee ~/checkUsernames.log
[16:37:14] zhwiktionary: 120: '約翰 可比西都羅農'
[16:37:14] reedy@terbium:~$ cat ~/checkUsernames.log
[16:37:14] reedy@terbium:~$
[16:39:17] I'm guessing it's due to where the output is being written, or not, in this case
[16:40:28] Hmm, 2>&1 fixes it
[16:40:36] sudo -u apache mwscript checkUsernames.php zhwiktionary 2>&1 | tee ~/checkUsernames.log
[16:40:55] Reedy: having a nice conversation with yourself?
[16:40:59] Yup
[16:41:21] * Reedy high fives himself
[16:41:24] <^demon> I assume most people are pretty interested in what they have to say ;-)
[16:41:44] * pgehres calls a physciatrist for Reedy
[16:41:52] mutante: is this the security issue being fixed (and are you ok/not ok with me linking to it in my message)? https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1179943
[16:42:13] that must be it, it was released on the 15th
[16:42:13] you've just posted it in public..
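Reedy's empty log above has a one-line explanation: the script printed its report on stderr, and a pipe only carries stdout, so `tee` saw nothing until `2>&1` merged the streams before the pipe. A minimal reproduction (the `emit` function is a stand-in for `mwscript`, not the real script):

```shell
# A pipe carries only stdout: anything written to stderr still reaches the
# terminal (which is why the output "appeared"), but never reaches tee.
emit() { echo "to stdout"; echo "to stderr" >&2; }

emit | tee /tmp/only-stdout.log >/dev/null        # stderr bypasses the pipe
emit 2>&1 | tee /tmp/both.log >/dev/null          # merge stderr into stdout first

cat /tmp/only-stdout.log
cat /tmp/both.log
```

Note the order matters: `2>&1` must appear before the `|` (on the left-hand command), since redirections are applied to that command before the pipe is consulted.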
[16:42:28] yeah, it's a public bug, if anyone knows we use Ubuntu, they know what the issue is
[16:42:43] or, really, if they know we use the linux kernel :)
[16:42:59] It was just confusing as you were asking if you could post it somewhere else at the same time
[16:43:23] yeah, true, nevermind :)
[16:43:50] eh, i think Reedy already answered what i thought:)
[16:45:07] commonswiki: 10712: 'Ff02::3'
[16:45:10] We have some awesome usernames
[16:46:01] huh
[16:46:15] I'm running a script to get a list of all invalid usernames
[16:46:26] That looks like a variant of an IPv6 address
[16:46:55] IPv6 multicast
[16:47:11] commonswiki: 1287235: 'ɑdmins eating elephant poo'
[16:47:13] * Reedy grins
[16:49:07] <^demon> !log rebooting antimony
[16:49:15] Logged the message, Master
[16:50:51] PROBLEM - Host antimony is DOWN: CRITICAL - Host Unreachable (208.80.154.7)
[16:51:20] Reedy: all from commons so far, heh
[16:51:34] There's more than that
[16:51:36] They were just amusing ;)
[16:51:41] * greg-g nods
[16:51:41] RECOVERY - Host antimony is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[16:52:06] 54 and we're onto dewiki
[16:54:24] dewiki: 364940: 'WP:CU'
[16:55:57] heh
[16:55:58] Reedy: what are you planning to do with this list of usernames?
[16:56:04] DELETED!
[16:56:20] pgehres: Post them on a bug
[16:56:29] And decide if I care enough to work out a plan to "fix" them
[16:56:42] https://bugzilla.wikimedia.org/show_bug.cgi?id=3507
[16:56:56] Ah, k. I'll be interested to see how many of them are targets of the SUL finalisation
[16:57:48] Reedy: http://en.wikipedia.org/wiki/User:Recentchanges :)
[16:58:14] typo squatter in user space,, hehe
[16:59:30] nice
[16:59:45] 377
[16:59:47] We're onto enwiki
[17:00:27] pgehres: CA bug for you in -tech
[17:04:00] New patchset: Odder; "(bug 48308) Change namespace settings for ukwikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64336
[17:04:38] gah, i search for africa (looking for south african chapter website) and all i get is RT 1081 :-P
[17:12:12] robh: have time to walk through Cerium disk addition?
[17:23:24] cmjohnson1: sorry, was moving into office
[17:23:26] !log rebooting kaulen (Bugzilla) in 5 minutes, please save your bugs and expect 5 minutes downtime
[17:23:27] here now
[17:23:36] Logged the message, Master
[17:23:47] mutante: Do you have any checklist of all servers?
[17:23:56] list a dsh loop of kernel versions ?
[17:24:12] robh: cool
[17:24:15] RobH: i'm creating it right now, i made a temp. dsh group called "pubservers"
[17:24:33] well, why not do all servers, and have the list be name, kernel, ip?
[17:24:42] then we have a single master list and can tackle it via google spreadsheet?
[17:27:15] mutante, there's actually a bug for something to do on bugzilla whilst it's down
[17:27:44] https://bugzilla.wikimedia.org/show_bug.cgi?id=47013
[17:29:44] PROBLEM - Host snapshot1001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:30:01] !log rebooting kaulen
[17:30:09] Logged the message, Master
[17:30:37] <^demon> !log rebooting formey (svn) in 2 minutes...why are you still using SVN?
[17:30:45] *G*
[17:30:46] Logged the message, Master
[17:31:06] New review: ArielGlenn; "This change has been tested right out of gerrit in labs, but I'd still prefer that it get reviewed b..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64267
[17:31:34] Thehelpfulone: would you know where this actually is though?
[17:31:53] i don't know the context at all yet [17:32:04] PROBLEM - DPKG on snapshot1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:32:04] PROBLEM - Host kaulen is DOWN: PING CRITICAL - Packet loss = 100% [17:32:33] New patchset: Wpmirrordev; "Fix for compatibility with help2man and Debian Policy" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/64343 [17:32:43] mutante, where the config change is you mean? the request it to remove wikibugs-l from all CC lists because it's a global watcher and some security things - andre__ wanted to wait until the next bugzilla upgrade to do it so as to not disable global bugmail whilst people are using bugzilla [17:33:24] RECOVERY - Host snapshot1001 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [17:33:38] errm, context? [17:33:55] PROBLEM - Host formey is DOWN: PING CRITICAL - Packet loss = 100% [17:33:55] PROBLEM - NTP on snapshot1001 is CRITICAL: NTP CRITICAL: Offset unknown [17:34:02] mutante: /etc/bugzilla/ ? [17:34:15] !log bugzilla is back [17:34:24] Logged the message, Master [17:34:35] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [17:34:50] andre__, mutante was rebooting bugzilla and I thought of https://bugzilla.wikimedia.org/show_bug.cgi?id=47013 [17:34:55] RECOVERY - Host formey is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms [17:35:05] RECOVERY - DPKG on snapshot1002 is OK: All packages OK [17:35:38] Thehelpfulone, I wouldn't extend downtime when it was unannounced [17:36:39] <^demon> Hrm, formey responds to ping, but can't ssh. [17:36:46] Thehelpfulone: i'm talking to andre__ .. 
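The "name, kernel, ip" master list mutante and RobH discuss can be sketched as a loop over a host list. This is a hypothetical illustration: `hosts.txt` and `kernel-audit.txt` are stand-in names, the loop runs locally against `localhost`, and a production run would execute the quoted command on each host via ssh or dsh instead.

```shell
# Hypothetical audit sketch: one "name kernel" line per host, sorted into a
# single list. hosts.txt stands in for a dsh group file.
printf '%s\n' localhost > hosts.txt
while read -r h; do
    # in production: ssh "$h" 'echo "$(hostname) $(uname -r) $(hostname -i)"'
    echo "$h $(uname -r)"
done < hosts.txt | sort > kernel-audit.txt
cat kernel-audit.txt
```

The resulting file is what could then be pasted into the shared spreadsheet RobH proposes.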
we can do this if necessary [17:36:49] ^demon: lemme check mgmt [17:36:55] RECOVERY - NTP on snapshot1001 is OK: NTP OK: Offset 0.001713275909 secs [17:37:05] PROBLEM - SSH on formey is CRITICAL: Connection refused [17:37:15] PROBLEM - HTTP on formey is CRITICAL: Connection refused [17:37:25] PROBLEM - HTTPS on formey is CRITICAL: Connection refused [17:37:31] Thehelpfulone, again, not a good opportunity. [17:37:41] I won't extend unplanned downtime. [17:37:46] mutante, ^ [17:38:02] okay [17:38:15] !log rebooted snapshots 1001-4 (new kernel) [17:38:15] as part of a planned downtime: yes. but that's not the case. [17:38:24] Logged the message, Master [17:38:45] ^demon: no output on mgmt, about to powercycle [17:38:52] oh wait.. something now [17:39:04] coming up [17:39:15] RECOVERY - HTTP on formey is OK: HTTP OK: HTTP/1.1 200 OK - 3596 bytes in 0.094 second response time [17:39:17] wait, svn won't come back? [17:39:25] RECOVERY - HTTPS on formey is OK: OK - Certificate will expire on 08/22/2015 22:23. [17:39:29] andre__, oh okay I thought it was one that had already been scheduled that I missed [17:39:29] there you go [17:39:33] should've left it as it was mutante :-P [17:39:43] i would be soooo ok with that:) [17:40:05] RECOVERY - SSH on formey is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:40:33] <^demon> svn's moving to eqiad soon as a r/o service. 
[17:45:30] !log shutting down cerium to add add'l disk [17:45:39] Logged the message, Master [17:47:25] PROBLEM - Host cerium is DOWN: PING CRITICAL - Packet loss = 100% [17:49:38] !log updating and rebooting image scalers one by one [17:49:47] Logged the message, Master [17:51:55] PROBLEM - NTP on snapshot1003 is CRITICAL: NTP CRITICAL: Offset unknown [17:52:03] mw1153 [17:53:35] PROBLEM - Host mw1153 is DOWN: PING CRITICAL - Packet loss = 100% [17:54:25] RECOVERY - Host mw1153 is UP: PING OK - Packet loss = 0%, RTA = 4.31 ms [17:55:55] RECOVERY - NTP on snapshot1003 is OK: NTP OK: Offset -0.00431907177 secs [17:57:05] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection refused [17:58:05] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.104 second response time [17:58:21] mw1154 [18:02:45] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [18:02:45] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [18:02:45] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [18:04:25] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection refused [18:06:45] Reedy, regarding your request about creating search indexes… do you know how to do it, or know who does? 
[18:06:56] RECOVERY - Host cerium is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [18:07:04] https://wikitech.wikimedia.org/wiki/Lucene#Adding_new_wikis [18:07:05] New patchset: Mwalker; "Removing France as a Special Redirect for Fundraising" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64347 [18:07:17] andrewbogott: ^ But it might be better just asking mutante or notpeter to do them as they can be a bit weird [18:07:46] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.078 second response time [18:09:27] hm… notpeter, mutante, either of you interested in taking on https://rt.wikimedia.org/Ticket/Display.html?id=5162? [18:11:29] notpeter: the jobs runners are 12-cores right? [18:11:52] mw1155 [18:13:46] PROBLEM - Host mw1155 is DOWN: PING CRITICAL - Packet loss = 100% [18:14:26] RECOVERY - Host mw1155 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:16:25] !log carbon rebooting for kernel update, tftp in eqiad will be offline for a few [18:16:36] Logged the message, RobH [18:16:36] PROBLEM - Puppet freshness on db1017 is CRITICAL: No successful Puppet run in the last 10 hours [18:18:32] AaronSchulz: yep [18:19:27] notpeter: I'd like to bump the proc count a bit [18:21:25] New patchset: Aaron Schulz; "Bumped job process count to 15." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/64351 [18:21:35] notpeter: ^ [18:22:10] looks like hyperthreading is on and there are often a bit less than 12 procs due to how it works...and you want more procs than cpu to mask i/o [18:22:31] 1156 [18:23:26] PROBLEM - Host carbon is DOWN: CRITICAL - Host Unreachable (208.80.154.10) [18:23:48] AaronSchulz: we could also just turn off hyperthreading [18:23:54] it'd be kinda a pita [18:24:00] but it should be off anyway [18:24:26] PROBLEM - Host mw1156 is DOWN: PING CRITICAL - Packet loss = 100% [18:24:56] RECOVERY - Host mw1156 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [18:26:05] notpeter: thanks for the lucene related merges last week :-] [18:27:43] hashar: yep! [18:29:46] bleh, carbon not coming back, investigating. [18:30:19] 1157 [18:30:40] Change merged: Katie Horn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64347 [18:31:06] !log Jenkins: created a Jenkins slave on gallium using jenkins-slave as user and /srv/ssd/jenkins-slave as working directory. [18:31:16] Logged the message, Master [18:31:23] oohhh :-D [18:32:14] fuuuuuuuuu.....asdf243t243t utiefjwadscxz [18:32:18] carbon has disk failure. [18:32:22] sigh. [18:32:30] yay [18:32:56] meh, i can redirect all the tftp install traffic to brewster, but its painful slow. [18:33:18] oh well, its gonna boot anyhow [18:34:03] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [18:34:03] PROBLEM - NTP on carbon is CRITICAL: NTP CRITICAL: Offset unknown [18:35:16] apergos / Reedy / yurik / andrewbogott [18:35:23] youguys on bast1001 [18:35:24] i wanna upgrade it. 
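AaronSchulz's sizing argument above (more worker processes than CPU threads, so i/o-blocked workers don't leave CPUs idle) can be made concrete. The +25% factor below is an assumed rule of thumb for illustration, not the actual puppet logic; it happens to reproduce the 12-thread to 15-process bump in the change.

```shell
# Illustrative sizing: oversubscribe the hardware threads slightly to mask
# i/o waits. The divisor 4 is an assumption, not taken from the patch.
threads=12                           # in practice: threads=$(nproc)
procs=$(( threads + threads / 4 ))
echo "$threads threads -> $procs job runner processes"
```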
[18:35:30] I thought it [18:35:30] (bastions kinda important) [18:35:33] already [18:35:38] oh, its on the not done list [18:35:52] 3.2.0-43-generic [18:35:53] RobH, you can kick yurik and me [18:35:53] but yep, its done, nm [18:35:55] lol [18:35:56] happened yesterday [18:35:57] meh, no need [18:36:21] I know, I was on it at the time trying to merge on sockpuppet :-P [18:36:42] RobH, ok, I'll get out of the way [18:37:04] oops [18:37:07] andrewbogott: bast1001 already updated, it's all good [18:37:46] If I pushed a cluster config change affecting some fundraising stuff would I be stomping on anyone? [18:38:03] RECOVERY - NTP on carbon is OK: NTP OK: Offset -0.004752874374 secs [18:38:39] 1158 [18:39:10] apergos: what are you counting? :) [18:39:26] image scalers as they get rebooted :-D [18:39:39] I figure it's polite to mention em as they go down [18:39:46] ah [18:39:55] are you doing tampa too? [18:40:00] I will be, yep [18:40:03] PROBLEM - Host mw1158 is DOWN: PING CRITICAL - Packet loss = 100% [18:40:08] cool [18:40:10] assuming they need it (will check the kernel on em) [18:40:11] thanks :) [18:40:16] sure [18:40:23] hashar: gallium? [18:40:37] paravoid: I have asked Ariel to merge the pending change. [18:40:52] I don't see any [18:40:52] paravoid: puppet is fixed, I had to send a few more changes though [18:41:01] I'm not a reviewer most probably [18:41:07] yeah sorry skipped you :( [18:41:11] no worries [18:41:13] RECOVERY - Host mw1158 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [18:41:15] can we reboot it? [18:41:22] not at this time of the day [18:41:32] okay [18:41:48] !log mwalker synchronized wmf-config/CommonSettings.php 'Removing france as a redirect country for fundraising (7649276ddb3f20678de265c37fe75e1d5e642956)' [18:41:52] hashar: can we reboot gallium ? 
i wasn't sure what you meant with "carefully" doing it:) [18:41:56] Logged the message, Master [18:41:57] lets do it in roughly half an hour when SF folks get out for lunch [18:41:58] might be a little late now... but cna we also turn off hyperthreading on the boxes as we reboot them? [18:41:58] for zuul [18:42:19] the problem with gallium is that while it reboots jenkins / zuul do not respond [18:42:27] and I am not sure how jenkins will behave on restart :-] [18:42:30] specifically the mw* boxes [18:42:48] feel free to shout in the office desk that jenkins is going down :-] [18:43:01] on the image scalers? [18:43:04] notpeter: [18:43:11] and yeah it's a little late, bout done with this batch [18:43:19] ok, nvm [18:43:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64351 [18:43:40] do we care about the image scalers in tampa for that? [18:43:50] it should already be off on them [18:43:55] ok [18:44:05] mutante: paravoid: I am sending announcements about gallium going down. [18:46:10] 1159 [18:46:52] mutante: paravoid: lets reboot it at noon PST / 10pm (Greece time) [18:46:58] aka in 15 minutes [18:47:10] I'm going out [18:47:10] !log dist-upgrading calcium [18:47:19] Logged the message, Master [18:47:21] paravoid: will do it with mutante so :) [18:47:26] great, thanks [18:47:43] I'll be around for a bit yet in case there's a meltdown [18:47:54] but with intermittent cooking etc [18:48:03] PROBLEM - Host mw1159 is DOWN: PING CRITICAL - Packet loss = 100% [18:48:03] PROBLEM - RAID on analytics1018 is CRITICAL: Timeout while attempting connection [18:48:23] RECOVERY - Host mw1159 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:49:33] !caesium rebooting [18:49:39] !clog aesium rebooting [18:49:41] bah [18:49:43] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:48] !log caesium rebooting [18:49:55] !log i hate you morebots. 
[18:49:56] Logged the message, RobH [18:50:04] Logged the message, RobH [18:50:23] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection refused [18:51:23] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [18:54:15] !log shutting down titanium for add'l disk [18:54:23] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.204 second response time [18:54:25] Logged the message, Master [18:54:32] cmjohnson1: hold up [18:54:38] k [18:54:39] if you havent [18:54:44] so we are also rebooting for kernels [18:54:50] oh... [18:54:52] okay [18:54:53] i rather keep titanium up right now, and upgrade cerium [18:54:55] it can wait [18:55:00] it'll be a few more minutes [18:55:04] so cerium is back? [18:55:06] !log dist-upgrading ekrem (IRC) [18:55:07] yes [18:55:14] Logged the message, Master [18:55:54] !log dist-upgrading all ssl boxes [18:55:59] !log cerium rebooting for kernel upgrade [18:56:02] Logged the message, Master [18:56:15] Logged the message, RobH [18:56:58] and 1160, last of the eqiad batch [18:58:43] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100% [18:58:53] PROBLEM - Host mw1160 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:13] RECOVERY - Host mw1160 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [18:59:33] cmjohnson1: Ok, so cerium is rebooting [18:59:42] i want it back online, pooled, and at load before we kill titanium [18:59:42] !log gallium: stopping jenkins and zuul [18:59:43] PROBLEM - SSH on ekrem is CRITICAL: Connection refused [18:59:49] ok...i will check the logs [18:59:51] Logged the message, Master [19:00:13] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [19:00:29] mutante: zuul and jenkins stopping on gallium [19:00:43] RECOVERY - SSH on ekrem is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:01:53] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection refused [19:02:00] cmjohnson1: once its 
pooled isnt enough [19:02:13] PROBLEM - SSH on iron is CRITICAL: Connection refused [19:02:13] we want to ensure its carrying its normal load before we kill the other caching server [19:02:42] mutante: should I just use reboot or does it need to be done via the console ? :D [19:03:03] PROBLEM - HTTP on gallium is CRITICAL: Connection refused [19:03:07] !log wikimedia irc is back [19:03:09] robh: how do you ensure it carrying normal load? [19:03:10] hashar: doing it now [19:03:13] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [19:03:13] thx [19:03:14] Logged the message, Master [19:03:29] !log rebooting gallium [19:03:30] of course, the day after I decide to subscribe to the http://identi.ca/wikimediatech account to have an easy way to review the deploys, this happens, flooding my identi.ca feed [19:03:38] Logged the message, Master [19:03:50] !log eqiad imagescalers done [19:03:53] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.226 second response time [19:03:58] Logged the message, Master [19:04:25] cmjohnson1: Ok, so this is new to me, cuz its varnish [19:04:30] so take whatever i say with grain of salt. [19:04:41] noted [19:05:16] so, i am comparing the hit ratios of varnish on both cerium and titanium [19:05:19] with varnishstat [19:05:29] i see titanium is .96 [19:05:34] hashar: of course .. 
fsck :p [19:05:35] and cerium was in .5 [19:05:39] but now is back up to .9 [19:05:49] keeps going between .83 and .9 [19:05:50] heh heh [19:05:56] (just varnishstat on cli) [19:06:03] PROBLEM - NTP on caesium is CRITICAL: NTP CRITICAL: Offset unknown [19:06:09] so, caesium is pooled in pybal [19:06:13] RECOVERY - SSH on iron is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:06:21] mutante: that must happen from time to time [19:06:27] mw75 [19:06:47] cmjohnson1: Also, on the ipvsadm on lvs1003 [19:06:48] hashar: ack, it's the regular " has gone 325 days without" bla [19:06:56] shows same number of connections now between the two [19:07:33] basically i didnt want to kill titanium when cerium had a cold cache [19:07:43] but it appears ok now [19:08:13] ^ understood [19:08:13] PROBLEM - jenkins_service_running on gallium is CRITICAL: Connection refused by host [19:08:23] PROBLEM - SSH on gallium is CRITICAL: Connection refused [19:08:51] break, friends called to go to dinner [19:09:02] will finish up tampa imagescalers when I get back [19:09:17] cmjohnson1: so you are ok to take down titanium [19:09:24] when you finish, let me know so i can dist-upgrade it. [19:09:30] got it! 
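The hit ratios RobH reads off `varnishstat` can be derived from its one-shot counter dump. The sample input below is canned so the arithmetic is visible without a running Varnish; the `cache_hit`/`cache_miss` counter names match the Varnish 3 era of this log and may differ on newer versions.

```shell
# varnishstat -1 prints "name value rate description" lines; compute
# hits / (hits + misses) from the two cache counters.
printf 'cache_hit   960   0.00 Cache hits\ncache_miss   40   0.00 Cache misses\n' \
  | awk '$1 == "cache_hit"  { hit  = $2 }
         $1 == "cache_miss" { miss = $2 }
         END { printf "hit ratio: %.2f\n", hit / (hit + miss) }'
# live: varnishstat -1 | awk '...same program...'
```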
[19:09:37] well, you can do it if you want [19:09:43] but i'd take down, install, then do the upgrade [19:09:49] okay [19:11:03] RECOVERY - NTP on caesium is OK: NTP OK: Offset 9.703636169e-05 secs [19:11:55] !log shutting titanium down to add disks (take 2) [19:12:04] Logged the message, Master [19:12:26] exit [19:12:30] heh, wrong window [19:14:03] PROBLEM - Host titanium is DOWN: PING CRITICAL - Packet loss = 100% [19:15:13] PROBLEM - RAID on analytics1020 is CRITICAL: Connection refused by host [19:15:52] !log holmium (blog server) rebooting for kernel upgrade [19:16:00] Logged the message, RobH [19:16:53] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:10] * RobH begins the sysadmin reboot mantra for critical servers [19:17:19] pleasedonediepleasedonediepleasedonediepleasedonediepleasedonediepleasedonediepleasedonediepleasedonediepleasedonediepleasedonedie [19:18:33] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:18:54] !log blog back up, whew. [19:19:03] Logged the message, RobH [19:19:58] hashar: that didnt work out right.. [19:20:05] no console output [19:20:11] still fsck ? [19:20:13] RECOVERY - Host titanium is UP: PING OK - Packet loss = 0%, RTA = 2.58 ms [19:20:17] �������� [19:20:22] <-- that's what i see [19:20:35] at least it ping so it is not entirely dead [19:20:43] maybe disconnect / reconnect the console? [19:20:52] i did, and even reset it [19:20:59] oh, no i get 1;-11;-1fUbuntu 12.041;-1f. [19:21:38] hold on ... 
[19:22:02] sees BIOS messages again [19:22:54] cmjohnson1: So, when you finish adding in the hard disks, you can go ahead and also do the kernel upgrade with apt-get dist-upgrade [19:23:08] and reboot, it will then load up in newer kernel [19:23:16] * Starting Jenkins Continuous Integration Server jenkins [ OK ] [19:23:18] and shoudl auto repool and such [19:23:23] RECOVERY - SSH on gallium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:23:25] gallium login: [19:24:03] RECOVERY - HTTP on gallium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 563 bytes in 0.002 second response time [19:24:14] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [19:24:14] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [19:28:15] !log depooled ssl3/4 ssl3002/3 ssl1003/1004 [19:28:24] Logged the message, Master [19:28:56] !log rebooted ssl3/4 ssl3002/3 ssl1003/1004 [19:29:05] Logged the message, Master [19:29:19] a single ssl host handling all of the esams: http://ganglia.wikimedia.org/latest/?c=SSL%20cluster%20esams&h=ssl3001.esams.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [19:29:23] PROBLEM - Host ssl1003 is DOWN: CRITICAL - Host Unreachable (208.80.154.9) [19:29:33] PROBLEM - Host ssl3 is DOWN: PING CRITICAL - Packet loss = 100% [19:29:33] PROBLEM - Host ssl4 is DOWN: PING CRITICAL - Packet loss = 100% [19:29:44] bandwidth is a problem, but CPU and memory wise it's fine [19:29:49] !log restarted Jenkins [19:30:01] Logged the message, Master [19:31:05] !log repooling ssl3002/3 [19:31:12] Logged the message, Master [19:31:13] RECOVERY - Host ssl3 is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms [19:31:13] RECOVERY - Host ssl1003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:31:20] Apparently IRC is broken [19:31:23] RECOVERY - Host ssl4 is UP: PING OK - Packet loss = 0%, 
RTA = 26.63 ms [19:31:28] Server is up but you can't join channels [19:31:30] Krenair: yep. mutante is working on it [19:31:35] ok [19:33:43] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:13] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [19:39:41] !log repooling ssl1003/4 ssl3/4 [19:39:49] Logged the message, Master [19:39:52] !log depooling ssl3001 ssl1/2 ssl1001/2 [19:40:00] Logged the message, Master [19:41:23] PROBLEM - DPKG on cp1016 is CRITICAL: Timeout while attempting connection [19:41:55] !log Jenkins restarted successfully. [19:42:04] Logged the message, Master [19:42:53] PROBLEM - Host cp1016 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:13] RECOVERY - DPKG on cp1016 is OK: All packages OK [19:44:23] RECOVERY - Host cp1016 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:44:43] PROBLEM - Host ssl2 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:43] PROBLEM - Host ssl1 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:53] PROBLEM - Host ssl3001 is DOWN: PING CRITICAL - Packet loss = 100% [19:45:23] RECOVERY - Host ssl3001 is UP: PING OK - Packet loss = 0%, RTA = 89.98 ms [19:45:43] RECOVERY - Host ssl1 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [19:45:46] !log rebooted ssl1/2 ssl3001 ssl1001/1002 [19:45:54] Logged the message, Master [19:48:24] binasher: the first graph on https://gdash.wikimedia.org/dashboards/jobq/ is a python backtrace [19:49:04] AaronSchulz: maybe that really is the metric [19:49:08] woah.... 
[19:51:03] notpeter: https://gerrit.wikimedia.org/r/#/c/55059/ [19:53:51] hm, gone [19:54:33] PROBLEM - NTP on titanium is CRITICAL: NTP CRITICAL: Offset unknown [19:54:40] !log Wikimedia IRC server working again [19:54:48] Logged the message, Master [19:57:04] PROBLEM - Host cp1017 is DOWN: PING CRITICAL - Packet loss = 100% [19:58:33] RECOVERY - NTP on titanium is OK: NTP OK: Offset 9.799003601e-05 secs [19:59:13] RECOVERY - Host cp1017 is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [20:04:53] New review: Ottomata; "Awesome!" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [20:08:09] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 20:08:00 UTC 2013 [20:08:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:09:19] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 20:09:09 UTC 2013 [20:09:27] Jeff_Green: so that 'Error connecting to db1025.eqiad.wmnet' is on someones todo list right? 
I just want to make sure it will get dealt with sometime [20:09:51] generally yeah, it's the purview of fr-tech [20:09:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:10:19] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 20:10:14 UTC 2013 [20:10:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:11:19] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 20:11:10 UTC 2013 [20:11:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:12:09] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 20:11:58 UTC 2013 [20:12:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:19] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 20:13:18 UTC 2013 [20:13:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:59] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Fri May 17 20:14:50 UTC 2013 [20:14:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:21:04] notpeter: what's the link to the db tree? [20:22:39] https://noc.wikimedia.org/dbtree/ [20:29:06] New patchset: Ottomata; "Puppetizing Hadoop for CDH4." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [20:35:28] yooo hashar! [20:35:35] you there? 
[20:35:38] ottomata: yes [20:36:02] hiya, akosiaris and I are wondering if we can get jenkins puppet linting for operations/puppet/* repos [20:36:15] (and are also curious about how you set that up) [20:36:33] sure [20:36:43] I should really write a how to :-] [20:38:03] ^demon: https://gerrit.wikimedia.org/r/#/c/64196/ would be nice too :) [20:38:17] * AaronSchulz wonders where binasher is [20:38:24] ottomata: yeah I think I can hack something [20:38:57] hashar: would it be possible to also call unit tests if present ? [20:39:05] ottomata: could you fill a bug about it under Wikimedia > Continuous Integration ? [20:39:09] i.e. a tests directory ? [20:39:14] ottomata: it is not that long to do but it is too late for now sorry [20:39:22] ook cool [20:39:22] can do [20:39:32] ottomata: basically I need to slightly overhaul the existing test and use some recent shell script I wrote [20:40:30] but basically the process is: run a shell script that find changed .pp | xargs to puppet parser validate [20:40:59] that will also let me get rid of the rake --validate being run right now by jenkins [20:41:14] akosiaris: yeah I noticed you added some tests on hadoop repo. [20:41:26] akosiaris: I am not sure how harmful it is though :-] [20:42:44] hashar, akosiaris: https://bugzilla.wikimedia.org/show_bug.cgi?id=48590 [20:43:22] why I am still writing shell scripts when I could use python [20:44:49] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [20:49:34] ottomata: well I am just hacking it right now [20:49:38] it is not too long afterall [20:51:00] ok cool! [20:55:11] hashar: wait, maybe i'm confused [20:55:17] aren't you already doing this for operations/puppet repo? 
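The lint step hashar outlines ("find changed .pp | xargs to puppet parser validate") can be sketched as below. A throwaway repo makes the sketch self-contained, and `puppet parser validate` is swapped for an `echo` so no Puppet install is assumed; the live command is shown in the trailing comment.

```shell
# Build a tiny repo whose latest commit adds one manifest, then diff out the
# changed .pp files and hand each to the validator.
repo=$(mktemp -d); cd "$repo"
git init -q .
git -c user.email=ci@example.org -c user.name=ci commit -q --allow-empty -m base
echo 'class demo { }' > demo.pp
git add demo.pp
git -c user.email=ci@example.org -c user.name=ci commit -q -m 'add manifest'
git diff --name-only HEAD~1 -- '*.pp' \
  | xargs -r -n1 echo would-validate
# live: git diff --name-only HEAD~1 -- '*.pp' | xargs -r -n1 puppet parser validate
```

`xargs -r` (GNU `--no-run-if-empty`) keeps the validator from running at all when a commit touches no manifests.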
[20:56:19] hashar: Since it's puppet, you could do that in ruby and save the pipe/fork to another process ;) But you're not that mental [20:57:31] Damianz: I am going to get rid of the ruby magic :-] note that ops/puppet has a rake file, one can do: rake validate [20:58:58] * Damianz gives hashar 1 cookie [21:00:49] New review: Reedy; "See https://bugzilla.wikimedia.org/show_bug.cgi?id=48589 FR noisy notices on dewiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48634 [21:01:31] ahh, hm [21:02:36] hmmm, hashar, i can add a rake file to this project [21:02:41] should I? [21:03:27] nop [21:03:31] I am writing the magic to get rid of it [21:03:59] PROBLEM - Host analytics1022 is DOWN: PING CRITICAL - Packet loss = 100% [21:05:13] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [21:05:43] PROBLEM - NTP on analytics1022 is CRITICAL: NTP CRITICAL: Offset unknown [21:06:42] mk [21:06:51] https://gerrit.wikimedia.org/r/#/c/64432/ :-] [21:07:09] * hashar waits for jenkins [21:07:46] oo, its gotta be manual then? [21:07:54] manual ? [21:07:55] can't wildcard operations/puppet/* [21:07:55] ? 
[21:08:07] no we cant [21:08:11] hmm, rats ok [21:08:22] the way I am doing it, there must be a job per repo [21:08:34] hm, ok [21:08:38] but maybe one day I will have just one linting job per language [21:08:43] RECOVERY - NTP on analytics1022 is OK: NTP OK: Offset 0.0005452632904 secs [21:08:46] I'm going to add links these changes to the bug report so that next time I create a module I can submit a change for you to review [21:08:57] well [21:09:04] ideally I should write an how to to scale [21:09:09] aye [21:09:14] so other people can create the jenkins jobs / zuul triggers [21:09:28] that is only me, timo and marktraceur for now [21:09:42] oh the links are already there :) [21:09:44] yeah [21:10:00] oh my l10n bot has kicked in [21:10:04] lemme know when those are in, and I'll submit another patchset to see how it goes [21:11:08] I knew I should have postponed that to monday :-] [21:11:11] but I am too nice [21:13:07] !log gallium Manually blackholed some web engine crawler (via ip route) [21:13:15] Logged the message, Master [21:16:06] pff [21:17:56] mw76 [21:18:27] ottomata: deployed [21:18:49] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/64434 [21:19:00] New patchset: Ottomata; "Puppetizing Hadoop for CDH4." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [21:19:37] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/64434 [21:19:53] PROBLEM - Host mw76 is DOWN: PING CRITICAL - Packet loss = 100% [21:20:12] k hashar, i just submitted a new patchset on cdh4 [21:20:18] should I see jenkins bot do linting? 
[21:20:23] RECOVERY - Host mw76 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [21:20:38] Project operations/puppet/kafka not found [21:20:39] grmblblb [21:21:33] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/64436 [21:22:02] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT).." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/64436 [21:22:14] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)." [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/64434 [21:22:28] I forgot to apply the conf change on zuul huhu [21:22:53] PROBLEM - Apache HTTP on mw76 is CRITICAL: Connection refused [21:23:53] RECOVERY - Apache HTTP on mw76 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.192 second response time [21:24:19] New review: Hashar; "recheck" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [21:24:48] Change abandoned: Hashar; "(no reason)" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/64436 [21:25:01] Change abandoned: Hashar; "(no reason)" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/64434 [21:25:17] New review: Hashar; "recheck" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [21:25:53] PROBLEM - Host stat1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:26:04] 77 [21:26:23] RECOVERY - Host stat1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [21:26:25] ottomata: you got erb and pp linter for cdh4 and hadoop repo. I have triggered the check by posting the comment 'recheck' on the changes https://gerrit.wikimedia.org/r/#/c/61710/ and https://gerrit.wikimedia.org/r/#/c/50385/ (ping akosiaris ) [21:27:05] though the pp does not work hehe [21:28:03] PROBLEM - Host mw77 is DOWN: PING CRITICAL - Packet loss = 100% [21:28:23] RECOVERY - Host mw77 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [21:28:51] oh cool! 
[21:28:58] I made a typo
[21:29:16] the puppet parser validate command was being given the .erb files instead of the .pp ones
[21:29:26] haha :)
[21:30:17] thanks hashar! i know its late, much appreciated!
[21:32:38] ah https://integration.wikimedia.org/ci/job/operations-puppet-cdh4-pplint-HEAD/4/console
[21:32:39] fixed
[21:33:14] yurik: fenari reboot pending
[21:33:19] New review: Hashar; "recheck" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710
[21:33:21] 78
[21:33:25] New review: Hashar; "recheck" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385
[21:33:32] !log about to reboot fenari
[21:33:41] Logged the message, Master
[21:34:02] Aaron|laptop: fenari reboot pending
[21:34:26] wait, how'd you do the recheck hashar?
[21:34:31] you just make a 'recheck' comment?
[21:34:47] yep, recheck comments
[21:34:48] does that work everywhere?
[21:34:52] yeah 'recheck'
[21:34:54] cooool!
[21:34:59] that retriggers the linting
[21:35:08] to retrigger unit tests you still have to send a new patchset
[21:35:27] PROBLEM - Host mw78 is DOWN: PING CRITICAL - Packet loss = 100%
[21:35:39] ottomata: if that works for you, I will let you resolve https://bugzilla.wikimedia.org/show_bug.cgi?id=48590
[21:35:47] RECOVERY - Host mw78 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms
[21:35:47] PROBLEM - NTP on mw78 is CRITICAL: NTP CRITICAL: Offset unknown
[21:36:05] ottomata: if there is any issue, that will have to wait later on; Drop me an email on hashar @ free . fr and or amusso @ wikimedia.org and I will fix it whenever I can :]
[21:36:09] eees good
[21:36:09] danke
[21:36:29] I forgot the 'merge' job
[21:36:40] so if the patchset does not apply against latest master, it will not complain properly
[21:36:46] and the lint job will fail miserably :-]
[21:36:50] oh hm ok
[21:37:12] I will reopen bug
[21:38:06] we will have to talk about puppet unit / integration tests one day :]
[21:38:37] PROBLEM - Host fenari is DOWN: PING CRITICAL - Packet loss = 100%
[21:39:22] ok cool
[21:39:36] I am out for bed :-]
[21:39:37] RECOVERY - Host fenari is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[21:39:41] enjoy the module
[21:39:47] RECOVERY - NTP on mw78 is OK: NTP OK: Offset 0.0007004737854 secs
[21:39:54] we can have a look at generating puppet doc for it and publish that under doc.wikimedia.org
[21:40:01] yet another bug hehe
[21:40:05] *wave*
[21:40:06] haha, cool!
[21:40:09] ok, thanks again!
[21:40:12] have good sleep!
[21:40:15] 79
[21:40:58] later hashar
[21:41:04] apergos: thanks for all the jenkins madness earlier today
[21:41:16] apergos: I got a slave running on gallium now :-]
[21:41:19] excellent
[21:41:38] I'm just upgrading a couple more hosts and that will be it for the day
[21:41:38] that let me prepare the work for the next server :)
[21:41:54] so what's the next server?
[21:42:07] PROBLEM - Host mw79 is DOWN: PING CRITICAL - Packet loss = 100%
[21:42:31] lies, I'm on it right now
[21:42:34] apergos: not sure yet, it is being prepared
[21:42:37] RECOVERY - Host mw79 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms
[21:42:48] ok
[21:43:17] sleeping
[21:46:05] I wish I could sleep that easily
[21:46:06] 80
[21:46:27] <^demon> I can go to sleep instantly.
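The lint fix hashar describes above amounts to routing each file set to its own checker: `.pp` manifests to `puppet parser validate`, `.erb` templates to the ERB syntax check. A minimal sketch, assuming a conventional puppet module layout (the directory and file names here are made up for illustration; the real job configuration lives in the integration setup, and the checker commands are shown rather than executed):

```shell
# Sketch of the fixed lint routing. The typo described above fed the
# .erb templates to the .pp validator, so the lint job always failed.
set -e
workdir=$(mktemp -d)                       # hypothetical scratch module layout
mkdir -p "$workdir/manifests" "$workdir/templates"
echo 'class demo { }' > "$workdir/manifests/init.pp"
echo '<%= @name %>'   > "$workdir/templates/conf.erb"

# Select each file set for its own checker (commands printed, not run here):
find "$workdir" -name '*.pp' | while read -r f; do
  echo "pp lint:  puppet parser validate $f"
done
find "$workdir" -name '*.erb' | while read -r f; do
  echo "erb lint: erb -P -x -T - $f | ruby -c"
done
```

Running the printed commands on a host with puppet and ruby installed catches syntax errors in each file type before merge.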
[21:47:37] PROBLEM - Host mw80 is DOWN: PING CRITICAL - Packet loss = 100%
[21:48:22] oh, apergos, the numbers you're spouting are mws, I thought you were missing the "/win" part of an irssi command
[21:48:36] * greg-g catches up
[21:48:42] !log DNS update - kill storage1 and 2
[21:48:51] Logged the message, Master
[21:50:27] PROBLEM - mysqld processes on db1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[21:50:55] yeah, so that when people see they are down they don't wonder what's going on
[21:52:11] * greg-g nods
[21:52:49] !log dist-upgrading zhen (mobile vumi)
[21:52:57] Logged the message, Master
[21:56:52] yay mw80 memtest failure dimm1
[21:57:08] yay zhen dpkg failure python-iso8601
[21:57:38] !log mw80 memtest failure dimm1, all other image scalers in pmtpa updated
[21:57:46] Logged the message, Master
[21:59:18] and ticket created, time for bed
[21:59:34] thanks for all the upgrading apergos, good night
[21:59:48] good night!
[22:00:16] https://vo.wikipedia.org/wiki/Cifapad
[22:00:21] this is surprising :)
[22:00:27] RECOVERY - Host ssl2 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms
[22:00:41]
[22:02:40] !log reedy synchronized wmf-config/flaggedrevs.php
[22:02:47] binasher: I could use job runner profiling now, rarr
[22:02:49] Logged the message, Master
[22:03:14] apergos: m80 when kaboom?
[22:03:15] New patchset: Reedy; "FR noisy notices on dewiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64446
[22:03:19] *went
[22:03:28] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64446
[22:03:50] uh huh. and it will stay that way til someone does something about the memory in it I guess
[22:04:10] and now really gone...
[22:04:17] PROBLEM - Host lanthanum is DOWN: CRITICAL - Host Unreachable (208.80.154.13)
[22:04:58] PROBLEM - Host zhen is DOWN: PING CRITICAL - Packet loss = 100%
[22:05:10] apergos: http://en.wikipedia.org/wiki/M-80_%28explosive%29
[22:05:18] RECOVERY - Host lanthanum is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[22:05:24] Can someone please rm -rf /a/commonphp-1.21wmf12 on tin for me?
[22:05:44] New patchset: Reedy; "Kill 1.21wmf11 and 1.21wmf12" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64448
[22:05:58] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64448
[22:06:24] /a/common/php-1.21wmf12
[22:06:38] RECOVERY - Host zhen is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[22:07:28] !log rm -rf php-1.21wmf12 on tin per reedy
[22:07:31] Reedy: done
[22:07:33] Reedy: oh, speaking of perms, did yurik's permission issue from last night (well, last night my time) get resolved?
[22:07:34] thanks
[22:07:36] Logged the message, Master
[22:07:50] Yeah, mark did it earlier for me
[22:07:53] !log reedy synchronized docroot
[22:08:01] greg-g: Noting I reported it a few hours before he did ;)
[22:08:01] Logged the message, Master
[22:08:16] Reedy: right, that. ;)
[22:09:40] !log reedy synchronized w
[22:09:48] Logged the message, Master
[22:12:37] New patchset: Reedy; "Sync w at the same time as docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64449
[22:13:05] whilst poking around in the dberror log -- I've seen a lot of "Fri May 17 22:12:26 UTC 2013 mw104 enwiki Error connecting to 10.0.6.81: Can't connect to MySQL server on '10.0.6.81' (4)"
[22:13:28] db71
[22:13:46] kk -- so known about
[22:13:54] I've no idea :p
[22:14:03] ah
[22:14:03] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours
[22:14:03] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[22:14:04] !log rebooting spence for upgrades
[22:14:04] was just pasting it in case others have the same query
[22:14:13] Logged the message, Master
[22:14:18] It's a pmtpa database host...
[22:14:39] eeeeviiiiil
[22:15:08] PROBLEM - Host spence is DOWN: PING CRITICAL - Packet loss = 100%
[22:15:09] What's even still running mw stuffs in pmtpa?
[22:15:57] clearly something on enwiki
[22:16:56] the only references I have in config are in wmf-config/db-pmtpa.php
[22:17:00] but that seems totally reasonable
[22:18:29] !!log running hotbackup of db71 to pre-labsdb for s1
[22:18:29] petan needs a new hobby :P
[22:18:49] lol
[22:18:51] 31016 apache 20 0 572m 64m 33m S 6 0.1 0:11.24 apache2
[22:19:01] !!test
[22:19:32] !!log
[22:19:32] petan needs a new hobby :P
[22:19:38] heh
[22:19:50] I wonder if it's icinga doing healthchecks against the apaches..
[22:20:18] RECOVERY - Host spence is UP: PING OK - Packet loss = 0%, RTA = 26.86 ms
[22:20:37] interesting -- !!log doesn't appear to actually log anything
[22:20:44] spence, can have ssh?
[22:21:04] !log [23:18:29] !!log running hotbackup of db71 to pre-labsdb for s1
[22:21:13] Logged the message, Master
[22:21:44] Reedy: thanks, heh
[22:22:14] !log!!logdoesntlog
[22:22:28] PROBLEM - SSH on spence is CRITICAL: Connection refused
[22:22:32] !log !!log doesn't log
[22:22:41] Logged the message, Master
[22:24:27] * mutante kicks spence
[22:24:58] PROBLEM - DPKG on cp1018 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[22:28:08] PROBLEM - Host cp1018 is DOWN: PING CRITICAL - Packet loss = 100%
[22:28:35] /dev/mapper/spence-root has gone 585 days without being checked
[22:29:38] RECOVERY - Host cp1018 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[22:29:58] RECOVERY - DPKG on cp1018 is OK: All packages OK
[22:30:38] PROBLEM - DPKG on cp1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[22:31:38] RECOVERY - DPKG on cp1003 is OK: All packages OK
[22:33:41] !log shutting down zinc
[22:33:50] Logged the message, Master
[22:34:20] LeslieCarr: mark: I thought this networking related content dispute might interest you :) https://commons.wikimedia.org/wiki/Commons:Deletion_requests/File:OME-100G_Module.jpg
[22:35:46] PROBLEM - Host zinc is DOWN: PING CRITICAL - Packet loss = 100%
[22:37:05] New patchset: RobH; "zinc needs wipe and removal from service for later reuse" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64455
[22:38:26] RECOVERY - SSH on spence is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:38:58] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64455
[22:39:46] PROBLEM - Host cp1003 is DOWN: PING CRITICAL - Packet loss = 100%
[22:39:46] PROBLEM - Host cp1002 is DOWN: PING CRITICAL - Packet loss = 100%
[22:39:56] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours
[22:40:46] RECOVERY - Host cp1002 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[22:41:22] RECOVERY - Host cp1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[22:46:51] New patchset: Asher; "rbr for prelabs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64456
[22:47:21] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64456
[22:52:06] PROBLEM - Host cp1001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:53:26] RECOVERY - Host cp1001 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms
[22:54:27] !log dist-upgrading capella ( IPv6 tunnel relay)
[22:54:35] Logged the message, Master
[22:57:24] !log upgrading and rebooting cp1001-1020 (3 at a time)
[22:57:32] Logged the message, Master
[23:01:48] !log using salt to dist-upgrade all the tampa apaches... nothing could go wrong....right?
[23:01:54] :))
[23:01:57] Logged the message, RobH
[23:02:04] sounds fun
[23:02:46] PROBLEM - Host cp1006 is DOWN: PING CRITICAL - Packet loss = 100%
[23:02:56] PROBLEM - Host cp1004 is DOWN: PING CRITICAL - Packet loss = 100%
[23:02:56] PROBLEM - Host cp1005 is DOWN: PING CRITICAL - Packet loss = 100%
[23:04:16] RECOVERY - Host cp1006 is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms
[23:04:16] RECOVERY - Host cp1005 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[23:04:36] RECOVERY - Host cp1004 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[23:04:49] !log dist-upgrading chromium
[23:04:58] Logged the message, Master
[23:05:14] PROBLEM - NTP on cp1005 is CRITICAL: NTP CRITICAL: Offset unknown
[23:05:14] PROBLEM - NTP on cp1006 is CRITICAL: NTP CRITICAL: Offset unknown
[23:05:24] PROBLEM - NTP on cp1004 is CRITICAL: NTP CRITICAL: Offset unknown
[23:06:54] PROBLEM - DPKG on mw6 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:07:24] PROBLEM - DPKG on mw37 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:07:24] PROBLEM - DPKG on mw30 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:07:34] PROBLEM - DPKG on mw44 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:07:34] PROBLEM - DPKG on mw49 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:07:34] PROBLEM - DPKG on mw33 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:07:34] PROBLEM - DPKG on mw72 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:07:34] PROBLEM - DPKG on mw27 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:07:35] PROBLEM - DPKG on mw47 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:07:54] PROBLEM - DPKG on mw43 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:07:54] PROBLEM - Apache HTTP on mw112 is CRITICAL: Connection refused
[23:08:14] PROBLEM - Apache HTTP on mw102 is CRITICAL: Connection refused
[23:08:27] PROBLEM - Apache HTTP on mw104 is CRITICAL: Connection refused
[23:08:27] RECOVERY - NTP on cp1004 is OK: NTP OK: Offset -0.09772586823 secs
[23:08:34] RECOVERY - DPKG on mw49 is OK: All packages OK
[23:08:34] RECOVERY - DPKG on mw72 is OK: All packages OK
[23:08:44] RECOVERY - NTP on cp1005 is OK: NTP OK: Offset -0.02103590965 secs
[23:08:54] RECOVERY - DPKG on mw43 is OK: All packages OK
[23:09:04] PROBLEM - Apache HTTP on mw30 is CRITICAL: Connection refused
[23:09:14] PROBLEM - Apache HTTP on mw44 is CRITICAL: Connection refused
[23:09:24] PROBLEM - Apache HTTP on mw49 is CRITICAL: Connection refused
[23:09:24] RECOVERY - DPKG on mw37 is OK: All packages OK
[23:09:24] RECOVERY - DPKG on mw30 is OK: All packages OK
[23:09:25] PROBLEM - Apache HTTP on mw33 is CRITICAL: Connection refused
[23:09:25] PROBLEM - Apache HTTP on mw43 is CRITICAL: Connection refused
[23:09:34] PROBLEM - Apache HTTP on mw72 is CRITICAL: Connection refused
[23:09:34] PROBLEM - Apache HTTP on mw37 is CRITICAL: Connection refused
[23:09:34] RECOVERY - DPKG on mw44 is OK: All packages OK
[23:09:34] RECOVERY - DPKG on mw33 is OK: All packages OK
[23:09:34] RECOVERY - DPKG on mw27 is OK: All packages OK
[23:09:35] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused
[23:09:35] RECOVERY - DPKG on mw47 is OK: All packages OK
[23:10:54] RECOVERY - DPKG on mw6 is OK: All packages OK
[23:11:04] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.113 second response time
[23:11:15] !log still doing pmtpa mw upgrades, ignore all icinga alarms for now
[23:11:24] Logged the message, RobH
[23:12:24] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.111 second response time
[23:12:40] !log dist-upgrading hydrogen, manutius
[23:12:49] Logged the message, Master
[23:13:01] !log rebooting all pmtpa mw servers
[23:13:11] Logged the message, RobH
[23:13:14] RECOVERY - NTP on cp1006 is OK: NTP OK: Offset -0.001081585884 secs
[23:13:54] RECOVERY - Apache HTTP on mw112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.139 second response time
[23:15:04] PROBLEM - Host mw17 is DOWN: PING CRITICAL - Packet loss = 100%
[23:15:04] PROBLEM - Host mw88 is DOWN: PING CRITICAL - Packet loss = 100%
[23:15:04] PROBLEM - Host mw7 is DOWN: PING CRITICAL - Packet loss = 100%
[23:15:04] PROBLEM - Host mw73 is DOWN: PING CRITICAL - Packet loss = 100%
[23:15:04] PROBLEM - Host mw74 is DOWN: PING CRITICAL - Packet loss = 100%
[23:15:04] PROBLEM - Host mw16 is DOWN: PING CRITICAL - Packet loss = 100%
[23:15:10] icinga flood is going to continue.
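RobH's "using salt to dist-upgrade all the tampa apaches" above can be run with salt's batch mode, so only a few minions upgrade at once; that is why the icinga PROBLEM/RECOVERY flood rolls through in small groups rather than all hosts dropping together. A hedged sketch, not the command actually run: the `mw*` target glob and batch size of 3 are assumptions, the batch size borrowed from the "(3 at a time)" note for cp1001-1020.

```shell
# Build a rolling dist-upgrade command for the salt master.
# Assumptions: the tampa apaches match the 'mw*' minion glob,
# and 3 hosts at a time keeps the cluster serving traffic.
TARGET='mw*'
BATCH=3
CMD='DEBIAN_FRONTEND=noninteractive apt-get -y dist-upgrade'
SALT_CMD="salt -b $BATCH '$TARGET' cmd.run '$CMD'"
echo "$SALT_CMD"    # run the printed command on the salt master, not here
```

With `-b` (batch size), salt only dispatches `cmd.run` to the next host once a slot in the current batch frees up, instead of hitting every matched minion simultaneously.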
[23:15:14] PROBLEM - Host mw15 is DOWN: PING CRITICAL - Packet loss = 100% [23:15:14] PROBLEM - Host mw18 is DOWN: PING CRITICAL - Packet loss = 100% [23:15:14] PROBLEM - Host mw19 is DOWN: PING CRITICAL - Packet loss = 100% [23:15:14] PROBLEM - Host mw21 is DOWN: PING CRITICAL - Packet loss = 100% [23:15:14] PROBLEM - Host mw28 is DOWN: PING CRITICAL - Packet loss = 100% [23:16:24] RECOVERY - Host mw34 is UP: PING OK - Packet loss = 0%, RTA = 27.99 ms [23:16:24] RECOVERY - Host mw77 is UP: PING OK - Packet loss = 0%, RTA = 27.06 ms [23:16:24] RECOVERY - Host mw79 is UP: PING OK - Packet loss = 0%, RTA = 27.15 ms [23:16:24] RECOVERY - Host mw42 is UP: PING OK - Packet loss = 0%, RTA = 26.77 ms [23:16:24] RECOVERY - Host mw20 is UP: PING OK - Packet loss = 0%, RTA = 27.05 ms [23:17:24] RECOVERY - LVS HTTP IPv4 on api.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2759 bytes in 0.145 second response time [23:17:34] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63583 bytes in 0.540 second response time [23:17:54] PROBLEM - Host cp1008 is DOWN: PING CRITICAL - Packet loss = 100% [23:18:04] RECOVERY - Host cp1009 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [23:18:24] PROBLEM - Apache HTTP on mw34 is CRITICAL: Connection refused [23:18:24] PROBLEM - Apache HTTP on mw81 is CRITICAL: Connection refused [23:18:24] PROBLEM - Apache HTTP on mw108 is CRITICAL: Connection refused [23:18:24] PROBLEM - Apache HTTP on mw70 is CRITICAL: Connection refused [23:18:24] PROBLEM - Apache HTTP on mw115 is CRITICAL: Connection refused [23:18:25] PROBLEM - Apache HTTP on mw35 is CRITICAL: Connection refused [23:18:25] PROBLEM - Apache HTTP on mw43 is CRITICAL: Connection refused [23:18:26] PROBLEM - Apache HTTP on mw92 is CRITICAL: Connection refused [23:18:26] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [23:18:34] PROBLEM - Apache HTTP on mw22 is CRITICAL: Connection refused [23:18:34] PROBLEM - Apache 
HTTP on mw17 is CRITICAL: Connection refused [23:18:34] PROBLEM - Apache HTTP on mw25 is CRITICAL: Connection refused [23:18:34] PROBLEM - Apache HTTP on mw113 is CRITICAL: Connection refused [23:18:34] PROBLEM - Apache HTTP on mw74 is CRITICAL: Connection refused [23:18:35] PROBLEM - Apache HTTP on mw40 is CRITICAL: Connection refused [23:18:35] PROBLEM - Apache HTTP on mw114 is CRITICAL: Connection refused [23:18:36] PROBLEM - Apache HTTP on mw78 is CRITICAL: Connection refused [23:18:36] PROBLEM - Apache HTTP on mw88 is CRITICAL: Connection refused [23:18:37] PROBLEM - Apache HTTP on mw119 is CRITICAL: Connection refused [23:18:37] PROBLEM - Apache HTTP on mw91 is CRITICAL: Connection refused [23:18:38] PROBLEM - Apache HTTP on mw53 is CRITICAL: Connection refused [23:18:38] PROBLEM - Apache HTTP on mw107 is CRITICAL: Connection refused [23:18:39] PROBLEM - Apache HTTP on mw121 is CRITICAL: Connection refused [23:18:39] PROBLEM - Apache HTTP on mw120 is CRITICAL: Connection refused [23:18:40] PROBLEM - Apache HTTP on mw20 is CRITICAL: Connection refused [23:18:40] PROBLEM - Apache HTTP on mw87 is CRITICAL: Connection refused [23:18:41] PROBLEM - Apache HTTP on mw90 is CRITICAL: Connection refused [23:18:41] PROBLEM - Apache HTTP on mw62 is CRITICAL: Connection refused [23:18:44] PROBLEM - Apache HTTP on mw68 is CRITICAL: Connection refused [23:18:44] PROBLEM - Apache HTTP on mw63 is CRITICAL: Connection refused [23:18:44] PROBLEM - Apache HTTP on mw48 is CRITICAL: Connection refused [23:18:44] PROBLEM - Apache HTTP on mw123 is CRITICAL: Connection refused [23:18:44] PROBLEM - Apache HTTP on mw97 is CRITICAL: Connection refused [23:18:45] PROBLEM - Apache HTTP on mw79 is CRITICAL: Connection refused [23:18:45] PROBLEM - Apache HTTP on mw47 is CRITICAL: Connection refused [23:18:46] PROBLEM - Apache HTTP on mw66 is CRITICAL: Connection refused [23:18:54] PROBLEM - Apache HTTP on mw103 is CRITICAL: Connection refused [23:18:54] PROBLEM - Apache HTTP on mw61 is 
CRITICAL: Connection refused [23:18:54] PROBLEM - Apache HTTP on mw84 is CRITICAL: Connection refused [23:18:54] PROBLEM - Apache HTTP on mw71 is CRITICAL: Connection refused [23:18:54] PROBLEM - Apache HTTP on mw105 is CRITICAL: Connection refused [23:18:55] PROBLEM - Apache HTTP on mw118 is CRITICAL: Connection refused [23:18:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:56] PROBLEM - Apache HTTP on mw112 is CRITICAL: Connection refused [23:18:56] PROBLEM - Apache HTTP on mw64 is CRITICAL: Connection refused [23:19:04] PROBLEM - Apache HTTP on mw100 is CRITICAL: Connection refused [23:19:04] PROBLEM - Apache HTTP on mw69 is CRITICAL: Connection refused [23:19:04] PROBLEM - Apache HTTP on mw124 is CRITICAL: Connection refused [23:19:04] PROBLEM - Apache HTTP on mw122 is CRITICAL: Connection refused [23:19:04] PROBLEM - Apache HTTP on mw19 is CRITICAL: Connection refused [23:19:05] PROBLEM - Apache HTTP on mw46 is CRITICAL: Connection refused [23:19:05] PROBLEM - Apache HTTP on mw52 is CRITICAL: Connection refused [23:19:06] PROBLEM - Apache HTTP on mw30 is CRITICAL: Connection refused [23:19:06] PROBLEM - Host manutius is DOWN: PING CRITICAL - Packet loss = 100% [23:19:14] PROBLEM - Apache HTTP on mw65 is CRITICAL: Connection refused [23:19:14] PROBLEM - Apache HTTP on mw67 is CRITICAL: Connection refused [23:19:14] PROBLEM - Apache HTTP on mw75 is CRITICAL: Connection refused [23:19:14] PROBLEM - Apache HTTP on mw29 is CRITICAL: Connection refused [23:19:14] PROBLEM - Apache HTTP on mw83 is CRITICAL: Connection refused [23:19:15] PROBLEM - Apache HTTP on mw26 is CRITICAL: Connection refused [23:19:15] PROBLEM - Apache HTTP on mw77 is CRITICAL: Connection refused [23:19:16] PROBLEM - Apache HTTP on mw82 is CRITICAL: Connection refused [23:19:16] PROBLEM - Apache HTTP on mw86 is CRITICAL: Connection refused [23:19:17] PROBLEM - Apache HTTP on mw28 is CRITICAL: Connection refused [23:19:17] PROBLEM - 
Apache HTTP on mw116 is CRITICAL: Connection refused [23:19:18] PROBLEM - Apache HTTP on mw56 is CRITICAL: Connection refused [23:19:18] PROBLEM - Apache HTTP on mw21 is CRITICAL: Connection refused [23:19:19] PROBLEM - Apache HTTP on mw93 is CRITICAL: Connection refused [23:19:19] PROBLEM - Apache HTTP on mw36 is CRITICAL: Connection refused [23:19:20] PROBLEM - Apache HTTP on mw24 is CRITICAL: Connection refused [23:19:24] PROBLEM - Apache HTTP on mw55 is CRITICAL: Connection refused [23:19:24] PROBLEM - Apache HTTP on mw94 is CRITICAL: Connection refused [23:19:24] PROBLEM - Apache HTTP on mw32 is CRITICAL: Connection refused [23:19:24] PROBLEM - Apache HTTP on mw39 is CRITICAL: Connection refused [23:19:24] PROBLEM - Apache HTTP on mw109 is CRITICAL: Connection refused [23:19:25] PROBLEM - Apache HTTP on mw73 is CRITICAL: Connection refused [23:19:25] PROBLEM - Apache HTTP on mw42 is CRITICAL: Connection refused [23:19:26] PROBLEM - Apache HTTP on mw18 is CRITICAL: Connection refused [23:19:26] PROBLEM - Apache HTTP on mw125 is CRITICAL: Connection refused [23:19:27] PROBLEM - Apache HTTP on mw111 is CRITICAL: Connection refused [23:19:27] PROBLEM - Apache HTTP on mw45 is CRITICAL: Connection refused [23:19:28] PROBLEM - Apache HTTP on mw96 is CRITICAL: Connection refused [23:19:28] RECOVERY - Host cp1007 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:19:29] RECOVERY - Host cp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [23:19:29] PROBLEM - Apache HTTP on mw106 is CRITICAL: Connection refused [23:19:30] PROBLEM - Apache HTTP on mw58 is CRITICAL: Connection refused [23:19:30] PROBLEM - Apache HTTP on mw23 is CRITICAL: Connection refused [23:19:31] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [23:19:31] PROBLEM - Apache HTTP on mw85 is CRITICAL: Connection refused [23:19:32] PROBLEM - Apache HTTP on mw117 is CRITICAL: Connection refused [23:19:32] PROBLEM - Apache HTTP on mw95 is CRITICAL: Connection refused [23:19:33] PROBLEM - 
Apache HTTP on mw38 is CRITICAL: Connection refused [23:19:33] PROBLEM - Apache HTTP on mw60 is CRITICAL: Connection refused [23:19:34] PROBLEM - Apache HTTP on mw51 is CRITICAL: Connection refused [23:19:34] PROBLEM - Apache HTTP on mw41 is CRITICAL: Connection refused [23:19:35] PROBLEM - Apache HTTP on mw54 is CRITICAL: Connection refused [23:19:35] PROBLEM - Apache HTTP on mw76 is CRITICAL: Connection refused [23:19:36] PROBLEM - Apache HTTP on mw101 is CRITICAL: Connection refused [23:19:36] PROBLEM - Apache HTTP on mw59 is CRITICAL: Connection refused [23:19:37] PROBLEM - Apache HTTP on mw99 is CRITICAL: Connection refused [23:19:44] PROBLEM - Apache HTTP on mw89 is CRITICAL: Connection refused [23:20:24] PROBLEM - Frontend Squid HTTP on cp1009 is CRITICAL: Connection refused [23:20:24] PROBLEM - Backend Squid HTTP on cp1009 is CRITICAL: Connection refused [23:20:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.870 second response time [23:21:34] PROBLEM - Backend Squid HTTP on cp1007 is CRITICAL: Connection refused [23:21:34] PROBLEM - Frontend Squid HTTP on cp1008 is CRITICAL: Connection refused [23:21:44] RECOVERY - Host manutius is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [23:22:04] PROBLEM - Frontend Squid HTTP on cp1007 is CRITICAL: Connection refused [23:22:14] PROBLEM - Backend Squid HTTP on cp1008 is CRITICAL: Connection refused [23:22:24] RECOVERY - Frontend Squid HTTP on cp1009 is OK: HTTP OK: HTTP/1.0 200 OK - 1283 bytes in 0.004 second response time [23:22:34] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: Connection refused [23:22:36] RECOVERY - Frontend Squid HTTP on cp1008 is OK: HTTP OK: HTTP/1.0 200 OK - 1283 bytes in 0.004 second response time [23:23:14] RECOVERY - Backend Squid HTTP on cp1008 is OK: HTTP OK: HTTP/1.0 200 OK - 1250 bytes in 0.005 second response time [23:23:24] RECOVERY - Backend Squid HTTP on cp1009 is OK: HTTP OK: HTTP/1.0 200 OK - 1257 
bytes in 0.005 second response time [23:23:34] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63583 bytes in 0.529 second response time [23:24:04] RECOVERY - Frontend Squid HTTP on cp1007 is OK: HTTP OK: HTTP/1.0 200 OK - 1290 bytes in 0.001 second response time [23:24:14] RECOVERY - Apache HTTP on mw75 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.150 second response time [23:24:34] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.163 second response time [23:24:34] RECOVERY - Apache HTTP on mw76 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.405 second response time [23:24:34] RECOVERY - Backend Squid HTTP on cp1007 is OK: HTTP OK: HTTP/1.0 200 OK - 1250 bytes in 0.009 second response time [23:24:34] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.142 second response time [23:24:41] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 63581 bytes in 1.661 second response time [23:24:41] RECOVERY - Apache HTTP on mw113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.010 second response time [23:24:44] RECOVERY - Apache HTTP on mw63 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.133 second response time [23:24:44] RECOVERY - Apache HTTP on mw79 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.386 second response time [23:24:54] RECOVERY - Apache HTTP on mw112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.351 second response time [23:25:04] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.141 second response time [23:25:14] RECOVERY - Apache HTTP on mw77 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.134 second response time [23:25:14] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.133 
second response time [23:25:24] RECOVERY - Apache HTTP on mw104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.137 second response time [23:25:24] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.143 second response time [23:25:34] RECOVERY - Apache HTTP on mw78 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.340 second response time [23:25:44] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.126 second response time [23:26:04] RECOVERY - Apache HTTP on mw122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.362 second response time [23:26:04] PROBLEM - Host cp1010 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:04] PROBLEM - Host cp1011 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:14] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.383 second response time [23:26:24] RECOVERY - Apache HTTP on mw23 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.148 second response time [23:26:34] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.147 second response time [23:26:34] PROBLEM - DPKG on labstore2 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:34] RECOVERY - Apache HTTP on mw120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.252 second response time [23:26:42] New patchset: Odder; "(bug 48578) Enable LQT for all namespaces on ptwikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64460 [23:26:44] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.297 second response time [23:26:44] RECOVERY - Apache HTTP on mw123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.351 second response time [23:26:44] RECOVERY - Apache HTTP on mw97 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.623 
second response time [23:26:54] RECOVERY - Apache HTTP on mw103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.140 second response time [23:26:54] PROBLEM - Host cp1012 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:04] RECOVERY - Host cp1010 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [23:27:04] RECOVERY - Host cp1011 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [23:27:14] RECOVERY - Apache HTTP on mw83 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.155 second response time [23:27:14] RECOVERY - Apache HTTP on mw82 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.156 second response time [23:27:14] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.142 second response time [23:27:14] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.173 second response time [23:27:16] !log dist-upgrading gurvin [23:27:24] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.140 second response time [23:27:24] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.147 second response time [23:27:24] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.147 second response time [23:27:24] RECOVERY - Apache HTTP on mw94 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.169 second response time [23:27:24] RECOVERY - Apache HTTP on mw81 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.142 second response time [23:27:24] Logged the message, Master [23:27:25] RECOVERY - Apache HTTP on mw85 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.168 second response time [23:27:25] RECOVERY - Apache HTTP on mw96 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.724 second response time [23:27:26] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 747 bytes in 0.133 second response time [23:27:26] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.137 second response time [23:27:27] RECOVERY - Apache HTTP on mw95 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.170 second response time [23:27:27] RECOVERY - Apache HTTP on mw108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.978 second response time [23:27:28] RECOVERY - Apache HTTP on mw92 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.142 second response time [23:27:28] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.137 second response time [23:27:34] RECOVERY - Apache HTTP on mw88 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.137 second response time [23:27:34] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.148 second response time [23:27:34] RECOVERY - Apache HTTP on mw114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.128 second response time [23:27:34] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.140 second response time [23:27:34] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.147 second response time [23:27:35] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.147 second response time [23:27:35] RECOVERY - Apache HTTP on mw107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.144 second response time [23:27:36] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.304 second response time [23:27:36] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.347 second response time [23:27:37] RECOVERY - Apache HTTP on mw119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 
0.356 second response time [23:27:37] RECOVERY - DPKG on labstore2 is OK: All packages OK [23:27:38] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.148 second response time [23:27:38] RECOVERY - Apache HTTP on mw87 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.163 second response time [23:27:39] RECOVERY - Apache HTTP on mw121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.388 second response time [23:27:39] RECOVERY - Apache HTTP on mw99 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.126 second response time [23:27:40] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.264 second response time [23:27:40] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.503 second response time [23:27:41] RECOVERY - Apache HTTP on mw101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.864 second response time [23:27:41] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.315 second response time [23:27:42] RECOVERY - Apache HTTP on mw90 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.309 second response time [23:27:44] RECOVERY - Apache HTTP on mw89 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.142 second response time [23:27:44] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.146 second response time [23:27:44] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.363 second response time [23:27:54] RECOVERY - Apache HTTP on mw118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.120 second response time [23:27:54] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.118 second response time [23:27:54] RECOVERY - Apache HTTP on mw105 is OK: HTTP OK: HTTP/1.1 301 
Moved Permanently - 747 bytes in 0.128 second response time [23:27:54] RECOVERY - Apache HTTP on mw84 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.162 second response time [23:27:54] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.101 second response time [23:27:55] RECOVERY - Apache HTTP on mw64 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.328 second response time [23:28:04] RECOVERY - Apache HTTP on mw100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.161 second response time [23:28:04] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.126 second response time [23:28:04] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.136 second response time [23:28:04] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.144 second response time [23:28:04] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.344 second response time [23:28:05] RECOVERY - Apache HTTP on mw124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.346 second response time [23:28:14] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.136 second response time [23:28:14] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.134 second response time [23:28:14] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.120 second response time [23:28:14] RECOVERY - Apache HTTP on mw102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.129 second response time [23:28:14] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.144 second response time [23:28:15] RECOVERY - Apache HTTP on mw86 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes 
in 0.142 second response time [23:28:15] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.128 second response time [23:28:16] RECOVERY - Apache HTTP on mw56 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.139 second response time [23:28:16] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.148 second response time [23:28:17] RECOVERY - Apache HTTP on mw93 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.152 second response time [23:28:24] RECOVERY - Apache HTTP on mw125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.117 second response time [23:28:24] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.117 second response time [23:28:24] RECOVERY - Apache HTTP on mw109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.153 second response time [23:28:24] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.140 second response time [23:28:24] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.151 second response time [23:28:25] RECOVERY - Apache HTTP on mw111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.128 second response time [23:28:25] RECOVERY - Host cp1012 is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms [23:28:26] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.124 second response time [23:28:26] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.134 second response time [23:28:27] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.121 second response time [23:28:27] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.114 second response time [23:28:28] RECOVERY - Apache HTTP on mw70 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.117 second response time [23:28:28] RECOVERY - Apache HTTP on mw106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.127 second response time [23:28:29] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.151 second response time [23:28:29] RECOVERY - Apache HTTP on mw117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.365 second response time [23:28:30] RECOVERY - Apache HTTP on mw115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.128 second response time [23:28:30] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.124 second response time [23:28:34] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.128 second response time [23:28:34] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.151 second response time [23:28:34] RECOVERY - Apache HTTP on mw91 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.123 second response time [23:32:34] PROBLEM - DPKG on cp1013 is CRITICAL: Timeout while attempting connection [23:32:45] !log upgrading all srv*.pmtpa.wmnet via dist-upgrade in salt. [23:32:53] Logged the message, RobH [23:32:55] ok folks, get ready for another icinga storm. 
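The Apache recovery messages above all carry the same check_http-style detail string ("HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.148 second response time"). A minimal sketch of pulling the status code, size, and latency out of that string; the regex is inferred from the log text itself, not from the plugin's specification, and the function name is mine:

```python
import re

# Detail string as it appears in the icinga-wm messages above.
# Field names and regex shape are assumptions based on the log text.
DETAIL_RE = re.compile(
    r"HTTP OK: (?P<proto>HTTP/[\d.]+) (?P<code>\d{3}) (?P<reason>.+?) - "
    r"(?P<bytes>\d+) bytes in (?P<secs>[\d.]+) second response time"
)

def parse_check_http(detail):
    """Return (HTTP status code, body size in bytes, response time in seconds)."""
    m = DETAIL_RE.match(detail)
    if not m:
        raise ValueError("unrecognized check_http output: %r" % detail)
    return int(m.group("code")), int(m.group("bytes")), float(m.group("secs"))

print(parse_check_http(
    "HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.148 second response time"
))
```

The constant 747-byte, 301 responses are consistent with every Apache answering the health-check URL with the same redirect, which is why only the response time varies across hosts.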
[23:33:44] PROBLEM - Host cp1014 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:54] PROBLEM - Host cp1013 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:54] PROBLEM - Host cp1015 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:54] RECOVERY - Host cp1014 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [23:35:24] RECOVERY - Host cp1015 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [23:35:24] RECOVERY - Host cp1013 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [23:35:24] RECOVERY - DPKG on cp1013 is OK: All packages OK [23:35:38] PROBLEM - DPKG on srv279 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:35:38] PROBLEM - DPKG on srv286 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:35:38] PROBLEM - DPKG on srv259 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:35:38] PROBLEM - DPKG on srv269 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:35:44] PROBLEM - DPKG on srv267 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:35:44] PROBLEM - DPKG on srv291 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:35:54] PROBLEM - DPKG on srv262 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:35:54] PROBLEM - DPKG on srv284 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:04] PROBLEM - DPKG on srv260 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:14] PROBLEM - DPKG on srv258 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:14] PROBLEM - DPKG on srv280 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:24] PROBLEM - DPKG on srv282 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:24] PROBLEM - DPKG on srv287 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:24] PROBLEM - DPKG on srv288 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:24] PROBLEM - DPKG on srv270 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:28] PROBLEM - Apache HTTP on srv241 is CRITICAL: Connection refused 
[23:37:46] New patchset: Alex Monk; "Change link in notifyNewProjects to HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64462 [23:37:48] RECOVERY - DPKG on srv262 is OK: All packages OK [23:37:58] PROBLEM - Apache HTTP on srv282 is CRITICAL: Connection refused [23:37:58] RECOVERY - DPKG on srv260 is OK: All packages OK [23:38:18] PROBLEM - Apache HTTP on srv270 is CRITICAL: Connection refused [23:38:18] PROBLEM - Apache HTTP on srv262 is CRITICAL: Connection refused [23:38:18] RECOVERY - DPKG on srv280 is OK: All packages OK [23:38:18] PROBLEM - Apache HTTP on srv288 is CRITICAL: Connection refused [23:38:18] RECOVERY - DPKG on srv282 is OK: All packages OK [23:38:28] PROBLEM - Apache HTTP on srv260 is CRITICAL: Connection refused [23:38:28] PROBLEM - Apache HTTP on srv287 is CRITICAL: Connection refused [23:38:28] RECOVERY - DPKG on srv287 is OK: All packages OK [23:38:28] RECOVERY - DPKG on srv267 is OK: All packages OK [23:38:28] RECOVERY - DPKG on srv288 is OK: All packages OK [23:38:29] RECOVERY - DPKG on srv270 is OK: All packages OK [23:38:38] PROBLEM - Apache HTTP on srv269 is CRITICAL: Connection refused [23:38:38] RECOVERY - DPKG on srv279 is OK: All packages OK [23:38:38] PROBLEM - Apache HTTP on srv259 is CRITICAL: Connection refused [23:38:38] PROBLEM - Apache HTTP on srv267 is CRITICAL: Connection refused [23:38:38] PROBLEM - Apache HTTP on srv280 is CRITICAL: Connection refused [23:38:39] PROBLEM - Apache HTTP on srv291 is CRITICAL: Connection refused [23:38:39] PROBLEM - Apache HTTP on srv279 is CRITICAL: Connection refused [23:38:40] RECOVERY - DPKG on srv259 is OK: All packages OK [23:38:40] RECOVERY - DPKG on srv286 is OK: All packages OK [23:38:41] RECOVERY - DPKG on srv269 is OK: All packages OK [23:39:08] RECOVERY - DPKG on srv291 is OK: All packages OK [23:39:08] RECOVERY - DPKG on srv258 is OK: All packages OK [23:39:18] RECOVERY - DPKG on srv284 is OK: All packages OK [23:40:11] !log dist-upgrading praseodymium 
[23:40:19] Logged the message, Master [23:42:24] sorry in advance for the flood, srv* is down [23:42:26] \o/ [23:42:38] PROBLEM - Host srv279 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:38] PROBLEM - Host srv281 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:38] PROBLEM - Host srv296 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:38] PROBLEM - Host srv293 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:48] that same message a year ago would have sounded so different, heh [23:43:38] PROBLEM - Host srv253 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:38] PROBLEM - Host srv239 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:38] PROBLEM - Host srv251 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:38] PROBLEM - Host srv260 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:48] PROBLEM - Host srv274 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:48] PROBLEM - Host srv301 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:48] PROBLEM - Host srv284 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:48] PROBLEM - Host srv262 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:48] PROBLEM - Host srv269 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:54] srv is down, yet wikipedia is fine.
[23:44:15] !log all tampa based apaches have had kernel upgrades [23:44:25] Logged the message, RobH [23:45:08] RECOVERY - Host srv272 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [23:45:08] RECOVERY - Host srv296 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [23:45:18] RECOVERY - Host srv264 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [23:45:18] PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100% [23:45:28] RECOVERY - Host srv267 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms [23:45:37] !log dist-upgrading nickel [23:45:45] Logged the message, Master [23:45:48] RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [23:46:18] PROBLEM - Apache HTTP on srv244 is CRITICAL: Connection refused [23:46:28] PROBLEM - Apache HTTP on srv249 is CRITICAL: Connection refused [23:46:28] PROBLEM - Apache HTTP on srv235 is CRITICAL: Connection refused [23:46:28] PROBLEM - Apache HTTP on srv236 is CRITICAL: Connection refused [23:46:38] PROBLEM - Apache HTTP on srv239 is CRITICAL: Connection refused [23:46:38] PROBLEM - Apache HTTP on srv252 is CRITICAL: Connection refused [23:46:38] PROBLEM - Apache HTTP on srv238 is CRITICAL: Connection refused [23:46:38] PROBLEM - Apache HTTP on srv256 is CRITICAL: Connection refused [23:46:38] PROBLEM - Apache HTTP on srv193 is CRITICAL: Connection refused [23:46:48] PROBLEM - Apache HTTP on srv301 is CRITICAL: Connection refused [23:46:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:46:58] PROBLEM - Apache HTTP on srv246 is CRITICAL: Connection refused [23:46:58] PROBLEM - Apache HTTP on srv242 is CRITICAL: Connection refused [23:46:58] PROBLEM - Apache HTTP on srv251 is CRITICAL: Connection refused [23:46:58] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused [23:47:08] PROBLEM - Apache HTTP on srv263 is CRITICAL: Connection refused [23:47:18] PROBLEM - Apache HTTP on srv245 is CRITICAL: Connection refused 
[23:47:18] PROBLEM - Apache HTTP on srv277 is CRITICAL: Connection refused [23:47:18] PROBLEM - Apache HTTP on srv261 is CRITICAL: Connection refused [23:47:18] PROBLEM - Apache HTTP on srv293 is CRITICAL: Connection refused [23:47:18] PROBLEM - Apache HTTP on srv248 is CRITICAL: Connection refused [23:47:19] PROBLEM - Apache HTTP on srv253 is CRITICAL: Connection refused [23:47:19] PROBLEM - Apache HTTP on srv300 is CRITICAL: Connection refused [23:47:20] PROBLEM - Apache HTTP on srv276 is CRITICAL: Connection refused [23:47:20] PROBLEM - Apache HTTP on srv264 is CRITICAL: Connection refused [23:47:28] PROBLEM - Apache HTTP on srv258 is CRITICAL: Connection refused [23:47:29] PROBLEM - Apache HTTP on srv268 is CRITICAL: Connection refused [23:47:29] PROBLEM - Apache HTTP on srv271 is CRITICAL: Connection refused [23:47:29] PROBLEM - Apache HTTP on srv237 is CRITICAL: Connection refused [23:47:29] PROBLEM - Apache HTTP on srv285 is CRITICAL: Connection refused [23:47:29] PROBLEM - Apache HTTP on srv257 is CRITICAL: Connection refused [23:47:29] PROBLEM - Apache HTTP on srv240 is CRITICAL: Connection refused [23:47:30] PROBLEM - Apache HTTP on srv275 is CRITICAL: Connection refused [23:47:30] PROBLEM - Apache HTTP on srv250 is CRITICAL: Connection refused [23:47:31] PROBLEM - Apache HTTP on srv274 is CRITICAL: Connection refused [23:47:31] PROBLEM - Apache HTTP on srv299 is CRITICAL: Connection refused [23:47:32] PROBLEM - Apache HTTP on srv265 is CRITICAL: Connection refused [23:47:32] PROBLEM - Apache HTTP on srv255 is CRITICAL: Connection refused [23:47:38] PROBLEM - Apache HTTP on srv297 is CRITICAL: Connection refused [23:47:38] PROBLEM - Apache HTTP on srv254 is CRITICAL: Connection refused [23:47:38] PROBLEM - Apache HTTP on srv292 is CRITICAL: Connection refused [23:47:38] PROBLEM - Apache HTTP on srv243 is CRITICAL: Connection refused [23:47:38] PROBLEM - Apache HTTP on srv272 is CRITICAL: Connection refused [23:47:39] PROBLEM - Apache HTTP on srv286 is 
CRITICAL: Connection refused [23:47:39] PROBLEM - Apache HTTP on srv289 is CRITICAL: Connection refused [23:47:40] PROBLEM - Apache HTTP on srv298 is CRITICAL: Connection refused [23:47:40] PROBLEM - Apache HTTP on srv295 is CRITICAL: Connection refused [23:47:41] PROBLEM - Apache HTTP on srv290 is CRITICAL: Connection refused [23:47:41] PROBLEM - Apache HTTP on srv294 is CRITICAL: Connection refused [23:47:42] PROBLEM - Apache HTTP on srv247 is CRITICAL: Connection refused [23:47:42] PROBLEM - Apache HTTP on srv283 is CRITICAL: Connection refused [23:47:43] PROBLEM - Apache HTTP on srv273 is CRITICAL: Connection refused [23:47:48] PROBLEM - Apache HTTP on srv296 is CRITICAL: Connection refused [23:48:02] !log dist-upgrading singer [23:48:08] PROBLEM - HTTP on nickel is CRITICAL: Connection refused [23:48:10] Logged the message, Master [23:48:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.510 second response time [23:49:28] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.141 second response time [23:49:28] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.127 second response time [23:49:28] RECOVERY - Apache HTTP on srv250 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.215 second response time [23:49:28] PROBLEM - NTP on cp1013 is CRITICAL: NTP CRITICAL: Offset unknown [23:49:38] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.151 second response time [23:50:08] RECOVERY - HTTP on nickel is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.001 second response time [23:50:18] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.529 second response time [23:50:18] RECOVERY - Apache HTTP on srv248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.582 second response time [23:50:18] RECOVERY 
- Apache HTTP on srv244 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.135 second response time [23:50:18] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.245 second response time [23:50:28] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.124 second response time [23:50:28] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.126 second response time [23:50:28] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.033 second response time [23:50:28] RECOVERY - Apache HTTP on srv255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.297 second response time [23:50:28] RECOVERY - Apache HTTP on srv236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.129 second response time [23:50:38] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.127 second response time [23:50:38] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.129 second response time [23:50:38] RECOVERY - Apache HTTP on srv243 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.144 second response time [23:50:38] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.307 second response time [23:50:38] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.126 second response time [23:50:39] RECOVERY - Apache HTTP on srv193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.232 second response time [23:50:48] PROBLEM - Host singer is DOWN: PING CRITICAL - Packet loss = 100% [23:50:58] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.126 second response time [23:50:58] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 747 bytes in 0.144 second response time [23:50:58] RECOVERY - Apache HTTP on srv251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.270 second response time [23:51:08] RECOVERY - Apache HTTP on srv263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.117 second response time [23:51:18] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.125 second response time [23:51:18] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.128 second response time [23:51:18] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.137 second response time [23:51:18] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.143 second response time [23:51:18] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.127 second response time [23:51:19] RECOVERY - Apache HTTP on srv276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.140 second response time [23:51:28] RECOVERY - Apache HTTP on srv260 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.133 second response time [23:51:28] RECOVERY - Apache HTTP on srv287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.145 second response time [23:51:28] RECOVERY - Apache HTTP on srv285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.123 second response time [23:51:28] RECOVERY - Apache HTTP on srv271 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.342 second response time [23:51:28] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.142 second response time [23:51:29] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.110 second response time [23:51:29] RECOVERY - Apache HTTP on srv257 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 747 bytes in 0.256 second response time [23:51:30] RECOVERY - Apache HTTP on srv274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.131 second response time [23:51:30] RECOVERY - Apache HTTP on srv265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.132 second response time [23:51:31] RECOVERY - NTP on cp1013 is OK: NTP OK: Offset 0.07059931755 secs [23:51:38] RECOVERY - Apache HTTP on srv269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.128 second response time [23:51:38] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.140 second response time [23:51:38] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.118 second response time [23:51:38] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.133 second response time [23:51:38] RECOVERY - Apache HTTP on srv267 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.127 second response time [23:51:39] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.112 second response time [23:51:39] RECOVERY - Apache HTTP on srv279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.140 second response time [23:51:40] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.116 second response time [23:51:40] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.147 second response time [23:51:41] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.115 second response time [23:51:41] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.141 second response time [23:51:42] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.269 second response time 
[23:51:42] RECOVERY - Apache HTTP on srv291 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.303 second response time [23:51:43] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.320 second response time [23:51:43] RECOVERY - Apache HTTP on srv295 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.355 second response time [23:51:44] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.111 second response time [23:51:44] RECOVERY - Apache HTTP on srv283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.140 second response time [23:51:48] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.131 second response time [23:51:48] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.139 second response time [23:51:58] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.127 second response time [23:52:08] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [23:52:08] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [23:52:18] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.284 second response time [23:52:18] PROBLEM - Host cp1019 is DOWN: PING CRITICAL - Packet loss = 100% [23:52:18] PROBLEM - Host cp1020 is DOWN: PING CRITICAL - Packet loss = 100% [23:52:28] RECOVERY - Apache HTTP on srv299 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.115 second response time [23:53:28] RECOVERY - Host cp1020 is UP: PING OK - Packet loss = 0%, RTA = 5.06 ms [23:53:28] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 2.60 ms [23:53:38] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [23:54:28] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 
bytes in 0.140 second response time [23:54:58] RECOVERY - Apache HTTP on srv242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.126 second response time [23:55:18] RECOVERY - Apache HTTP on srv262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.135 second response time [23:55:18] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.133 second response time [23:55:38] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.352 second response time
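The service-check notifications in this flood follow one fixed shape, `[HH:MM:SS] PROBLEM|RECOVERY - <check> on <host> is <STATE>: <detail>` (host up/down messages use a slightly different `Host <name> is UP|DOWN` form and are skipped here). A minimal sketch of tallying the storm per check and host, assuming that shape; the regex and names are mine, not Icinga's:

```python
import re
from collections import Counter

# Service-check message shape assumed from the icinga-wm log above.
EVENT_RE = re.compile(
    r"\[(?P<ts>\d\d:\d\d:\d\d)\] (?P<kind>PROBLEM|RECOVERY) - "
    r"(?P<check>.+?) on (?P<host>\S+) is (?P<state>\w+):"
)

def flap_counts(lines):
    """Count PROBLEM/RECOVERY events per (check, host, kind) triple.

    Lines that do not match the service-check shape (host up/down
    notices, human chat, !log entries) are silently ignored.
    """
    counts = Counter()
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            counts[(m.group("check"), m.group("host"), m.group("kind"))] += 1
    return counts

sample = [
    "[23:35:54] PROBLEM - DPKG on srv262 is CRITICAL: DPKG CRITICAL dpkg reports broken packages",
    "[23:37:48] RECOVERY - DPKG on srv262 is OK: All packages OK",
    "[23:33:54] PROBLEM - Host cp1013 is DOWN: PING CRITICAL - Packet loss = 100%",  # skipped: host check
]
print(flap_counts(sample))
```

Applied to the section above, a summary like this makes the pattern obvious: each srv host flaps DPKG CRITICAL then OK, and Apache HTTP refuses connections briefly, exactly what a rolling dist-upgrade with Apache restarts would produce.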