[00:10:59] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[01:09:57] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[01:11:07] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[01:31:57] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.005211591721 secs
[01:32:27] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.001980185509 secs
[02:06:03] !log LocalisationUpdate completed (1.22wmf8) at Mon Jul 1 02:06:02 UTC 2013
[02:06:30] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[02:07:04] !log LocalisationUpdate completed (1.22wmf9) at Mon Jul 1 02:07:04 UTC 2013
[02:17:50] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 1 02:17:50 UTC 2013
[03:09:03] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[03:11:02] New patchset: Tim Starling; "Fix error handling in scap scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71313
[03:23:30] New review: Ori.livneh; "Why don't we rename "mw-deployment-vars.sh" to "deployment-env" and declare "die" there, since we're..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71313
[03:23:37] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71313
[03:23:59] d'oh
[03:25:50] New review: Tim Starling; "mw-deployment-vars.sh is not available on client servers, as previously noted. It seems to me that a..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71313
[03:29:23] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:33] PROBLEM - DPKG on mc15 is CRITICAL: Timeout while attempting connection
[03:31:23] RECOVERY - DPKG on mc15 is OK: All packages OK
[03:31:44] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.89 ms
[03:33:17] !log testing new scap
[03:34:12] what happened to morebots?
[03:34:46] afaik it hadn't needed to be manually restarted since the update a few weeks ago
[03:34:51] !log tstarling Started syncing Wikimedia installation... :
[03:35:09] well, i'll update SAL manually
[03:35:14] !log tstarling Finished syncing Wikimedia installation... :
[03:35:42] I've got a mule and her name is Sal,
[03:36:05] Looks like morebots quit a while ago.
[03:36:07] morebots just wants to embarrass me in front of tim
[03:36:32] !log tstarling Started syncing Wikimedia installation... :
[03:38:17] Yeah, the SAL is missing June 30 and July 1. :-/
[03:39:01] if you "and again" that bugzilla bug i may cry
[03:39:29] Heh.
[03:39:35] I was just looking for that bug.
[03:39:44] And debating whether I should file a new one.
[03:40:00] And also debating writing a script to check for morebots.
[03:40:07] furthermorebots
[03:40:25] moreover
[03:40:39] logmsgbot stays connected.
[03:40:45] It's strange that morebots struggles so.
[03:40:49] i wrote it from scratch
[03:41:03] You rewrote morebots you mean?
[03:41:07] logmsgbot
[03:41:20] we killed the old code
[03:41:36] morebots is still an archeological heap
[03:41:45] Oh, rewrote logmsgbot, then. I think the bot (as a concept) has been around since before me.
[03:41:54] it's still connected
[03:42:20] It's? I don't see a morebots on freenode.
[03:42:42] i guess that's why it's not restarting
[03:42:49] presumably tim went and checked the actual process
[03:42:59] i wonder what weird state it got itself into
[03:43:06] Connected and running are different...
[03:43:22] prove it
[03:43:46] My IRC bot pings an IRC server every minute and kills itself if it doesn't receive a pong.
[03:44:04] But every once in a while it'll somehow get stuck in limbo.
[03:44:17] it's connected to port 7000 on HUBBARD.CLUB.CC.CMU.EDU
[03:44:32] !log tstarling Finished syncing Wikimedia installation... :
[03:44:58] which is hubbard.freenode.net
[03:46:34] Netsplit?
[03:48:18] why doesn't it put timestamps in its log file?
[03:52:02] !log restarted morebots
[03:52:12] Logged the message, Master
[04:00:38] New patchset: Ori.livneh; "Timestamp log messages" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/71315
[04:00:49] ^ TimStarling
[04:01:35] it probably wouldn't have helped anyway, since there were no messages about it being booted out of channels, or asked to supply a nickname, or the like
[04:03:04] "Identi.ca is converting to pump.io the week of June 30th"
[04:03:23] they've threatened to do it before but relaxed the deadline when they realized absolutely no-one migrated
[04:03:41] but afaik Ryan disabled identi.ca integration, so perhaps it's unrelated
[04:03:56] it was not receiving IRC messages
[04:04:11] if identi.ca was broken, it should keep receiving messages from IRC, shouldn't it?
[04:06:00] i haven't looked at the backscroll but the bot doesn't ACK a !log until all the places it was supposed to notify have completed
[04:07:19] it would be very surprising if it were receiving messages from the channel without being in the channel
[04:07:28] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[04:07:35] Not receiving IRC messages and not apparently connected to the rest of freenode sounds like a netsplit or equivalent to me.
[04:10:16] TimStarling: how did you check its connection status, and what did you see?
[04:10:52] lsof showed:
[04:10:54] adminlogb 20258 adminbot 4u IPv4 8395416 0t0 TCP wikitech-static:57198->HUBBARD.CLUB.CC.CMU.EDU:afs3-fileserver (ESTABLISHED)
[04:11:06] strace showed:
[04:11:07] 1372650153.070033 select(5, [4], [], [], {0, 51423}) = 0 (Timeout)
[04:11:07] 1372650153.122075 gettimeofday({1372650153, 122173}, NULL) = 0
[04:11:07] 1372650153.122379 select(5, [4], [], [], {0, 100000}) = 0 (Timeout)
[04:11:07] 1372650153.222975 gettimeofday({1372650153, 223084}, NULL) = 0
[04:11:10] etc.
[04:19:37] http://poe.perl.org/?POE_Cookbook/IRC_Bot_Reconnecting describes a good reconnect algorithm as having three rules, the third being "Occasionally ping the server if we haven't seen anything from it."
[04:19:48] morebots doesn't
[04:19:48] I am a logbot running on wikitech-static.
[04:19:49] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[04:19:49] To log a message, type !log .
[04:21:25] irclib does have a 'set_keepalive' method on the ServerConnection object which causes it to ping the server at regular intervals
[04:21:56] but there's no logic to trigger a disconnect if not reply was received
[04:22:08] *no reply
[04:24:37] but writing to the socket will cause an EPIPE if the socket is closed for writing, so maybe that's enough to expose a zombie connection
[04:24:41] Trigger a reconnect you mean?
[04:25:34] yeah, the irc library that it uses reconnects by adding a connect call to the 'disconnect' event handler IIRC
[04:26:24] i'll file a bug but i don't want to deal with this myself, irc bot programming makes me feel like i'm 12
[04:26:46] hah, 12
[04:26:53] * jeremyb runs away :)
[04:29:50] Elsie: do you have the bot's quit message in your log?
[04:30:30] the last message it acknowledged was Sat Jun 29 02:11:56 UTC 2013
[04:36:24] 07:18 morebots has left IRC (Ping timeout: 276 seconds)
[04:37:57] "Interesting" mystery on ops-l
[04:43:41] Sorry, I was showering.
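[Editor's note: the missing piece discussed above — irclib's set_keepalive sends PINGs but nothing ever declares the connection dead when no PONG comes back — can be sketched as a small watchdog. This is a hypothetical illustration, not morebots' actual code; the class and method names are made up.]

```python
import time


class PingWatchdog:
    """Track PING/PONG liveness for an IRC connection.

    Sketch of the logic the channel notes is missing from irclib:
    set_keepalive pings the server at intervals, but no reply ever
    triggers a disconnect, leaving a "zombie" ESTABLISHED socket
    like the one lsof showed. All names here are hypothetical.
    """

    def __init__(self, ping_interval=60, pong_timeout=180, clock=time.monotonic):
        self.ping_interval = ping_interval
        self.pong_timeout = pong_timeout
        self.clock = clock
        self.last_pong = clock()
        self.last_ping = clock()

    def on_pong(self):
        # Call this from the bot's PONG event handler.
        self.last_pong = self.clock()

    def should_ping(self):
        # True when it is time to send another PING to the server.
        return self.clock() - self.last_ping >= self.ping_interval

    def record_ping(self):
        # Call this right after sending a PING.
        self.last_ping = self.clock()

    def is_zombie(self):
        # True when the server has been silent past the timeout; the
        # caller should disconnect, which (per the discussion above)
        # fires the library's 'disconnect' handler and reconnects.
        return self.clock() - self.last_pong > self.pong_timeout
```

The bot's main loop would check `should_ping()` each tick and force a disconnect whenever `is_zombie()` becomes true, implementing the POE cookbook's third rule.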
[04:43:56] Plus I have quit messages disabled in my client because they're awful.
[04:54:04] PROBLEM - Puppet freshness on wtp1015 is CRITICAL: No successful Puppet run in the last 10 hours
[05:01:04] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:10] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:20] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours
[05:19:00] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:29:30] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: Offset unknown
[05:31:18] New review: Ori.livneh; "> mw-deployment-vars.sh is not available on client servers, as previously noted." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71313
[06:08:47] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[07:11:53] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[07:19:09] hello
[07:40:32] PROBLEM - Puppet freshness on mw41 is CRITICAL: No successful Puppet run in the last 10 hours
[07:40:32] PROBLEM - Puppet freshness on mc1005 is CRITICAL: No successful Puppet run in the last 10 hours
[07:42:32] PROBLEM - Puppet freshness on db1036 is CRITICAL: No successful Puppet run in the last 10 hours
[07:42:32] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: No successful Puppet run in the last 10 hours
[07:42:32] PROBLEM - Puppet freshness on amssq44 is CRITICAL: No successful Puppet run in the last 10 hours
[07:43:34] PROBLEM - Puppet freshness on mw58 is CRITICAL: No successful Puppet run in the last 10 hours
[07:43:34] PROBLEM - Puppet freshness on sq42 is CRITICAL: No successful Puppet run in the last 10 hours
[07:43:34] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours
[07:46:46] New patchset: Nikerabbit; "ULS deployment phase 4" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71319
[07:51:12] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:52:02] RECOVERY - DPKG on searchidx1001 is OK: All packages OK
[07:53:22] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:54:12] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[07:55:28] hi hashar
[07:55:34] hello :-)
[07:55:55] I am playing with squid this morning
[07:56:18] oh?
[07:56:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:56:59] on beta we had a long recurring bug where page do not get purged
[07:57:06] that caused multiple incidents last week
[07:57:35] you know squid is on its way out, right?
[07:57:37] I think I tracked it down to only plain text cache being purged, where has the Accept-Encoding: gzip ones are not purged
[07:57:58] yeah maybe I should set up an instance with varnish
[07:58:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time
[07:59:14] paravoid: thanks for the suggestion :D
[07:59:22] I guess it is better to invest time in phasing out squid in beta
[07:59:34] the issue is most probably related to the beta squid conf
[08:00:19] have we VCLfied XVO?
[08:00:49] not really
[08:01:05] it's basically impossible with current varnish vary code
[08:01:16] so we reimplemented the few things that we use XVO for in VCL
[08:01:27] accept-language being the most nasty one
[08:01:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:07] sounds interesting. like what?
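[Editor's note: the beta purge bug hashar describes at 07:57 — only the plain-text variant gets purged while the Accept-Encoding: gzip variant stays stale — comes down to Vary creating one cache object per encoding, so a purge must drop every variant. A hypothetical toy sketch, not Squid's or Varnish's actual implementation:]

```python
import gzip


class VaryCache:
    """Toy cache keyed on (url, encoding), mimicking Vary: Accept-Encoding.

    Illustrates the beta-squid purge bug discussed in the channel:
    purging only the identity variant leaves the gzip variant stale.
    Hypothetical code for illustration only.
    """

    def __init__(self):
        self.store = {}  # (url, encoding) -> response body bytes

    def put(self, url, body):
        # A Vary: Accept-Encoding cache ends up holding both variants.
        self.store[(url, "identity")] = body
        self.store[(url, "gzip")] = gzip.compress(body)

    def get(self, url, accepts_gzip):
        encoding = "gzip" if accepts_gzip else "identity"
        return self.store.get((url, encoding))

    def purge_identity_only(self, url):
        # The buggy behaviour: only the plain variant is dropped,
        # so gzip-capable clients keep getting the stale page.
        self.store.pop((url, "identity"), None)

    def purge(self, url):
        # Correct behaviour: drop every variant of the URL.
        for key in [k for k in self.store if k[0] == url]:
            del self.store[key]
```

Varnish sidesteps the problem, as noted later in the log: it stores only the gzipped object and ungzips on the fly for non-gzip clients, so there is a single object per URL to purge.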
[08:03:13] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.001474499702 secs
[08:03:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[08:04:42] hm, maybe accept-language isn't checked in yet, mark was working on it
[08:04:51] the mobile redirect horror is in text-frontend.inc.vcl.erb
[08:05:09] (operations/puppet, templates/varnish/)
[08:05:24] the cookie stuff are there too
[08:05:32] accept-encoding is unneeded, since varnish handles this internally
[08:05:57] (caches the gzip content, ungzips on the fly if the client doesn't have gzip in its Accept-Encoding, a rare case)
[08:08:00] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[08:09:00] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:10:30] PROBLEM - Puppet freshness on amssq40 is CRITICAL: No successful Puppet run in the last 10 hours
[08:10:30] PROBLEM - Puppet freshness on aluminium is CRITICAL: No successful Puppet run in the last 10 hours
[08:10:30] PROBLEM - Puppet freshness on amssq31 is CRITICAL: No successful Puppet run in the last 10 hours
[08:10:30] PROBLEM - Puppet freshness on amssq37 is CRITICAL: No successful Puppet run in the last 10 hours
[08:10:30] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: No successful Puppet run in the last 10 hours
[08:11:40] right, but extensions can use a hook to declare additional cookies to vary on
[08:11:46] that looks unimplemented still
[08:12:02] and it can't be implemented, can it
[08:12:29] dunno, i trust you though :P
[08:12:30] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: No successful Puppet run in the last 10 hours
[08:12:30] PROBLEM - Puppet freshness on analytics1023 is CRITICAL: No successful Puppet run in the last 10 hours
[08:12:30] PROBLEM - Puppet freshness on db1029 is CRITICAL: No successful Puppet run in the last 10 hours
[08:12:30] PROBLEM - Puppet freshness on db58 is CRITICAL: No successful Puppet run in the last 10 hours
[08:12:30] PROBLEM - Puppet freshness on es6 is CRITICAL: No successful Puppet run in the last 10 hours
[08:12:31] mark raised the vary restrictions to varnish upstream
[08:12:43] New patchset: Hashar; "beta: removes incubator wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70804
[08:12:45] and they said they're going to have something varnish 4.0 to cover our use case
[08:12:50] in*
[08:13:16] what are the vary restrictions?
[08:13:30] PROBLEM - Puppet freshness on amssq53 is CRITICAL: No successful Puppet run in the last 10 hours
[08:13:30] PROBLEM - Puppet freshness on cp1024 is CRITICAL: No successful Puppet run in the last 10 hours
[08:13:30] PROBLEM - Puppet freshness on cp1036 is CRITICAL: No successful Puppet run in the last 10 hours
[08:13:30] PROBLEM - Puppet freshness on cp3009 is CRITICAL: No successful Puppet run in the last 10 hours
[08:13:30] PROBLEM - Puppet freshness on cp3012 is CRITICAL: No successful Puppet run in the last 10 hours
[08:13:38] https://www.varnish-cache.org/lists/pipermail/varnish-dev/2013-May/007574.html
[08:14:26] New patchset: Hashar; "beta: send purges to deployment-cache-text1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71320
[08:14:30] PROBLEM - Puppet freshness on cp1002 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:30] PROBLEM - Puppet freshness on cp1005 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:30] PROBLEM - Puppet freshness on cp1049 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:30] PROBLEM - Puppet freshness on db39 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:30] PROBLEM - Puppet freshness on ekrem is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:30] PROBLEM - Puppet freshness on helium is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:31] PROBLEM - Puppet freshness on mc1001 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:31] PROBLEM - Puppet freshness on ms10 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:32] PROBLEM - Puppet freshness on mw1043 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:32] PROBLEM - Puppet freshness on mw1087 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:32] PROBLEM - Puppet freshness on mw1106 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:33] PROBLEM - Puppet freshness on mw1063 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:34] PROBLEM - Puppet freshness on mw124 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:34] PROBLEM - Puppet freshness on mw20 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:35] PROBLEM - Puppet freshness on mw43 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:35] PROBLEM - Puppet freshness on mw57 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:36] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:36] PROBLEM - Puppet freshness on srv255 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:37] PROBLEM - Puppet freshness on stat1 is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:37] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours
[08:14:41] bah
[08:15:23] New patchset: Hashar; "beta: send purges to deployment-cache-text1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71320
[08:15:53] New review: Hashar; "self merge: that is just for beta" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/71320
[08:16:03] ori-l: not sure if you saw that among the nagios spam
[08:16:06] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71320
[08:17:04] paravoid: I filter icinga-wm /Puppet freshness/. reading mark's e-mail
[08:18:03] heh, good for you
[08:18:16] our monitoring is so broken :)
[08:18:30] * Aaron|home reads about BART
[08:18:46] AaronSchulz: hello :-]
[08:18:54] just work from home? :)
[08:19:05] that's what I do, never cared about bart :)
[08:20:37] * ori-l is stretching upstart a little https://gist.github.com/atdt/5899090
[08:23:07] ori-l: you should see what the ceph people are doing
[08:23:31] root@ms-fe1001:/etc/init# ls ceph-* |wc -l
[08:23:31] 12
[08:23:57] root@ms-fe1001:/etc/init# ls ceph-*-all* |wc -l
[08:23:57] 6
[08:25:11] https://github.com/ceph/ceph/tree/master/src/upstart
[08:25:55] wow, this is great
[08:26:08] it looks insane but it's great
[08:26:24] calling initctl from within upstart, yes, it's a bit insane
[08:26:47] but it might help you, stealing a few of their ideas
[08:27:15] it's in the cookbook: http://upstart.ubuntu.com/cookbook/#another-instance-example
[08:27:22] of course, that doesn't mean it's a good idea
[08:27:46] but upstart was deliberately engineered to make things like that possible
[08:27:52] okay
[08:28:00] I'm far from an upstart expert
[08:28:45] me neither, just had a bit of fun with that doc this weekend :)
[08:29:40] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: Offset unknown
[08:32:40] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.001249194145 secs
[08:33:00] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.0001798868179 secs
[08:33:54] ntp mad again :D
[08:35:01] ah the varnish cache seems to be working
[08:35:07] though there are no purges :-]
[08:35:21] going to switch the beta cache from the squid to the varnish
[08:37:40] mouaha that works!
[08:37:54] mark: beta text cache is now using varnish \O/ Still have to fix the purge system though
[09:09:41] New review: ArielGlenn; "Is it possible that with this approach we will remove files still in use?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71149
[09:10:28] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[09:14:08] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[09:14:08] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[09:14:08] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[09:14:08] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[09:14:08] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[09:14:08] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours
[09:14:09] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[09:14:09] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[09:14:10] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[09:27:16] New review: Ori.livneh; "There's no question that it'd be much nicer to apt-get install bugzilla; I'd be very happy to see th..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404
[09:28:40] * hashar rolls a dice for ori-l [4]
[09:28:43] oh no bugzilla dead :-D
[09:28:59] fortunately we fixed all the bugs
[09:29:15] :)
[09:29:17] I dislike having the bugzilla in puppet
[09:29:24] but that is definitely a nice step forward
[09:29:43] New review: ArielGlenn; "Tested and looks fine. Thanks." [operations/dumps] (ariel) C: 2; - https://gerrit.wikimedia.org/r/71087
[09:29:44] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/71087
[09:36:30] @notify binasher
[09:36:30] I'll let you know when I see binasher around here
[09:36:47] New review: Faidon; "https://code.launchpad.net/~hexmode/+junk/bugzilla4 has Mark's current working tree. If you want to ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404
[09:37:35] hello, is here anyone with access to replica servers for labs? these on 192.168.99.[1-7] ip's
[09:37:43] they seem to be down
[09:37:48] or lagged
[09:38:07] Ryan_Lane ^
[09:38:32] Any ops around to look at dsh potentially not running and blocking L10N updates? See https://bugzilla.wikimedia.org/show_bug.cgi?id=50433
[09:39:46] nvm me
[09:42:37] andre__: tim deployed a change to the l10n scripts a few hours ago: https://gerrit.wikimedia.org/r/#/c/71313/
[09:43:03] in the process of testing it, he had a few 'fake' scaps (scaps that were supposed to fail and thus abort)
[09:43:38] they generated log messages, though morebots was not around to log them in the SAL (because of https://bugzilla.wikimedia.org/show_bug.cgi?id=50485)
[09:43:54] uh. thanks for that info
[09:44:27] argh. I knew, I just didn't connect these two bits. :-/
[09:44:30] I'm not sure if the log entries siebrand is referring to also correspond to scaps that were engineered to deliberately abort, but they may have. Also, if there *was* an actual problem, it's possible that Tim's change fixed it.
[09:45:15] alright. thanks
[10:06:41] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[10:07:13] !log olivneh synchronized php-1.22wmf8/extensions/Campaigns 'I58a424c70: Restore extensions/Campaigns submodule; was accidentally removed by change I2cafcb595.'
[10:07:23] Logged the message, Master
[10:08:17] PROBLEM - Disk space on labstore3 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[10:17:16] * Adam_WMDE points at the above
[10:19:56] any ops around to take a look at labstore3 which seems to have taken down the majority of tool labs :/
[10:24:06] paravoid: apergos: labstore3 /srv is dead :D
[10:24:35] uh
[10:25:22] labs nfs cluster? oh joy
[10:26:12] respnds to ping
[10:26:12] New review: Ori.livneh; "OK." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404
[10:26:29] I'm on the host
[10:26:47] ssh in was normal
[10:28:50] Jul 1 10:05:38 labstore3 kernel: [23631.808592] XFS (dm-7): metadata I/O error: block 0x880160c00 ("xfs_trans_read_buf") error 5 numblks 8
[10:28:50] Jul 1 10:05:38 labstore3 kernel: [23631.818028] XFS (dm-7): xfs_do_force_shutdown(0x1) called from line 364 of file /build/buildd/linux-lts-quantal-3.5.0/fs/xfs/xfs_trans_buf.c. Return address = 0xffffffffa045ee8c
[10:28:52] and some other crap
[10:31:05] some filesystem error
[10:31:07] I guess
[10:31:10] maybe it needs a fsck
[10:31:26] Please umount the filesystem and rectify the problem(s) =)
[10:31:30] it's clearly a filesystem thing but I don't know which one
[10:32:02] Jul 1 10:16:31 10.0.0.43 puppet-agent[3390]: (/Stage[main]/Openstack::Project-nfs-storage-service/Service[manage-nfs-volumes]/ensure) ensure changed 'stopped' to 'running'
[10:32:02] bah
[10:32:05] there are a bazillion mount points
[10:33:25] apergos: there might be some logs in /var/lib/nfsmanager/manage-nfs-volumes.log
[10:33:33] that is apparently an upstart daemon that log there
[10:34:47] not so much, no
[10:36:38] deployment-prep is unreachable as well :D
[10:37:22] New patchset: Ori.livneh; "Puppetize Bugzilla" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404
[10:43:49] ok well ls -d /srv in fact fails
[10:44:08] so at this point I'm thinking, how bad can a reboot be?
[10:44:29] don't you have the /dev mounted on /srv ?
[10:44:47] if unmounted, maybe attempt to fsck it
[10:44:50] root@labstore3:/var/lib# ls -l /
[10:44:51] ls: cannot access /srv: Input/output error
[10:44:58] d????????? ? ? ? ? ? srv
[10:46:06] I guess I'll see if I can unnmount it first
[10:46:32] might want to stop the manage-nfs-volume service and puppet agent too
[10:48:46] xfs_check thinks the ting is mounted, umount thinks it isn't
[10:48:49] so that's a fail
[10:49:28] any other thoughts than a reboot?
[10:49:52] New patchset: Hashar; "role::cache get rid of unrecognized escape sequences" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71329
[10:50:38] apergos: no more idea sorry :(
[10:50:53] maybe it is unmounted and the kernel still has open file handles to it
[10:51:03] so the device is still busy
[10:52:17] very plausible
[10:52:22] anyways here goes the reboot
[10:53:13] !Log rebooting labstore3, /srv reporting errors, couldn't xfs_check it
[10:53:22] Logged the message, Master
[10:53:39] at any rate it won't be any more broken by the reboot than by sitting there
[10:53:45] I guess so :D
[10:53:50] I am heading out for lunch break
[10:53:55] not sure I have any added value
[10:53:56] ok see ya
[10:55:00] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100%
[10:55:40] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.86 ms
[10:55:57] well it's back but sure were a lot of whine son boot
[10:56:00] RECOVERY - Disk space on labstore3 is OK: DISK OK
[10:56:31] Adam_WMDE: want to poke around and look at some of the items you use, make sure things look intact?
[10:58:04] md/raid:md114: not clean -- starting background reconstruction
[10:58:06] apergos you are working on the tools project issues I heard? is it true? do you need some assistance with that?
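[Editor's note: the `d?????????` row in `ls -l /` above is the classic signature of a mountpoint whose filesystem has shut down underneath it — stat() on the path fails with EIO, which is exactly what XFS returns after forcing a shutdown on a metadata I/O error. A hypothetical probe for such wedged mountpoints; the function name and return values are made up for illustration:]

```python
import errno
import os


def probe_mountpoint(path):
    """Classify a mountpoint as 'ok', 'dead', or 'missing'.

    A sketch of the check done by hand in the channel: stat() on a
    mountpoint whose filesystem has shut down (as XFS does after a
    metadata I/O error) fails with EIO, which ls -l renders as the
    'd?????????' row seen for /srv. Hypothetical helper code.
    """
    try:
        os.stat(path)
    except OSError as e:
        if e.errno == errno.EIO:
            return "dead"      # filesystem shut down under a live mount
        if e.errno == errno.ENOENT:
            return "missing"   # path doesn't exist at all
        raise                   # anything else is unexpected here
    return "ok"
```

A monitoring check built on this could have flagged /srv as "dead" well before the nagios disk-space check timed out.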
[10:59:06] Jul 1 03:34:40 labstore3 kernel: [ 209.955607] XFS (dm-7): metadata I/O error: block 0x280c813f8 ("xfs_trans_read_buf") error 5 numblks 8
[10:59:20] interesting...
[10:59:35] lemme look at the most recent entries now
[10:59:44] seems like fs is borked a bit
[11:00:05] but TBH I have a bad feeling about gluster store as well
[11:00:17] it seems somewhat fucked as well, maybe it's something on higher level
[11:00:25] like the physical storage being offline
[11:00:42] Jul 1 10:55:27 labstore3 kernel: [ 26.306511] md/raid:md114: not clean -- starting background reconstruction
[11:00:43] is that the same reason why my project storage is read only?
[11:00:51] Nemo_bis yes
[11:00:55] Jul 1 10:55:27 labstore3 kernel: [ 26.307792] md/raid:md114: raid level 6 active with 12 out of 12 devices, algorithm 2
[11:01:29] is labstore3 gluster or what
[11:01:51] nfs
[11:02:11] so can't be my problem I suppose
[11:02:22] no
[11:03:40] root@labstore3:~# cat /proc/mdstat
[11:03:40] Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
[11:03:40] md114 : active raid6 md127[0] md116[11] md115[10] md125[9] md117[8] md118[7] md121[6] md120[5] md119[4] md122[3] md123[2] md124[1]
[11:03:40] 39058141440 blocks super 1.2 level 6, 128k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
[11:03:40] [=====>...............] resync = 25.5% (999676604/3905814144) finish=939.0min speed=51579K/sec
[11:08:18] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[11:09:02] * apergos watches the puppet run and hopes for the best
[11:11:58] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[11:24:20] apergos but even on resync the array must be accessible, are you sure this is a root of this?
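[Editor's note: the /proc/mdstat paste above reports resync progress inline; a small parser can pull out the percentage, block counts, and ETA. A hypothetical helper, not part of mdadm or any real tool:]

```python
import re


def parse_resync(line):
    """Extract progress from an mdstat resync status line.

    Parses lines like the one pasted in the channel:
      [=====>...............]  resync = 25.5% (999676604/3905814144) finish=939.0min speed=51579K/sec
    Returns (percent, done_blocks, total_blocks, finish_minutes),
    or None if the line carries no resync status. Hypothetical code.
    """
    m = re.search(
        r"resync\s*=\s*([\d.]+)%\s*\((\d+)/(\d+)\)\s*finish=([\d.]+)min",
        line,
    )
    if not m:
        return None
    pct, done, total, finish = m.groups()
    return float(pct), int(done), int(total), float(finish)
```

Run on the pasted line, this yields 25.5% with 999676604 of 3905814144 blocks done and a 939-minute ETA, consistent with the "2142 mins to resync" figure quoted after the second reboot reset progress.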
[11:24:33] !log Depooled old eqiad upload varnish servers [11:24:40] I mean, is this a reason why nfs is down [11:24:42] Logged the message, Master [11:25:01] no, I'm just reporting what I see [11:26:23] it took a while for the /time/20130701.1117/ etc mountpoints to show up [11:28:54] there are a bunch of these sorts of things oo: [11:28:56] too: [11:28:57] Jul 1 10:55:27 labstore3 mdadm[2315]: DeviceDisappeared event detected on md device /dev/md/nfs0 [11:29:17] Jul 1 10:55:27 labstore3 mdadm[2315]: DeviceDisappeared event detected on md device /dev/md/pair00, component device Wrong-Level [11:29:31] for 00 through 11 [11:33:26] uh so who just rebooted it again? [11:33:38] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [11:33:42] rebooted what o.O [11:33:46] that ^^ [11:34:38] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.89 ms [11:35:16] someone from the mgmt console [11:36:02] apergos: I read something about a broken file system: Is this related to http://en.wikipedia.beta.wmflabs.org/ ? [11:36:09] it will now take 2142 mins to resync [11:36:26] probably is, yep [11:36:37] apergos: will do :) [11:36:51] Adam_WMDE: actually don't bother [11:36:54] still broken [11:36:58] :( okay [11:37:14] they are coming back up [11:37:37] but not by me [11:39:45] so whoever was kind enough to fix labstore3 (because it does seem like the mounts are showing up now after the second reboot), could they kindly log what they did? [11:43:14] also: Jul 1 11:41:48 labstore3 kernel: [ 463.910819] XFS (dm-7): metadata I/O error: block 0x9800f7100 ("xfs_trans_read_buf") error 5 numblks 8 [11:46:09] Yes, I am. [11:46:12] ah [11:46:19] yeah so I had seen: [11:46:26] Jul 1 11:41:48 labstore3 kernel: [ 463.910819] XFS (dm-7): metadata I/O error: block 0x9800f7100 ("xfs_trans_read_buf") error 5 numblks 8 [11:46:28] I still trying to figure out what happened. 
[11:46:33] stuff like that and then a bunch of whines from xfs earlier [11:46:40] but I see we are about do have this again [11:46:46] *abuot to [11:46:49] grr typos [11:47:07] so when I got on, since you missed the crollback here, [11:47:19] attempts to ls -d /srv gave [11:47:44] an error, don't remember precisely [11:47:49] but ls -l / gave: [11:47:52] (01:44:50 μμ) apergos: root@labstore3:/var/lib# ls -l / [11:47:52] (01:44:50 μμ) apergos: ls: cannot access /srv: Input/output error [11:47:52] (01:44:56 μμ) apergos: d????????? ? ? ? ? ? srv [11:48:16] umount appeard to succeed but nfs_check said the filesystem was mounted so at that point I rebooted [11:48:41] Hm. That's not the same problem. [11:48:43] /proc/mdstat said it would take 942 minutes to resync, something like that [11:48:45] How odd. [11:49:13] Yeah, I see one of the raid0 is reconstructing now. [11:49:32] yes, but now it wants some much longer time (maybe it lies, hard to know) md114 right? [11:49:43] * Coren nods. [11:49:54] anyways the srv mount points did not come back, only the /time/... ones after some delay [11:50:07] You did the start-nfs dance, right? [11:50:08] while I was looking to see what I could possibly kick, the system rebooted [11:50:15] by a good samaritan :-P :-D [11:50:34] no, I assumed that puppet or automagically on restart would take car of that [11:50:36] no? [11:51:10] Ah. No. As a paranoid measure, the NFS service doesn't start automatically (in case of borked filesystems) [11:51:24] On reboot, one needs to "start-nfs" [11:51:25] ah hah [11:51:32] ok well I sure had no idea aobut that :-D :-D [11:51:34] the other thing I wanted to point out Coren was that access to s1 - s7 don't work on -login since reboot yesterday, is there any guide how to do that? can you document it? [11:51:47] petan: That shouldn't be related. 
[11:52:01] yes I know but it's another problem that needs to be solved :D [11:52:09] I thought I remembered checking that nfs was running too [11:52:16] I am telling you before I forget [11:52:31] ok so [11:52:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:52:45] apergos: start-nfs does more than just start the NFS service; it assembles the raid-of-raids and starts some of the LDAP-related stuff. [11:52:51] ah I see [11:52:56] apergos: What you are telling me and Ryan is "document this crap!" [11:52:56] that would be it then [11:53:15] erm yeah that would be nice [11:53:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [11:53:29] I did look on wikitech and didn't find something quickly [11:53:58] so I see the /time/blahblah mountpoints are not back; how crucial are those? I guess those have some sort of snpahosts [11:54:00] snapshots [11:54:52] (ah also when it's all back together maybe !log what you did, otherwise people will think I fixed it) [11:56:57] Jul 1 11:54:19 labstore3 kernel: [ 1212.926199] XFS (dm-7): metadata I/O error: block 0x9801ce100 ("xfs_trans_read_buf") error 5 numblks 8 [11:57:14] 4 of those, different blocks [12:00:36] New review: QChris; "> What happens when there is already a repo at wikimedia/puppet-cdh4?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71248 [12:01:30] Jul 1 12:00:45 labstore3 kernel: [ 1599.132702] XFS (dm-7): Corruption detected. Unmount and run xfs_repair [12:01:33] and there is that [12:01:37] apergos: They're mount-on-demand snapshots. [12:01:43] Jul 1 12:00:45 labstore3 kernel: [ 1599.139310] nfsd: non-standard errno: -117 [12:01:51] which means that device is now dead again I expect [12:02:02] see kern.log [12:02:11] Coren: ^^ [12:02:15] I see [12:02:17] ffs. 
[12:02:36] this is what I saw before I rebooted; I had the dim hope that this would clear up whatever xfs issues [12:02:57] * Coren should have gone ext4. [12:03:22] any great ideas? [12:03:45] Jul 1 12:03:17 labstore3 kernel: [ 1750.093693] nfsd: last server has exited, flushing export cache [12:03:49] and we're back to that too now [12:04:05] No, that was me. [12:04:14] ah :-) [12:04:16] I'm trying to find /which/ filesystem is borked. [12:04:43] in the previous case ls -d /srv failed; right now that command seems ok [12:05:01] re [12:05:11] hi hashar [12:05:19] you didn't miss the excitement, we saved some for you [12:05:59] is NFS on labstore3 still dead ? :( [12:06:25] yes [12:06:27] hashar: It's back up now, but it's probably a matter of time before it booms again: some xfs metadata is broke. [12:06:29] xfs issues still [12:06:51] maybe a kernel bug so ? [12:07:13] it's dm-7 both times so ... not sure if it's kernel [12:07:40] apergos: Ima umount it and xfs_repair it now. NFS will hang for a bit. [12:07:55] ok, maybe your umount-fu is better than mine [12:08:00] !log Depooled old eqiad mobile cache servers [12:08:07] I'm in /root so not using the mounts [12:08:08] Logged the message, Master [12:08:33] mark: hello! I have migrated beta cluster to use your varnish text cache instance :-] [12:08:40] \o/ [12:08:49] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [12:09:18] mark: works fine for now but I could use your help to enable purging there. The VCL restricts the purge to 127.0.0.1 , I guess they are sent by a local HTCP handler which we do not use on beta :/ [12:09:36] xfs_repair in progress. [12:09:48] you should use it in beta [12:10:26] so labs currently doesn't support multicast, but we can probably send unicast packets to it?
[12:10:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:52] mark: I guess so [12:11:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [12:11:21] I am also wondering how we invalidate urls for bits.wm.o / en.wm.o [12:11:23] apergos: I wonder how long an xfs_repair over a 30T filesystem takes. [12:11:35] is the whole thing one giant filesystem? [12:11:39] cause if it is... [12:11:39] 3 days :P [12:11:45] yes it is [12:11:47] apergos: Thinly provisioned, but yes. [12:12:04] * apergos goes to get popcorn :-P [12:12:09] :D [12:12:19] * Coren grumbles. Should have gone ext4. [12:12:29] I still think you should have gone btrfs :P [12:12:35] subvolumes ftw... [12:12:55] online fsck, grow, shrink... <3 [12:13:06] I'm at Phase 3 [12:13:07] 3 years from now (arbitrary number) [12:13:19] petan: Those are features of ext4 and xfs too. [12:13:30] online shrink? where is it? [12:13:35] I don't believe that [12:13:53] Oh, shrink? I don't think I've done that, so you may be right. [12:13:59] Ooo. xfs_repair all done. [12:14:02] afaik online fsck can damage FS, on ext [12:15:47] New patchset: Mark Bergsma; "Migrate esams upload backends to use new eqiad servers as upstream" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71334 [12:16:06] did it have anything to say? [12:16:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71334 [12:17:37] ohhh btrfs has dynamic inode allocation [12:17:50] apergos: I'm reading it now. [12:19:06] apergos: From what I can read, it doesn't so much check metadata as just preemptively rebuild it.
[12:19:28] Jul 1 12:19:16 labstore3 kernel: [ 2708.145124] XFS (dm-7): metadata I/O error: block 0x341a8 ("xfs_trans_read_buf") error 5 numblks 8 [12:19:29] that's new [12:19:44] apergos: That happened during the repair, afaict [12:20:04] Jul 1 12:19:40 labstore3 kernel: [ 2731.561729] XFS (dm-7): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5. [12:20:04] root@labstore3:~# date [12:20:04] Mon Jul 1 12:19:57 UTC 2013 [12:20:05] nope [12:20:13] Ah, no it didn't [12:20:27] unfortunately [12:20:38] you ran it over /srv I guess? [12:21:04] * Coren nods. [12:21:14] sh*tballs [12:21:38] It's actually *working* right now as far as I can tell. [12:24:41] apergos: Ah. [12:25:02] apergos: [12:25:05] That's clearly a 3.8 kernel you are seeing this on. The readahead [12:25:05] has returned zeros rather than data which implies that the readahead [12:25:05] has been cancelled. I can see a potential bug in the 3.8 code where [12:25:05] the above verifier is called even on an IO error. Hence a readahead [12:25:05] will trigger a corruption error like above, even though a failed [12:25:06] readahead is supposed to be silent. [12:25:06] [12:25:06] A failed readahead is then supposed to be retried by a blocking read [12:25:07] when the metadata is actually read, and this means that failed [12:25:08] readahead can be ignored. [12:25:41] New patchset: Mark Bergsma; "Set esams backend director to chash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71335 [12:26:23] ok but will it be? i.e. the last couple of rounds we have gotten from xfs, this: (dm-7): Corruption detected. Unmount and run xfs_repair [12:26:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71335 [12:27:34] apergos: At this point, I'm hoping that the corruption was caused by that failed mirror and that the xfs_repair will have done the trick. [12:27:53] Jul 1 12:26:43 labstore3 kernel: [ 3154.400477] XFS (dm-7): Corruption detected.
Unmount and run xfs_repair [12:28:00] hope srings eternal... [12:28:04] springs, too [12:28:22] * Coren starts to panic, just a little, now. [12:28:49] nice stacktrace but not much use to me I admit [12:29:09] Jul 1 12:26:43 labstore3 kernel: [ 3154.374652] XFS (dm-7): corrupt dinode 82028, (btree extents). [12:29:30] that's the other whine I see that might possibly lead us somewhere [12:30:45] maybe not [12:30:55] Jul 1 12:00:45 labstore3 kernel: [ 1599.106898] XFS (dm-7): corrupt dinode 737293, (btree extents). [12:30:55] Jul 1 12:26:43 labstore3 kernel: [ 3154.374652] XFS (dm-7): corrupt dinode 82028, (btree extents). [12:31:13] two different ones. can those be examined with xfs_check or something? [12:31:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:33:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [12:33:44] mark: so as I understand it the Apaches are sending PURGE on a multicast group. Varnish caches have a varnishhtcpd daemon listening to the group and happily sending the purge to its local varnish instance. [12:34:12] (ah also this is a 3.5 kernel, not 3.8, in case that makes a difference) [12:34:23] mark: so seems like if mediawiki send a purge request is sent for a bits.wm.o url , it is sent to any varnish cache regardless of the role. Or am I missing some filtering/ routing process? [12:34:38] apergos: It shouldn't, the bit about 3.8 was unrelated. [12:34:38] bits doesn't have purges [12:35:53] mark: what about upload ? :-) [12:36:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:38] apergos: Most of the snapshots are gone?! [12:37:10] ugh [12:37:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [12:37:27] i promise I did not remove them. 
nor even succeed in running an xfs_check (let alone repair) [12:37:39] Yeah, didn't say it was you. :-) [12:37:53] there were some log entries after the first reboot to the effect: [12:38:24] Jul 1 11:34:28 labstore3 mdadm[2369]: DeviceDisappeared event detected on md device /dev/md/pair10, component device Wrong-Level [12:38:31] a number of those, dunno if they are related [12:38:32] but also [12:38:47] Jul 1 11:17:16 labstore3 snapshot: Logical volume "20130701.1017" successfully removed [12:38:50] some of these [12:39:07] That would be normal "old snapshots timing out" [12:39:16] ok [12:39:37] beyond that I have nothing helpful [12:39:37] the filesystem is borked. [12:39:40] ugh [12:39:56] It's kinda-sorta limping along now, but not much more. [12:40:09] hashar: configure mediawiki to send udp messages, not http purge requests [12:40:24] and yeah, mediawiki doesn't have any notion of roles currently [12:40:35] we used to have one in 1.20 [12:40:50] And whatever it is, it's not raid-related; there is one mirror being rebuilt, but the raid6 on top didn't even blink. [12:41:04] with $wgHTCPMulticastRouting that took a URL regex as key and the value was host => port [12:41:17] no, I expect the raid rebuild would happen in the background, with maybe a small impact on performance but that's it [12:41:26] mark: so if we had a multicast address per varnish role, that could be done in MediaWiki configuration :) [12:41:47] yes [12:41:50] we need that [12:41:53] why was it removed? [12:42:05] the code is still there [12:42:09] maybe it never got used [12:42:33] apergos: The filesystem is thinly provisioned. I'm considering creating an ext4, moving all the stuff to it, and trashing the xfs [12:43:01] well, making a copy sounds like a fine idea regardless [12:43:04] $wgHTCPMulticastAddress has been deprecated in favor of $wgHTCPMulticastRouting which has the URL regex [12:43:05] ext4 is not quite as performant, but it has a much saner fail mode.
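The $wgHTCPMulticastRouting shape hashar describes (an array keyed by URL regex, each value giving an HTCP host and port) looks roughly like this in LocalSettings terms. This is a sketch only: the regex, address, and port below are hypothetical illustrations, not taken from the real beta configuration; since labs has no multicast, a unicast address of the single varnish box per role could be used, as mark suggests.

```php
<?php
// Hypothetical per-role purge routing sketch. Each key is a URL regex;
// purges for matching URLs are HTCP'd to the given host/port.
$wgHTCPMulticastRouting = array(
	'|^https?://upload\.beta\.wmflabs\.org|' => array(
		'host' => '10.4.0.1', // hypothetical unicast address of the upload varnish
		'port' => 4827,       // HTCP port
	),
);
```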
[12:43:26] 375g that shouldn't take too long [12:43:35] do you have a place to put a copy? [12:43:37] mark: when using the Address parameter, Setup.php will create an entry for any url in the wgHTCPMulticastRouting configuration. [12:44:32] so can you set a different address per url? [12:44:50] yup [12:44:59] mark: example at http://www.mediawiki.org/wiki/Manual:$wgHTCPMulticastRouting [12:45:20] why did no one tell us that was there :D [12:45:35] so yeah, try if unicast addresses work there? [12:45:41] it probably only accepts a single address eh? [12:45:46] mark: the change is from June 2012 by Roan https://gerrit.wikimedia.org/r/#/c/4486/ [12:45:53] ah good [12:45:53] mark: never tested though [12:46:17] apergos: Same volume group, I'll create a new thin volume. [12:46:19] you could bring the subject in the eng/ops weekly meeting I guess [12:46:31] so is this deployed? [12:46:40] yup [12:46:45] been there for a year :-] [12:46:58] but that code path has never been run / used though [12:47:24] ok [12:47:45] so potentially for beta, I could match the URL to point the purge requests to the proper varnish [12:50:39] yes [12:50:41] if there is only one [12:52:19] * Coren is not having fun. [12:52:51] simplewiki-0fbe64e1: 0.7736 12.0M [squid] SquidPurgeClient (10.4.0.51): unexpected status code: 403 Requested target domain not allowed. [12:52:51] :D [12:55:35] dumped the two inodes via xfs_db, scrying the output now [12:56:14] apergos: Ima doing an rsync of the xfs contents over to the ext4 [12:56:20] great [12:56:38] * Coren is immensely pleased we have < 50% usage atm. :-) [12:56:45] no kidding [12:59:26] Does looking into the inodes' entrails reveal our fate? [12:59:46] not yet [13:01:35] I'm just trying to figure out if the inode is bad or if the whine is bogus [13:06:51] rsync: readlink_stat("/srv/deployment-prep/project/apache/common-local/.git/refs/tags/jenkins_build_289") failed: Input/output error (5) [13:07:10] I knew it. It's deployment-prep's fault!
:-P [13:07:31] crappppola [13:08:27] rsync: readlink_stat("/srv/deployment-prep/project/apache/common-local/php-master/extensions/.git/modules/AWS/logs/refs/remotes") failed: Input/output error (5) [13:08:29] I can of course read that file [13:08:45] With matching dmesg: [13:08:46] Jul 1 13:08:05 labstore3 kernel: [ 5632.628497] XFS (dm-7): metadata I/O error: block 0x1600 ("xfs_trans_read_buf") error 5 numblks 16 [13:08:50] booo [13:09:14] try doing them in smaller batches I guess [13:09:31] Hm? Oh, no, the rsync is continuing merrily. [13:09:42] ok but did it pick up that file? [13:10:04] if it did then great, if it skipped it we will have to go back and get these... maybe there won't be so many [13:10:34] Skipped. [13:10:46] New patchset: Hashar; "beta: debug log groups for mwsearch and squid" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71340 [13:10:48] 3 files now. [13:10:53] can feed list to rsync separately later [13:10:58] * Coren nods. [13:11:00] or cp -a or whatever [13:11:37] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71340 [13:11:52] * Coren puts the files rsync fails to copy in ~/missing [13:12:59] ... why can I ls and cat them but rsync can't? [13:15:03] it probably can if you ask it again [13:15:15] if this is a transient error [13:17:46] Heisenbugs. The most "fun" you can have with a computer! [13:21:05] ctime is 12:26:43 for the first corrupt dinode [13:21:07] Jul 1 12:26:43 labstore3 kernel: [ 3154.374652] XFS (dm-7): corrupt dinode 82028, (btree extents). 
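Coren is saving the paths rsync fails on into ~/missing; the same list can be built mechanically from rsync's stderr and fed back later with --files-from. A sketch using the two error lines quoted above as sample input (the /tmp paths are arbitrary):

```shell
# The two stderr lines rsync emitted above, saved as sample input
cat > /tmp/rsync.err <<'EOF'
rsync: readlink_stat("/srv/deployment-prep/project/apache/common-local/.git/refs/tags/jenkins_build_289") failed: Input/output error (5)
rsync: readlink_stat("/srv/deployment-prep/project/apache/common-local/php-master/extensions/.git/modules/AWS/logs/refs/remotes") failed: Input/output error (5)
EOF
# Keep just the failed paths, one per line, for a later retry pass
sed -n 's/^rsync: readlink_stat("\([^"]*\)").*/\1/p' /tmp/rsync.err > /tmp/missing.list
wc -l < /tmp/missing.list
```

Once the transient I/O errors clear, a retry would then look something like `rsync -a --files-from=/tmp/missing.list / /dest` (destination path hypothetical).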
[13:21:13] so that's fishy all right [13:21:24] mtime the same [13:21:58] core.size is 2514919466 [13:22:02] wonder what the [13:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:47] my guess is these probably really are bad for whatever reason [13:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [13:27:04] !log reedy synchronized wmf-config/InitialiseSettings.php [13:27:12] Logged the message, Master [13:27:27] New patchset: Reedy; "Add Meta as an import source on cs.wikipedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71344 [13:27:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:28:35] New patchset: Reedy; "Add Meta as an import source on cs.wikipedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71344 [13:29:08] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71344 [13:31:46] gah only 19g copied, it's killing me [13:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [13:33:53] Coren: bear in mind ext4 is 16tb limit, we have 32 on /srv [13:35:05] apergos: Heh. That's ext3. :-) [13:35:30] ext4 is an exabyte's worth. [13:35:38] ah? *whew* [13:35:56] yay for that [13:36:03] why are you moving to ext4? [13:37:10] mark: As a panicky move due to random xfs corruption in the current setup. [13:37:30] you suspect ext4 won't have corruption? [13:37:59] mark: I suspect nothing atm, but regardless of how things develop I'd rather have two copies on different filesystems. :-) [13:38:00] i.e. it's not related to the underlying stack? [13:38:06] right [13:39:44] mark: It doesn't /look/ like it's the underlying stack, though that's always a little hard to pin down.
I expect a comparison will help. [13:40:12] ok [13:40:18] New patchset: Hashar; "beta: make use of wgHTCPMulticastRouting" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71345 [13:40:33] New patchset: Hashar; "beta: make use of wgHTCPMulticastRouting" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71345 [13:40:34] Reedy: I am doing some beta related changes :-] [13:40:50] mark: What I /do/ see is kernel oopses in the xfs vfs; so that's the primary culprit. [13:41:22] why are you running 3.5? [13:41:29] these are less tested/stable [13:41:48] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71345 [13:43:07] New patchset: Mark Bergsma; "Add a custom error page to the upload cache cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71346 [13:44:10] New patchset: Mark Bergsma; "Add a custom error page to the upload cache cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71346 [13:49:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71346 [13:49:29] New patchset: Hashar; "beta: $wgSquidServers is no more needed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71348 [13:50:01] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71348 [13:50:15] paravoid: Needed for LVM thin provision. [13:50:31] and why do you need that? [13:50:46] paravoid: Which is, in turn, needed for snapshots. [13:51:06] have you considered that the corruption bug might be somewhere there instead of XFS? :) [13:51:24] paravoid: But I'm 99.7% sure that's not the issue; I've been running that very kernel in much higher volume applications for a while. Only with ext4. 
[13:51:43] paravoid: With snapshots [13:52:16] i think it's more likely to be LVM related than xfs [13:52:16] paravoid: Doubtless it may be an interaction /between/ snapshots, the 3.5 kernel, the raid, and xfs. [13:53:55] mark: Your confidence in XFS is admirable. We shall know better one way or the other eventually. [13:54:17] where did I state confidence in XFS? [13:54:39] I think I only expressed relative confidence compared to LVS [13:54:46] er LVM [13:54:48] Ah. :-) [13:55:14] apergos: after en.wp.beta seems to be back up and working: Is this also related to it? "An error has occurred while searching: The search backend returned an error: Error opening index." [13:55:16] we've had issues with the combination LVM snapshots (not thin) and XFS before [13:55:37] what's the controller on that box? [13:55:39] I have no idea about the search [13:55:54] paravoid: Two different PERCS [13:56:03] which percs? [13:56:08] I would expect things generally to work but maybe be a bit sluggish [13:56:26] apergos: only happens when I search for something with File:-prefix [13:56:36] H700 & H800, okay [13:56:43] paravoid: A H700 and a H800 [13:56:44] on en beta? [13:56:53] http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ASearch&profile=default&search=File%3Aca&fulltext=Search [13:56:54] New review: Manybubbles; "On second perusal I can't claim about half of this. Most of the scripts came from upstream's attemp..." [operations/debs/jmxtrans] (debian) - https://gerrit.wikimedia.org/r/71079 [13:57:05] great [13:57:57] mark: LVM is just block mapping; it has few moving parts and I've used that setup before with no issues. But perhaps the interaction between them and xfs has issues.
[13:58:20] lvm is far from "just block mapping" [13:58:27] and it has lots of moving parts ;-) [13:58:29] especially with snapshots (thin or not) [13:58:36] moving blocks especially [13:58:45] I've seen lots of LVM bugs in the past [13:58:50] ranging from corruption to deadlocks [13:58:51] paravoid: Old-style snapshots were an abomination. :-) [13:58:52] New patchset: Mark Bergsma; "Add a custom error page to the bits cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71349 [13:59:19] few*er* moving parts than a filesystem. I too was only expressing relative complexity. :-) [13:59:39] "moving blocks"? LVM shouldn't be moving blocks around. [13:59:43] http://bugs.debian.org/659762 [13:59:51] that's one that's affecting us in the Debian infrastructure right now [14:00:06] "Another “me too” here. Nightly LVM snapshots cause all I/O to the snapshotted LV to hang. None of the dmsetup commands bring them back. The only way to bring things back is to power cycle the server, corrupting data." [14:00:11] that's basically the description [14:00:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71349 [14:00:58] reports say that thin snapshots are not affected, this is just an argument for how LVM can be fragile [14:01:13] even more so on software raid setups [14:01:45] but anyway [14:02:00] two different fs for redundancy doesn't seem like a bad idea to me [14:02:07] Indeed. :-) [14:02:22] as long as you have a contingency plan for an lvm thin snapshot corruption bug [14:02:34] paravoid: Avoids the all-the-eggs-in-one-basket issue. [14:03:13] paravoid: Thin snapshots do not redirect blocks away from the snapshotted system like the old-style did. You could lose snapshots, but normally not the filesystem. [14:03:32] At any rate, I'm not going to be snapshotting the ext4 at all until I know where the issue lives.
[14:03:40] (for sure, that is) [14:03:53] mark: I have enabled on beta the HTCP multicast routing https://gerrit.wikimedia.org/r/#/c/71345/2/wmf-config/squid-labs.php,unified :-D [14:04:04] still have to handle upload though [14:04:19] and mobile [14:04:24] ah yeah mobile [14:04:26] multicast? on labs? [14:04:33] does it work? [14:04:36] no [14:04:42] we only have one varnish box per role [14:04:48] so it is actually sending unicast udp packets [14:04:54] oh, okay [14:05:56] I don't even see purge requests sent for mobile urls hehe [14:06:18] se4598, let me ask a silly question [14:06:25] hashar: they aren't [14:06:30] it's the same purges as text [14:06:32] did this search for File:... work, say, yesterday? [14:06:34] but mobile servers use those [14:06:46] apergos: don't know :) [14:06:51] ahahaha [14:06:52] ok well [14:07:02] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [14:08:14] we may not have that index [14:08:17] * apergos looks at hashar [14:08:45] New patchset: Mark Bergsma; "Add a custom error page to the mobile cache cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71350 [14:08:50] apergos: yum ? [14:09:20] search index for searches in the File: namespace?
[14:09:23] en.beta [14:09:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71350 [14:10:17] or de.beta, same error [14:10:41] other randomly selected namespaces appear to be ok [14:11:16] I have no idea [14:11:19] maybe it is not indexed [14:12:29] there's not a lot in there (en beta) but I would still expect it not to generate an error [14:12:35] http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3APrefixIndex&prefix=&namespace=6&hideredirects=1 [14:21:38] New patchset: Mark Bergsma; "Switch cp3003 frontend to use chash weight 100" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71351 [14:22:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71351 [14:27:15] If I log in to tin with agent forwarding, "dsh -m terbium echo test" outputs 'test'. But "sudo -u l10nupdate dsh -m terbium echo test" outputs nothing (and $? is 1). Is it supposed to do that? If not, I think that's what's causing bug 50433. [14:27:55] se4598, can you bugzilla that please? and thanks for reporting [14:28:14] will do [14:33:15] New patchset: Mark Bergsma; "Switch cp3004 frontend to use chash weight 100 as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71352 [14:33:28] All of that said, I'm thinking that the current problem might not be an xfs problem at all in the first place. The bi-weekly NFS wedge may well have caused the filesystem to be wedged when it was rebooted, causing "real" corruption in the metadata. [14:33:42] Or at least, I'm not excluding the possibility. [14:39:35] apergos: bug 50498 and a new one about missing image scalers at bug 50499 [14:39:51] missing image scalers?
uh oh shouldn't be [14:40:00] Error retrieving thumbnail from scaling server: couldn't connect to host [14:43:02] I just had scaling work for me [14:46:33] I've duplicated your error though, strange [14:47:09] apergos: What about preview og http://en.wikipedia.beta.wmflabs.org/wiki/File:Test_image_2013-05-13_08-43.jpg [14:47:14] *of [14:50:05] Somebody set us up the bomb on fenari? [14:50:15] huh? [14:50:24] PROBLEM - Host labstore3 is DOWN: CRITICAL - Plugin timed out after 15 seconds [14:50:30] Oh noes. [14:50:33] It's labstore3 [14:50:40] Just went boom? [14:50:47] ugh [14:50:56] I am off, will be back later tonight for some conf calls [14:51:02] and that would kill all my lab instances [14:51:09] no debugging those for now [14:51:10] Goes check console. [14:51:14] * apergos closes all their tabs [14:51:27] apergos: They're not dead, they're just stunned. [14:51:48] if you hadn't nailed them to the perch they'd be pushing up the saidies [14:51:51] *daisies [14:52:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71352 [14:53:17] sure seems down doesn't it [14:53:30] <^demon> Anyone going to rage if I restart gerrit? Got some stuff that got stuck in caches during the upgrade friday, and I need to bring it down for a minute or two to fully flush everything. [14:54:01] no, but can you... [14:54:07] * apergos smiles sweetly [14:54:19] ^demon everybody is going to rage, but I hereby give you my approval to do that :3 [14:54:24] http://www.mediawiki.org/wiki/Git/New_repositories/Requests the one at the end for my gsoc person [14:54:39] <^demon> apergos: Was going to run through that list after I do this. [14:54:40] sometime day? :-) :-) [14:54:45] ah thanks so much [14:54:51] *sometime today [14:54:54] PROBLEM - Puppet freshness on wtp1015 is CRITICAL: No successful Puppet run in the last 10 hours [14:54:54] typing fail [14:55:18] apergos: The box just went and faceplanted without so much as a warning. 
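One plausible mechanism for the dsh-under-sudo failure se4598 described above (bug 50433) is that sudo's env_reset scrubs SSH_AUTH_SOCK, so the ssh that dsh spawns no longer sees the forwarded agent and fails silently. This is a guess, not something confirmed in the channel; the sketch below uses `env -i` to stand in for sudo's environment scrubbing:

```shell
# A child shell normally inherits the agent socket variable
SSH_AUTH_SOCK=/tmp/agent.sock sh -c 'echo "${SSH_AUTH_SOCK:-unset}"'
# env -i mimics sudo's env_reset: the variable is gone, so key auth via the
# forwarded agent would fail and dsh would exit non-zero with no output
SSH_AUTH_SOCK=/tmp/agent.sock env -i sh -c 'echo "${SSH_AUTH_SOCK:-unset}"'
```

If that is the cause, either sudo's env_keep would need to preserve SSH_AUTH_SOCK, or the l10nupdate user needs its own credentials.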
awesome [14:55:26] apergos: rebooting now. [14:55:37] ok [14:55:38] * Coren is having *lots* of fun today, after all. [14:55:43] me too [14:55:55] it's been really productive. (sarcasm emoticon here) [14:56:44] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.74 ms [14:57:06] gerrit seems down... [14:57:17] odder: yes, it's all ^demon [14:57:49] <^demon> !log gerrit: brought offline, removed /var/lib/gerrit2/review_site/cache/*, brought back up [14:57:57] Jul 1 14:49:04 labstore3 kernel: [11681.800368] general protection fault: 0000 [#1] SMP [14:57:57] Jul 1 14:49:04 labstore3 kernel: [11681.805331] CPU 5 [14:57:58] Logged the message, Master [14:58:01] Fun! [14:59:20] At least the clients deal well with the NFS going out and back. [14:59:56] Jul 1 14:49:04 labstore3 kernel: [11681.800368] general protection fault: 0000 [#1] SMP [14:59:56] Jul 1 14:49:04 labstore3 kernel: [11681.805331] CPU 5 [14:59:59] ah you already saw it [15:00:01] yep [15:00:38] 147gb [15:00:44] keep going little rsync, you can do it [15:01:35] It's back at work. [15:01:40] chug chug chug [15:01:53] <^demon> Now I've got an image of rsync trying to get up the hill... [15:01:54] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours [15:01:58] <^demon> "I think I can I think I can" [15:02:41] New review: Anomie; "Change seems sane to me."
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/70804 [15:04:47] * apergos tail -fs the kernel log and leaves the tab open [15:04:49] stilly thing [15:04:56] *silly [15:08:49] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [15:24:44] New patchset: Mark Bergsma; "Support different backends for 1st/2nd mobile tier" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71356 [15:37:05] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [15:38:56] Jul 1 15:35:51 labstore3 kernel: [ 2379.704401] general protection fault: 0000 [#1] SMP [15:38:56] Jul 1 15:35:51 labstore3 kernel: [ 2379.709369] CPU 4 [15:39:12] Coren: ^^ [15:39:34] that's 40 minutes after the reboot.. meh [15:39:51] * Coren cries. [15:39:57] me too [15:40:21] there's call trace and other junk, don't imagine you want it for anything [15:42:53] Not useful. We know the xfs driver is getting ill and causing the boom. [15:43:09] I'm leaving NFS down while I finish the rsync [15:43:20] the xfs driver isn't getting ill, it's complaining (as it should) that it found a corruption [15:43:35] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [15:44:36] New review: Anomie; "I see a few other static references to bits.wikimedia.org around the files. Besides the ones flagged..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70322 [15:44:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71356 [15:44:54] paravoid: By definition, anything that causes the kernel to crash is a bug. [15:45:36] Does an xfs_repair, a manual mount readonly, and an rsync. [15:45:47] that's not exactly true, but yes the kernel shouldn't crash. [15:45:50] in this case [15:46:29] e.g. echo c > /proc/sysrq-trigger causes a panic, this isn't a bug :) [15:46:54] paravoid: Technically, that's no more a crash than 'halt' is.
:-) [15:47:33] anyway, are you sure the panic is xfs related? [15:47:59] paravoid: That's where the stack trace ends. [15:48:29] a stack trace isn't necessarily a crash [15:48:47] Please capture the filesystem metadata with xfs_metadump and [15:48:47] report it to xfs@oss.sgi.com. [15:49:27] if the corruption is caused by an underlying layer, like the hardware or a block layer like LVM, that's not a problem caused by XFS [15:49:32] Preceded by "This is a bug." [15:49:41] oh then it must be true [15:49:46] :-) [15:49:55] (he was being sarcastic) [15:50:10] or, say, a bitflip due to a memory error [15:50:32] XFS is just a likely layer to find that corruption [15:50:39] Well, arguably, a journaled file system should be able to not become inconsistent on block layer failure but yeah, that doesn't isolate the problem, just tells us where it currently manifests. [15:50:53] what are you talking about? :) [15:51:24] if the filesystem told the underlying layer to write 1001 and it wrote 1000 then there's nothing a journal can do to help you [15:51:40] (or it was corrupted in memory) [15:51:42] or if the underlying layer simply did nothing at all except say "ok!" [15:52:34] At this point, I doubt those hypotheses are useful. [15:53:08] if you don't get what we're talking about, then indeed, they are not useful [15:53:36] all we're trying to say is that you're drawing a conclusion based on incomplete evidence [15:53:55] And I'm trying to tell you that I have yet to draw a conclusion. [15:54:07] ... and the box is dying again. [15:54:31] dying how? [15:54:32] New patchset: Odder; "(bug 50377) Enable 'autopatrolled' group on hewikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71362 [15:54:45] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:40] paravoid: kernel oops while doing xfs_repair [15:56:01] backtrace? [15:56:06] Whatever the corruption was is clearly getting worse.
[15:56:20] paravoid: It crashed hard before I could get it. [15:56:36] I have a ~2h old snapshot I'll use. [15:56:59] did you have the fs mounted when you ran xfs_repair? [15:57:07] paravoid: Of course not. [15:57:21] so, the kernel oopsed when you ran a userspace issue? [15:57:23] er, userspace program? [15:57:29] and you still think it's XFS-related? :) [15:57:44] (xfs_repair reads/writes XFS in userspace) [15:57:45] paravoid: No, when I tried mounting the repaired filesystem. [15:58:05] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [15:58:14] "while doing" was obviously infelicitous. "while doing the repair-remount-read-only" [15:58:36] okay [15:59:07] !log reseating disk 3 db44 [15:59:16] Logged the message, Master [15:59:35] tails syslog [16:00:40] Yeesh. Just trying to read the filesystem floods syslog with oopses now. [16:01:49] 20130701.1317 snapshot seems okay-ish [16:01:55] doesn't oops. [16:02:00] these are not oopses [16:02:06] backtrace != oops [16:02:21] backtraces. [16:02:24] Sorry. [16:02:26] it just tells you it found corruption [16:02:33] and when it does, it prints a backtrace [16:02:49] Oh. xfs-generated backtraces. [16:02:54] :) [16:03:11] rsync working fine on the snapshot. [16:04:18] Snapshots are at block device level? [16:04:46] scfc_de: Yeah; it asks the fs layer to stabilize, then marks the current allocation slice. [16:05:20] scfc_de: New modified blocks get allocated to new thin slices instead, so the older ones are just never touched. [16:06:28] Yep, I use that with LVM on my laptop. Did anything memorable happen since 13:17? [16:06:58] scfc_de: Lots of people being annoyed at NFS. I doubt any real work took place. :-) [16:07:35] :-) No, I meant reboots and other stuff that could have corrupted the FS. [16:07:51] scfc_de: At least two hard crashes of the kernel.
:-) [16:09:00] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [16:09:25] in meeting for a bit [16:09:33] Coren: That would be a good explanation (for corruption afterwards; obviously not for why the kernel crashed in the first place). [16:10:02] nah, it wouldn't be [16:10:14] this isn't ext2 :) [16:10:24] Well, it *shouldn't* be. [16:11:11] a corruption is likely because of either a kernel bug (in either of the filesystem/vfs/lvm/md layers) or a hardware bug [16:11:28] (disk/memory/pci corruption) [16:12:12] The Labs NFS server uses standard drives? Any RAID on top? [16:12:32] scfc_de: Software, the raid spans over more than one controller. [16:13:02] raid6 over 12 pairs of raid0 [16:13:19] (each half of the pairs on a different controller) [16:13:48] Coren: But that would catch "simple" disk errors, wouldn't it? [16:14:19] scfc_de: It would. That's no guarantee against corruption in-ram, or a bug. [16:14:40] Yep. [16:15:18] no it wouldn't [16:15:51] no it wouldn't what? [16:16:03] md doesn't catch corruptions [16:16:20] o_O how do you figure? [16:16:30] because it doesn't? [16:16:53] there's no checksumming on md [16:17:02] or read validation of what was written [16:17:23] periodically you ask md to check whether parity is correct [16:17:26] ... no checksumming on raid6? The parities are just for show? :-) [16:17:33] parity is not checksumming [16:18:17] Catches bad blocks or errors nonetheless. [16:18:30] bad blocks that propagate as i/o errors [16:18:36] not corruption [16:18:51] ... which the drive layer. Oh! You mean corruption that was /written/ to disk. [16:18:59] Well yeah, but then it comes from the kernel at some point.
[16:19:05] no kind of corruption [16:19:18] er, all kinds of corruptions even [16:19:24] md doesn't protect you from any of those [16:19:25] I think we're working on different definitions of "corruption" [16:19:43] whether it's a memory, disk, controller or firmware bug [16:20:04] To me, corruption simply means "reading data different than what you asked written" [16:20:11] yes, that's what corruption means to me too [16:20:18] New patchset: Mark Bergsma; "Tier 2 mobile cache backends should talk to port 3128 upstream" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71363 [16:20:18] New patchset: Ori.livneh; "Set common rsync and dsh parameters in mw-deployment-vars" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57890 [16:20:29] md doesn't offer protection from corruption [16:20:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71363 [16:21:49] filesystems are getting checksums nowadays to catch corruption issues in the whole stack [16:22:04] before they end up killing your filesystem that is :) [16:23:27] I'm not seeing your point. If you send a block to md, and read back something else, then there obviously was a bug in *md*. If the data was corrupted on-disk, md will catch /that/. [16:23:39] no, it won't [16:24:37] Well, unless the hardware doesn't flag the block error. [16:24:45] if you tell md to write a block, and md tells the disks to write it and one of the disks gets corrupted, md will notice when you run a check on the md again (in a week, a month or a year) but it wouldn't be able to tell you which of the disks is the corrupted one and which one is correct [16:25:16] New patchset: Mark Bergsma; "Pass $default_backend to VCL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71365 [16:25:19] paravoid: Yes you will. You'll know because the corrupted disk is the one that doesn't match the parity.
[16:25:31] parities in the case of raid6 [16:25:46] s/disk/block/ [16:26:25] But when you say "and one of the disks gets corrupted" you mean "and one of the disks gets corrupted and the ECC mysteriously still matches" [16:26:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71365 [16:26:52] what ecc? [16:27:13] ... the one on the disk? [16:27:17] parity errors are going to be detected when you force a parity check on the whole drive [16:28:14] Okay, we're talking past each other. [16:28:22] and "doesn't match the parity" doesn't mean it knows which one is correct and which one is wrong :) [16:28:36] md layer sends to disk. Disk writes on rust. If rust doesn't have proper ECC, disk returns error. [16:28:38] i.e. it will detect the error but wouldn't be able to repair it [16:29:05] raid6 will repair up to two faulty blocks off the array. [16:29:25] faulty != corrupted [16:29:58] if your disks contradict each other on what the data are, you don't know which disk is corrupted and which one isn't [16:30:24] Yes, you do. You'd need to have three "dissenting" disks to lose the ability to tell. [16:30:36] Regardless, if there is an md check that we could run, it might be nice to do that after the switch to put some minds at ease :-). [16:32:53] * Coren ponders. [16:33:34] Coren: what's the status? [16:33:59] Maybe two dissenting disks that do not report fault might be ambiguous. One isn't. [16:34:12] AzaToth: ~280 of 380G copied. [16:34:15] Coren: http://permalink.gmane.org/gmane.linux.raid/20937 [16:34:18] Coren: Neil Brown = md author [16:34:18] ok [16:34:54] As has been said elsewhere in this thread, silent corruption is rarely [16:34:54] if ever caused by the storage device. They tend to have strong CRCs [16:34:54] etc which detect bit-flips with greater reliability than the RAID6 [16:34:54] algorithm would detect them. [16:35:01] Point, set, match.
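The parity-vs-checksumming distinction being argued above can be made concrete with single XOR parity (RAID5-style; RAID6 adds a second syndrome, which changes the math, though per the Neil Brown quote md's check doesn't use it to pinpoint bad blocks). A toy Python sketch, purely illustrative and not how md is implemented: a scrub can detect a mismatch, but the mismatch alone doesn't identify the corrupted block.

```python
from functools import reduce

def xor_parity(blocks):
    # RAID5-style parity: XOR of all data blocks together.
    return reduce(lambda a, b: a ^ b, blocks)

data = [0b1001, 0b0110, 0b1111]
parity = xor_parity(data)

# Silent corruption: a block changes on disk but the drive reports
# no I/O error, so nothing notices at read time.
data[1] ^= 0b0001

# A later scrub recomputes the parity and sees a mismatch...
print(xor_parity(data) != parity)  # True
# ...but the mismatch by itself can't say WHICH block changed:
# flipping the same bit in data[0] or data[2] gives the same symptom.
```

This is why filesystems like ZFS and btrfs add per-block checksums on top of (or instead of) RAID parity: a checksum is tied to a specific block, so it both detects the corruption and identifies the bad copy.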
:-) [16:35:11] md doesn't do that [16:35:15] he says that there [16:35:30] But that's my point! [16:35:31] linux md offers no extra protection from corruption, period [16:35:59] New patchset: Mark Bergsma; "Pass $cluster_tier to VCL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71366 [16:36:23] *silent* corruption. A.k.a.: bug. [16:36:39] sigh. [16:36:43] ok, I'm getting tired [16:36:58] If your drive returns you different data than you wrote to it and doesn't flag the error, then it has failed. [16:37:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71366 [16:37:24] That is the Ur-example of "hardware failure" [16:38:18] Coren: do you think the glusterfuckfs issue could have been the underlying cause for the following issue I had late yesterday: https://bugzilla.wikimedia.org/show_bug.cgi?id=50480 [16:39:33] AzaToth: Hard to tell. "503 perhaps you'd like to try again" is hardly diagnostic-friendly. :-) [16:39:52] Coren: heh [16:41:30] New patchset: Mark Bergsma; "Set cp3005 frontend to use chash weight 100" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71367 [16:41:57] On the plus side, the snapshot has very very few errors. [16:42:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71367 [16:42:48] On tin, "sudo -u l10nupdate ssh terbium /bin/echo test" doesn't output "test", it just exits with status 1. Debug output from -vvv looks like it's connecting fine and sending the command, but the command doesn't seem to actually get run on the remote host. Anyone know what's the deal there? It seems to be causing bug 50433. [16:45:17] back [16:46:33] New patchset: Manybubbles; "Initial Debian packaging for jmxtrans." [operations/debs/jmxtrans] (master) - https://gerrit.wikimedia.org/r/71370 [16:48:33] New patchset: Manybubbles; "Initial Debian packaging for jmxtrans."
[operations/debs/jmxtrans] (master) - https://gerrit.wikimedia.org/r/71371 [16:48:52] grrr - those aren't new..... [16:52:47] manybubbles: you need to keep the Change-Id the same [16:53:42] anomie: l10nupdate's shell is set to /bin/false, so that's expected, no? [16:54:49] ori-l: Somehow or other it worked until recently, given that sync-l10nupdate-1 used to work and now doesn't actually sync stuff. [16:55:21] manybubbles both are dupes [16:55:45] AzaToth: indeed. [16:55:47] ori-l: I don't know if someone changed the shell or what. I don't see anything likely in the puppet repo. [16:56:12] AzaToth: git review is complaining that I'm trying to push 2 changes when one I _know_ is already up there. [16:56:27] manybubbles: you are on the wrong branch then [16:56:44] hmmm - like I'm trying to push to the wrong branch? [16:56:59] which branch are you on? [16:57:10] and which branch is specified in .gitreview? [16:57:28] and which branch is upstream of your current branch? [16:58:00] anomie: what breakage are you seeing? [16:58:21] manybubbles: hmm, I cloned the repo, and there is already a debian dir [16:58:32] ori-l: The message files don't actually get synced anywhere, because all the ssh sessions dsh fires off don't actually execute the rsync command. [16:58:41] ori-l: bug 50433. [16:58:57] AzaToth: on master there is one. It is upstream's attempt but doesn't really conform at all. [17:00:11] manybubbles: I can't find any .gitreview file in the repo [17:00:53] AzaToth: doesn't have one yet. part of the problem, I assume. [17:01:02] !reloaded exim on mchenry for config change [17:01:25] manybubbles: best is to make the first new commit to a repo the gitreview sole [17:02:08] AzaToth: I certainly know that now. [17:02:21] manybubbles: perhaps we should keep a standard in the naming of branches [17:02:23] But I'm not really sure what to do about it now that I haven't. [17:02:38] I use wmf [17:03:01] AzaToth: for which branch? debian packaging or upstream?
[17:03:25] wmf for "upstream" and wmf-debian for debian branch [17:03:32] see gerrit for example [17:05:00] New patchset: AzaToth; "adding gitreview" [operations/debs/jmxtrans] (debian) - https://gerrit.wikimedia.org/r/71373 [17:05:07] about 70gb to go... [17:05:17] apergos: gigabit? [17:05:20] anomie: on tin, zcat /var/log/l10nupdatelog/l10nupdate.log-20130701.gz | tail -30 [17:05:27] manybubbles: ↑ [17:05:35] of rsync [17:05:48] apergos: gigabit or gigabyte? [17:05:53] byte [17:06:11] ori-l: Unrelated. [17:06:56] New patchset: Manybubbles; "Initial Debian packaging for jmxtrans." [operations/debs/jmxtrans] (debian) - https://gerrit.wikimedia.org/r/71079 [17:07:14] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [17:08:33] AzaToth: cool. I imagine we can merge your change and then mine on top of it. [17:09:01] manybubbles: your change includes build.xml change [17:09:30] AzaToth: It does. [17:10:10] New review: AzaToth; "the debian branch shouldn'd make any changes outside the debian dir, build.xml shouldn't not be dire..." [operations/debs/jmxtrans] (debian) C: -1; - https://gerrit.wikimedia.org/r/71079 [17:11:40] AzaToth: isn't the point of git-buildpackage that it allows you to make such changes and builds the patch files appropriately? [17:11:59] anomie: i'm probably being daft, but the log file suggests that it last ran 1 Jul 2:00 UTC, and that is consistent with the mtime of files in /usr/local/apache/common-local/php-1.22wmf8/cache/l10n on the apaches [17:12:10] so i'm not sure what's wrong [17:12:17] manybubbles: you want to use gbp-pq [17:12:23] anomie / ori-l, wait, sync-l10nupdate-1 doesn't work any more? [17:12:30] csteipp: No.
[17:12:32] New patchset: Mark Bergsma; "Set cp3006 frontend to use chash weight 100" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71374 [17:12:39] Crappy [17:12:41] ori-l: Tim synced it manually at some point after 2:00 UTC, while fixing a different bug [17:12:53] So have to do a full scap to get l10n synced? [17:13:01] csteipp: It seems that the problem is that the l10nupdate user can't execute commands via ssh anymore. [17:13:30] csteipp: I'm trying to ping ops people to get someone to look into why it's not working. [17:13:35] So can l10n sync at all? Or do messages just not sync? [17:13:59] * csteipp turns around to get the attention of ops, and sees empty desks... [17:14:05] manybubbles: also you can't specify origin/wikimedia as upstream branch ヾ [17:14:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71374 [17:14:31] and you are not using format 3.0 [17:14:33] AzaToth: it has to be 'wikipedia'? [17:14:35] * apergos peeks in (there are other opsen here) [17:14:57] The cron job run nightly regenerates the messages fine. But then it doesn't actually sync the updated files out. It'll eventually go out when someone runs a scap under their normal user account, which may be why it wasn't noticed until the weekend. [17:15:03] * AaronSchulz verifies the general emptiness of said desks [17:15:19] csteipp: we get to bug them in person in 45 minutes :) [17:15:46] manybubbles: or wmf, or whatever, but not "origin" [17:15:48] I am having so much "fun" today. [17:15:57] yep, loads [17:16:33] AzaToth: could you provide me a link to format 3.0? [17:16:40] * apergos is tail -f kern.log again (since the reboot) [17:17:31] what command are you running exactly, and it's from tin right? [17:17:52] anomie: [17:18:24] manybubbles: man dpkg-source [17:19:11] apergos: tin, yes. For testing, "sudo -u l10nupdate ssh terbium /bin/echo test".
The actual problem is a few levels inside /usr/local/bin/l10nupdate-1 run from the l10nupdate user's crontab. [17:19:29] ok, I was just looking at the update-1 script [17:20:30] (/usr/local/bin/l10nupdate-1 calls /usr/local/bin/sync-l10nupdate-1, which calls dsh to do an rsync, which calls ssh to actually execute the command on all those remote hosts.) [17:21:42] !log csteipp synchronized php-1.22wmf9/extensions/CentralAuth 'Updating wmf9 to latest CentralAuth' [17:21:51] Logged the message, Master [17:22:34] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:23:24] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [17:23:40] New patchset: AzaToth; "Initial Debian packaging for jmxtrans." [operations/debs/jmxtrans] (debian) - https://gerrit.wikimedia.org/r/71079 [17:23:57] manybubbles: ↑ [17:24:24] hi [17:24:34] manybubbles: diff it to rev 2 [17:25:20] anybody in ops with info-en access care to respond to a ticket related to DNS configuration? [17:25:41] AzaToth: checking. gerrit seems just about dead for me though [17:26:30] lfaraone: i don't think I have otrs access [17:26:34] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:35] but you can forward it to us [17:26:41] mark: preferred address? [17:26:46] hostmaster@wikimedia.org would be appropriate ;) [17:27:00] New patchset: MaxSem; "Re-enable Nearby in Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71375 [17:27:32] Coren: progress? 
[17:28:44] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:44] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:29:45] 324/376 [17:29:53] AzaToth: ^^ [17:30:35] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [17:32:18] anomie: I don't think the problem is the l10nupdate user ssh access [17:32:41] apergos: What do you think it is then? [17:33:03] I don't know yet, but I just ran a simple command using the same dsh construct [17:33:15] sudo -u l10nupdate bash /home/l10nupdate/testme.sh [17:33:19] where testme.sh has [17:33:29] #!/bin/bash [17:33:29] dsh -o -oPasswordAuthentication=no -F 30 -cM -g mediawiki-installation \ [17:33:29] "sudo -u mwdeploy uptime" [17:33:35] so same dsh line, some boring command [17:33:50] and I got the two standard whines about [17:34:06] srv193 and mw1173 [17:34:13] Shouldn't you also be getting a line from every successful server with its uptime? [17:34:27] and btw it also complained it couldn't add som ersa host key to /home/l10nupdate/.ssh/blah [17:34:37] New patchset: CSteipp; "Use 302 redirects for central login on test wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71377 [17:34:38] so *some rsa [17:35:02] yes, I ought to see output, but I can at least tell I'm not being rejected on those other hosts [17:37:13] lfaraone: i actually have otrs access iirc [17:37:41] lfaraone: which ticket ? [17:37:54] * paravoid guesses it's about AXFR [17:38:28] apergos: Yeah. ssh seems to connect fine, but the command isn't getting executed. With -vvv, ssh says "debug1: sending command: ...", then "debug2: exec request accepted on channel 0", then "debug2: channel 0: read<=0 rfd 4 len 0" and "debug2: channel 0: read failed". If I connect to tin with agent forwarding and run it under my account, it instead actually reads the output from the command.
[17:38:52] that is weird indeed [17:39:25] AzaToth: I've reviewed what you did and it makes sense. I'd comment on gerrit but it isn't loading for me. Thanks very much. [17:40:05] LeslieCarr: 2013062010008035 [17:40:28] lfaraone: thanks, looking now [17:40:53] appreciated! maggie sent it over last week, but the correspondent emailed today saying they hadn't heard back [17:41:22] PROBLEM - Puppet freshness on mc1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:41:22] PROBLEM - Puppet freshness on mw41 is CRITICAL: No successful Puppet run in the last 10 hours [17:41:42] hahaha [17:41:44] omfg [17:41:54] what is it? [17:42:15] I am contacting you because I recently came across what should be our URL, nagios.wikimedia.org , but it turns out that a competitor of ours, icinga, has somehow redirected it to their website. [17:42:28] i do have a redirect from nagios.wikimedia to icinga.wikimedia [17:42:28] say what? [17:42:49] paravoid: the best part … it's a "marketing specialist" for nagios enterprises [17:42:56] yeah, hilarious [17:43:20] lfaraone: oh man, this is really funny -- and also not even a problem, though I will write back [17:43:22] PROBLEM - Puppet freshness on db1036 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:22] PROBLEM - Puppet freshness on amssq44 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:23] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: No successful Puppet run in the last 10 hours [17:43:23] oh my... [17:43:28] New patchset: CSteipp; "Use 302 redirects for central login on test wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71377 [17:43:31] I think they don't understand what the site is. Like, they think it is somehow related to Wikipedia and it "represents" them or something [17:43:34] LeslieCarr: I know, right? [17:43:47] :) [17:43:56] apergos: what's ms-be3's status?
it's been complaining about puppet for a while [17:44:22] PROBLEM - Puppet freshness on mw58 is CRITICAL: No successful Puppet run in the last 10 hours [17:44:22] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [17:44:22] PROBLEM - Puppet freshness on sq42 is CRITICAL: No successful Puppet run in the last 10 hours [17:44:49] lfaraone: ok, i feel silly but can't figure out how to reply to this person [17:44:57] it is? [17:45:15] The last Puppet run was at Mon Jul 1 17:39:36 UTC 2013 (4 minutes ago). [17:45:18] LeslieCarr: under "Compose Answer" , make sure "empty answer" is selected, then click "Compose" [17:45:39] lfaraone: i don't see compose answer -- is it possible i don't have the correct permissions ? [17:45:54] or maybe because it's "status lock" ? [17:45:58] which agrees with the puppet output in syslog over there [17:46:00] looks ok to me [17:46:07] LeslieCarr: http://i.imgur.com/CVvEPbl.png [17:46:08] and yet see above :) [17:46:15] LeslieCarr: ah, yes... I'll assign it to you. [17:46:41] apergos/paravoid gah! somehow snmptrapd died again [17:46:43] LeslieCarr: done, you should be able to do so [17:46:51] ah hah [17:46:51] i'm guessing it's due to the crazy inode situation again [17:46:52] noooes [17:47:42] so, a lot of files appear and the fs runs out of inodes [17:47:53] then someone deletes them and then it happens again after a while? [17:48:02] doesn't sound like a good strategy to me :) [17:48:34] no, we need to figure out the root cause, aka why snmptt is acting this way [17:48:56] New review: Anomie; "Looks sane." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/71377 [17:49:01] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71377 [17:56:09] wow there's a lot of pre-made stuff [17:56:37] lfaraone: i'm guessing that means that all of those are incredibly common problems you have to respond to ?
thanks for putting up with that :) [17:56:50] andrewbogott: hey [17:56:59] andrewbogott: what's your lab instance where you're trying rsync? [17:57:00] LeslieCarr: haha yes, and some less-common problems. happy to help :) [17:57:24] that pid file change of yours seems peculiar as you note, so I thought I might have a look too [17:57:35] and I was too lazy to create a new instance just for that :) [17:58:27] paravoid, puppet-cleanup-rsync2 [17:58:51] paravoid: I'm pretty sure that rsyncd manages its own process if you specify a pid file, and doesn't if you don't -- that best explains the behavior. [17:59:01] Of course I haven't read the manpage, 'cause I'm dumb [17:59:46] paravoid, the code in /var/lib/git/operations/puppet is just a live hack, rsynced from elsewhere. So feel free to blow away those changes. [17:59:51] !log csteipp synchronized wmf-config/InitialiseSettings.php 'pushing config flag for centralauth silent login' [17:59:59] Logged the message, Master [18:01:09] !log csteipp synchronized wmf-config/CommonSettings.php 'enablign silent redirects for test wikis' [18:01:18] Logged the message, Master [18:01:23] gerrit doesn't work for me [18:01:38] paravoid: more than usual? [18:01:45] yes [18:01:50] I can't open changesets at all [18:02:20] you tried hard refreshing? [18:03:31] (in a meeting) [18:04:19] manybubbles: there's still some to do with cleaning, as it modifies html files [18:07:59] AzaToth: I noticed that running `ant javadoc` modifies checked in files and I'm not really sure what to do about that. I had guessed this is intentional on upstream's part. 
[18:08:20] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [18:09:38] manybubbles: well, if you are planning to run javadoc, then you shouldn't have the html files in the repo in the first place [18:11:10] PROBLEM - Puppet freshness on aluminium is CRITICAL: No successful Puppet run in the last 10 hours [18:11:10] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: No successful Puppet run in the last 10 hours [18:11:10] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: No successful Puppet run in the last 10 hours [18:11:10] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: No successful Puppet run in the last 10 hours [18:11:10] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: No successful Puppet run in the last 10 hours [18:11:13] greg-g: I'm done with deploying.. you need me to join you in the techops meeting? Or is it all ve this week? [18:12:21] AzaToth: I agree but what is the policy when upstream does this? Do we drop the files in our upstream batch and add a gitignore for them and add them to the clean list? [18:13:10] PROBLEM - Puppet freshness on analytics1023 is CRITICAL: No successful Puppet run in the last 10 hours [18:13:11] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: No successful Puppet run in the last 10 hours [18:13:11] PROBLEM - Puppet freshness on db1029 is CRITICAL: No successful Puppet run in the last 10 hours [18:13:11] PROBLEM - Puppet freshness on es6 is CRITICAL: No successful Puppet run in the last 10 hours [18:13:11] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [18:13:11] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: No successful Puppet run in the last 10 hours [18:13:26] manybubbles: there's no policy how to fix it [18:13:53] it's always best if upstream fixes their shit, but often it's easier just to clean the shit up [18:14:18] AzaToth: I'm not sure they consider it broken. 
That, I suppose, is often the problem. [18:14:46] if they are using git, then it's obvious it's broken [18:15:10] PROBLEM - Puppet freshness on cp1002 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:10] PROBLEM - Puppet freshness on cp1049 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:11] PROBLEM - Puppet freshness on cp1005 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:11] PROBLEM - Puppet freshness on ekrem is CRITICAL: No successful Puppet run in the last 10 hours [18:15:11] PROBLEM - Puppet freshness on db39 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:11] PROBLEM - Puppet freshness on helium is CRITICAL: No successful Puppet run in the last 10 hours [18:15:11] PROBLEM - Puppet freshness on ms10 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:12] PROBLEM - Puppet freshness on mc1001 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:12] PROBLEM - Puppet freshness on mw1043 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:13] PROBLEM - Puppet freshness on mw1063 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:13] PROBLEM - Puppet freshness on mw1087 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:14] PROBLEM - Puppet freshness on mw1106 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:14] PROBLEM - Puppet freshness on mw124 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:15] PROBLEM - Puppet freshness on mw20 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:15] PROBLEM - Puppet freshness on mw43 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:16] PROBLEM - Puppet freshness on mw57 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:16] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:17] PROBLEM - Puppet freshness on srv255 is CRITICAL: No successful Puppet 
run in the last 10 hours [18:15:17] PROBLEM - Puppet freshness on stat1 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:18] PROBLEM - Puppet freshness on titanium is CRITICAL: No successful Puppet run in the last 10 hours [18:16:16] New patchset: Jgreen; "manually route donate.wikimedia.org mail to aluminium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71380 [18:16:30] I'll blast it in our wmf branch then. [18:17:36] anomie: I haven't forgotten you but I'm not making any progress on it either [18:18:24] New patchset: MaxSem; "Alternative way of setting resource paths" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71381 [18:18:25] apergos: Thanks for the update. I guess sshd on the remote host isn't logging anything useful? [18:19:47] New patchset: MaxSem; "Alternative way of setting resource paths" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71381 [18:20:00] not that I've found [18:21:05] New patchset: MaxSem; "Alternative way of setting resource paths" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71381 [18:21:51] apergos: Any way to start sshd with -ddd (on an alternate port, maybe)? [18:22:10] New patchset: Jgreen; "add donate.wikimedia.org to secondary mx relay_domains" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71382 [18:22:53] New patchset: Manybubbles; "Remove javadoc from source tree." [operations/debs/jmxtrans] (wikimedia) - https://gerrit.wikimedia.org/r/71384 [18:26:22] New patchset: Manybubbles; "Remove javadoc from source tree." [operations/debs/jmxtrans] (wikimedia) - https://gerrit.wikimedia.org/r/71384 [18:27:07] debug2: channel 0: read<=0 rfd 10 len -1 [18:27:07] debug2: channel 0: read failed [18:27:13] that's from the remote (mwxxx) side [18:28:10] Hmm. What's it trying to read there? [18:28:50] the command maybe?? 
[18:28:58] I mean this is well afer [18:28:59] after [18:29:28] debug1: monitor_child_preauth: l10nupdate has been authenticated by privileged process [18:29:30] can someone here help me recover my RT password? [18:29:58] and debug1: Entering interactive session for SSH2. [18:29:59] etc [18:30:07] jdlrobson: yep [18:30:25] thanks mutante [18:30:58] maybe it's some pty/tty thing but dang if I can see it [18:31:14] mutante: what do I need to do to get a reset? - i can't even remember what my username is on it which doesn't help [18:31:41] jdlrobson: hold on, i'll send a mail to the address it uses [18:31:52] thanks mutante [18:32:35] jdlrobson: so here's the thing , you have 2 users [18:32:53] do you get mail on both of these? jdrobson@ and jdlrobson@ ? [18:33:15] mutante: no mails on either yet.. [18:33:23] should be jdlrobson@ though [18:36:08] LeslieCarr: yeah, he still isn't happy; I tried to explain what you said in your mail to him previously but he's quite persistent. [18:36:17] jdlrobson: this is how it happens: if you mail RT you get an auto-created user with limited privileges and then there are also "full" users [18:36:22] LeslieCarr: I agree no action is actually required. [18:36:37] jdlrobson: use just "jrobson" as login and check that same inbox [18:37:49] lfaraone: so i forwarded that to legal [18:37:55] to see if we actually have to take it down or not [18:38:06] may have to refer him to legal if he is insistent [18:38:41] LeslieCarr: ah, okay. there's a "forward" link in OTRS so that the message is recorded and the reply goes back on the ticket, but that works too :) [18:39:02] oh oops [18:39:10] no worries, the UI sucks, but all support ticketing software sucks :/ [18:39:23] i'm only at OTRS level1 … i still need 3 more encounters before i can level up [18:39:30] haha. [18:44:18] So OTRS only displays the last 2k tickets you've answered, depressingly. [18:45:10] I just did a query, and lol... I've answered 1.6k tickets in the last year.
[18:46:50] my head hurts thinking of that [18:47:21] lfaraone: I don't have enough words to thank you [18:48:22] sumanah: happy to help :) [18:49:10] most of them are templated responses, so those are an easy "read ticket, click on template dropdown, type first letters of template, hit compose, send". [18:50:05] apergos: Things are up at 100%, as far as I can tell. I'm keeping a tail -f on syslog for the foreseeable future. [18:50:39] heh [18:50:44] I'm camped on there too [18:54:25] gerrit still doesn't open [18:54:38] wfm [18:54:59] <^demon|lunch> Using firefox? [18:55:01] oh..now "Working" [18:55:02] paravoid: It breaks for me too, but only in Firefox, only when logged in, and only when visiting certain changes but not others [18:55:18] yes, firefox 22 [18:55:24] * Jasper_Deng mentions he got that issue too [18:55:25] <^demon|lunch> RoanKattouw: I flushed all the on-disk caches this morning. [18:55:55] OK [18:56:30] I get the error at https://gerrit.wikimedia.org/r/#/c/69341/ [18:56:35] Pastebinning exception [18:57:19] Also seeing a number of 404s for avatars, not sure if that's related [18:57:25] <^demon> No [18:57:30] <^demon> That's been there since 2.6-rc1 [18:57:40] <^demon> Stupidly implemented feature. [18:57:47] Wow [18:57:57] Of course, because that's how the web works. Good job guys [18:58:05] Exception message: http://pastebin.com/kPitVBvK [18:59:03] <^demon> Have I badmouthed gwt yet today? No? F'ing gwt. [18:59:15] <^demon> RoanKattouw: That is quite possibly the least useful stacktrace ever. [18:59:43] Yes, exactly [18:59:44] RoanKattouw: Could you try running the query in gwt's debug mode? [18:59:47] Is there some sort of debug mode? [18:59:52] How do I trigger it? [19:00:10] add the url parameter dbg with value 1 [19:00:50] https://gerrit.wikimedia.org/r/?dbg=1#/c/69341/ [19:01:20] Of course it works fine in debug mode ^_^ [19:01:32] Harrr :-( [19:01:40] <^demon> Rage.
[19:02:09] ^demon: works again for me, in Iceweasel [19:03:23] <^demon> Why can't everyone just use webkit? ;-) [19:10:44] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [19:11:55] I could not get it to give me a projects list [19:12:18] New patchset: MaxSem; "Alternative way of setting resource paths" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71381 [19:12:35] ^demon: Because I want my browser to draw things correctly ;) [19:13:13] what makes user l10nupdate on mw* boxes have /bin/false as shell? where is that defined? [19:13:27] modules/mediawiki/manifests/users/l10nupdate.pp but i don't see it mentioning the shell [19:13:37] <^demon> mutante: Doesn't systemuser define /bin/false as the default shell? [19:14:19] yes [19:14:32] unless overridden [19:14:45] ^demon: that makes a lot of sense, yeah, it's a Systemuser [19:14:57] ah if it doesn't set it explicitly, yes [19:15:01] well, it seems we have a broken cron job that apergos was debugging [19:15:04] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [19:15:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [19:15:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [19:15:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [19:15:05] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [19:15:05] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [19:15:05] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [19:15:05] but this used to work?
[19:15:05] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [19:15:06] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [19:15:08] anomie: [19:15:10] that is broken because this can't run things without a shell [19:15:18] if i'd give it a bash, it works [19:15:27] New review: awjrichards; "LGTM!" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/71381 [19:16:33] 11:48 < apergos> ./files/misc/l10nupdate/l10nupdate-1:II$BINDIR/sync-l10nupdate-1 "$mwVerNum" [19:16:36] 11:48 < apergos> which turns out to be a dsh to the mw installation hosts, with that user name, and then run a command [19:17:00] yes, anomie reported the issue and did a bunch of checking on it earlier [19:17:03] <-- so running that won't work currently [19:17:40] apergos, mutante: Yeah, it used to work, until relatively recently. [19:17:44] this sounds like it just gets the MW version from all boxes? [19:17:55] any notion of around when it stopped working? [19:18:09] hmm [19:18:37] My guess is some time this week. Scaps during the week would have covered for it, but then the weekend came around and the translatewiki people noticed the translations weren't being updated. [19:20:30] Date: Mon May 20 13:43:57 2013 -0700 [19:20:31] add ssh keypair for l10nupdate user in deployment.pp for RT-5187 [19:20:51] already looked at it [19:20:56] on that day i took the key from fenari and .. [19:20:57] ok [19:20:59] don't see how that would have done it [19:21:09] also that's over a month back [19:21:10] yea, no [19:21:20] odd, doesn't look like it's been touched [19:50:14] Hm. Oddly apt typo, I accidentally referred to snapshots of the corrupted filesystem as "snapshits" :-) [19:50:24] :-D [19:50:48] that's probably a good note for me to afk on... have a better evening [19:50:55] and let's hope tomorrow sucks a lot less! [19:51:38] It'd take some effort for it to suck more.
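[The l10nupdate breakage discussed above can be reproduced in miniature: dsh/ssh hand the remote command to the target user's login shell as `shell -c 'command'`, and a shell of /bin/false exits immediately without running anything. A minimal sketch, not the actual l10nupdate setup:]

```shell
#!/bin/sh
# What dsh/ssh effectively run on the remote side is "<login shell> -c '<cmd>'".
# With /bin/false as the shell, the command is never executed and the
# exit status is 1:
/bin/false -c 'echo hello from l10nupdate'
echo "exit status with /bin/false: $?"
# With a real shell (what giving the systemuser bash would do), it works:
/bin/bash -c 'echo hello from l10nupdate'
```

[This is why "if i'd give it a bash, it works": only the login shell changes, not the command.]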
:-) [19:51:44] New review: Faidon; "AzaToth: that's not actually correct. debian/patches is preferred nowadays (and is, in fact, a requi..." [operations/debs/jmxtrans] (debian) C: -1; - https://gerrit.wikimedia.org/r/71079 [19:51:46] * Coren pretends he didn't say that. [19:51:48] dooon't [19:51:51] tempt [19:51:52] fate. [20:07:15] hasharCall: Jenkins seems to be broken :( [20:07:20] Krinkle: ---^^ [20:08:48] RoanKattouw: Based on what? [20:08:48] https://integration.wikimedia.org/zuul/ [20:09:09] There are jobs being spawned from what I can see, no errors [20:09:16] A large queue though, seems rush hour [20:09:25] but not stalled [20:09:25] Krinkle: https://gerrit.wikimedia.org/r/#/c/71512/ and lots of other changes where it hasn't run [20:09:27] Hmm, OK [20:09:36] I don't see it spawning VE jobs though [20:09:40] "Queue lengths: 347 events" [20:09:42] Yes [20:09:44] FIFO :) [20:09:51] Wait, *full* FIFO? [20:09:53] https://gerrit.wikimedia.org/r/#/c/71505/ ran now [20:10:00] RoanKattouw: nono, it is parallel [20:10:06] It's running gate-and-submit fine [20:10:13] But test isn't running for my new change [20:10:16] and it skips things with lookahead that don't need jenkins jobs (e.g. comments that don't contain "recheck") [20:10:17] !log updated Parsoid to 55d011d [20:10:24] Oh, fun [20:10:26] Logged the message, Master [20:10:31] This lookahead behavior is O(n^2), I bet? :( [20:10:44] the lookahead is to make it quicker, not slower. [20:10:49] RoanKattouw: busy processing https://integration.wikimedia.org/zuul/ :D [20:10:52] and it is working in a good way [20:10:55] I don't know the formula [20:11:06] RoanKattouw: Anyway, as long as the queue is non-empty and there are no visible errors, there's not much we can do. [20:11:12] OK [20:11:32] New review: Aklapper; "As explaining that the approach in the patch is unacceptable does not describe which options are con..."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404 [20:21:54] mutante, apergos: So, since the /bin/false shell is the problem, can we get an appropriate shell set? Or is there some other way for sync-l10nupdate-1 to work? [20:30:32] this is a really useful puppet pattern for /etc/foo.d-type directories: http://christian.hofstaedtler.name/blog/2008/11/puppet-managing-directories-recursively.html [20:30:51] i knew about these options individually but wasn't sure how to put them together in this way before [20:31:06] sharing here in case it's useful for anyone else [20:32:16] i'm going to use it to have puppet manage a mediawiki.d/puppet-managed folder which LocalSettings.php will glob and include [20:32:20] be careful with that tool [20:32:25] I've used it extensively in the past [20:32:28] it can get dangerous [20:32:51] by that I mean cascading failures that need multiple cycles to be fixed [20:33:30] caused by what? other tools / people writing to a recursively managed directory w/purge => true, only to have their files obliterated? [20:33:57] that [20:34:13] or, even worse, populating such a dir with exported resources [20:34:21] heh [20:35:27] Krinkle: the huge Zuul slowness spike that happens at that hour is due to the l10n bot sending a myriad of patch sets. That is a bunch of events, maybe up to 300. [20:35:29] !log Zuul was clogging up with ~ 350 queued up events. Added 1 extra executor on Jenkins master and slave. [20:35:38] Logged the message, Master [20:35:49] hashar: Indeed, that makes sense. [20:36:02] Krinkle: and I probably found a nasty cause of slowness in Zuul [20:36:02] Too bad it was in a middle of a deployment window [20:36:18] New patchset: Ryan Lane; "Add rsvg to OSM nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71519 [20:36:53] Krinkle: it does update the description of all jobs part of a change whenever an event happens. 
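[The puppet pattern being shared (and cautioned about) above combines recurse and purge on a managed directory so that anything Puppet doesn't know about gets deleted. A hedged sketch of that pattern; the mediawiki.d path is the one mentioned in the conversation and is assumed, not taken from a real manifest:]

```puppet
# Manage an /etc/foo.d-style directory recursively. Files placed here by
# other tools or people will be deleted on the next agent run -- this is
# the "cascading failures" risk called out above.
file { '/etc/mediawiki.d/puppet-managed':
  ensure  => directory,
  recurse => true,  # manage everything below this path
  purge   => true,  # remove files Puppet does not manage
  force   => true,  # also remove unmanaged subdirectories
}
```

[Pairing this with exported resources is the especially dangerous case noted above: a purge can race the collection of exports and obliterate files that will only be re-created on a later cycle.]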
So for mw/core which is maybe 9 jobs, whenever a job starts Zuul updates the 8 other job descriptions. Same happens whenever a job succeeds / fails etc :) [20:36:53] I guess I should spell out OSM [20:36:53] hashar: Having it kill the test pipeline if gate-and-submit is triggered would save like 50% [20:37:05] New patchset: Ryan Lane; "Add rsvg to OpenStackManager nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71519 [20:37:21] hashar: ... during l10nbot self-merging stuff that is. [20:37:49] or exclude it from `test` and let Jenkins merge :) But yeah filtering it out of `test` would be nice [20:37:54] haven't really looked at that [20:38:58] Krinkle: during busy hour, you can have a look at the Jenkins access log, grepping for Zuul's user agent ( urllib ) : tail -f /var/log/jenkins/access.log | fgrep urllib [20:39:34] Krinkle: I suspect Zuul is blocking on those job description API calls, but it might use a separate thread as well. Have to dig in the code [20:46:33] !log catrope synchronized php-1.22wmf9/extensions/VisualEditor 'Updating VE to master in wmf9' [20:46:42] Logged the message, Master [20:47:00] RoanKattouw: this is it? [20:47:15] AzaToth: This is the test to wmf9. [20:47:17] well, that was wmf9 [20:47:21] AzaToth: I.e., to MediaWiki.org [20:47:23] k [20:47:24] which is only on mediawiki.org/test/test2 [20:47:31] next (wmf8) will "be it" ;) [20:47:35] ah [20:47:38] AzaToth: Once we've confirmed there, we'll push to wmf8, test again, then throw the switch. [20:47:53] and throw the keys? [20:48:08] No key throwing, please. You could have someone's eye out. [20:48:09] :) [20:50:38] paravoid, hi, will you have time today to review varnish zero addition of a new carrier? we need to do a live test with them soonish [20:50:48] i am working on getting the change in [20:51:19] I'm about to call it a day [20:51:40] paravoid, should take me a few min [20:51:45] anyone else could do it?
[20:52:04] I can wait a few min [20:52:16] paravoid, thx, will ping you in a sec [20:52:16] but pretty much everyone in our team could do it [20:54:51] !log catrope synchronized php-1.22wmf9/extensions/VisualEditor 'Updating VE to master in wmf9; for real this time' [20:54:55] yurik: i can get that [20:55:01] Logged the message, Master [20:55:06] paravoid: you can go crash out if you want [20:55:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71519 [20:57:22] still wmf9 [20:57:36] * greg-g is impatient [20:57:47] New patchset: Yurik; "Added 404-01 (Aircel India)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71521 [20:58:00] LeslieCarr, paravoid ^ [20:58:02] thx! [20:59:12] yurik: fyi, most opsen, including the one on rt duty, can fix this up [21:00:09] LeslieCarr: Topic is out of date ;) [21:00:37] oh, how'd you know who is on rt duty ? [21:00:41] i don't even know :) [21:00:58] LeslieCarr, it's a random generation process ;) [21:01:01] !log catrope synchronized php-1.22wmf8/extensions/TimedMediaHandler/ 'Deploy TMH API cherry-pick to wmf8 (already in wmf9)' [21:01:10] Logged the message, Master [21:01:37] lol [21:01:42] I don't [21:01:48] I just read it as being Rob... [21:02:54] !log catrope synchronized php-1.22wmf8/extensions/VisualEditor/ 'Update VE to master in wmf8 (did wmf9 earlier)' [21:03:03] Logged the message, Master [21:04:05] yurik: good to merge ? [21:04:34] LeslieCarr, i hope so :) [21:04:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71521 [21:05:20] thanks LeslieCarr !!
[21:05:40] YW [21:05:42] oops [21:05:43] yw [21:05:45] hey yurik [21:05:46] https://gerrit.wikimedia.org/r/40337 [21:06:02] argh, i guess this channel isn't quite appropriate, sorry :P [21:06:06] * yurik hides under last year's leaves [21:07:21] New patchset: Helder.wiki; "Install ArticleFeedbackv5 on pt.wikibooks.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71524 [21:10:22] New patchset: Dzahn; "let systemuser l10nupdate have bash as shell so it can run l10nupdate scripts via dsh on new mw* hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71525 [21:10:37] !change 71525 | anomie [21:10:37] anomie: https://gerrit.wikimedia.org/r/#q,71525,n,z [21:10:37] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [21:11:16] mutante: Looks sane to me [21:12:47] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [21:13:31] !log catrope Started syncing Wikimedia installation... : Scap for i18n changes in VE [21:13:41] Logged the message, Master [21:18:49] New patchset: Lcarr; "giving neon its own gmetad to reduce disk usage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71529 [21:21:00] New patchset: Lcarr; "giving neon its own gmetad to reduce disk usage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71529 [21:22:58] !log catrope Finished syncing Wikimedia installation... : Scap for i18n changes in VE [21:23:07] Logged the message, Master [21:23:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71529 [21:27:39] .... and we're live?
[21:27:47] Almost [21:27:54] Code updates are out, about to switch the config over [21:27:58] * greg-g nods [21:31:38] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70652 [21:32:01] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71261 [21:33:41] !log updated Parsoid to 613287b [21:33:51] Logged the message, Master [21:35:14] !log catrope synchronized wmf-config/CommonSettings.php 'Plumbing for VE settings' [21:35:24] Logged the message, Master [21:36:44] !log catrope synchronized wmf-config/InitialiseSettings.php 'Enable VisualEditor for all logged-in users on enwiki' [21:36:52] ding ding ding [21:36:53] Logged the message, Master [21:37:32] wooooo, congrats [21:37:45] man, shit's so about to go down. [21:38:48] congrats! [21:41:38] New patchset: Anomie; "let systemuser l10nupdate have bash as shell so it can run l10nupdate scripts via dsh on new mw* hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71525 [21:43:11] Did anyone report the interactive map pop-up on enwiki Coordinates as broken? Several times http://toolserver.org/~dschwen/wma/wikiminiatlas_extern.php returned 502 Bad Gateway. [21:44:13] spagewmf: i got this 502 from time to time [21:44:49] matanya thanks. By the time I tried to repro in Chromium it was working but sluggish. [21:45:19] let me check a sec [21:45:58] anomie: but /bin/false as shell for systemusers hasn't changed recently; is there an explanation for why the issue started, well, being an issue now? [21:46:16] spagewmf: if it's on toolserver.org it's best to report in #wikimedia-toolserver [21:46:32] ori-l: No idea. Maybe it was broken since early May and no one noticed until now. 
[21:46:33] will do [21:46:33] pasting that [21:46:49] anomie: ah, fair enough [21:46:58] (I find that a bit unlikely, but it's certainly possible) [21:47:52] spagewmf: not responding, but no gateway error, i'd recommend moving this to the labs [21:47:52] ori-l: yea, i guess that theory is the likely one [21:48:12] but on the other hand, we did add the ssh key for that user back then [21:48:24] and it sounded resolved [21:48:25] mutante: what is the process of merging a fix, do you know? [21:48:50] ori-l: by "add" i meant taking it from fenari and copying it to tin [21:49:05] matanya: fix for which product or repo? [21:49:13] core [21:49:23] operations i think i know [21:50:17] matanya: http://www.mediawiki.org/wiki/Gerrit/Tutorial#How_to_submit_a_patch [21:50:21] spagewmf: Toolserver is having issues in general. gateway errors have been on/off for almost 3 days now [21:50:40] mutante: it seems like a reasonable thing to try, but i'd still note the ambiguity in the bug so that it isn't misleading [21:53:03] thanks mutante my question was about internal policy of merging not technical how to, i did submit patches in the past :) [21:54:05] matanya: ah, i see, well dev people know better [21:54:19] thank you [21:54:26] i'm pretty sure it depends how important the fix is [21:54:39] if it's security related you should mail security first [21:56:01] New patchset: Mattflaschen; "Enable gender survey on English Wikipedia." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71533 [21:58:02] "The one exception is localisation and internationalisation commits, which will be able to be pushed without review." [21:58:51] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [22:01:49] New patchset: Mattflaschen; "Remove trailing whitespace." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71534 [22:02:10] New patchset: Catrope; "Give Gabriel shell on the Parsoid Varnishes again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/71535 [22:02:46] New patchset: Anomie; "Update protection configs for core change I6bf650a3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71538 [22:03:47] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71533 [22:05:17] New patchset: Mattflaschen; "Remove enwiki override for wgVisualEditorEnableSplitTest." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71540 [22:09:22] New review: Spage; "White space only, php -l OK." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/71534 [22:09:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71534 [22:13:53] New patchset: Mattflaschen; "Enable gender survey on test and test2." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71541 [22:16:26] New review: Spage; "thanks" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/71541 [22:16:37] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71541 [22:17:40] !log restarting neon for new kernel [22:17:49] Logged the message, Mistress of the network gear. [22:25:35] LeslieCarr: is that specifically programmed to call you that? [22:25:52] YuviPanda: yes [22:26:08] :D very nice! [22:26:30] * YuviPanda remembers to start using that in regular IRL conversations next time he comes across LeslieCarr [22:26:36] that, and 'sudo santa' [22:26:56] hehe [22:39:46] ok, for some awful reason, icinga isn't reading the snmp results [22:40:02] LeslieCarr: after reboot, right [22:40:07] also before reboot [22:41:47] lemme clean the spool dir once more [22:42:14] snmptt ? 
[22:42:17] !log deleting files from snmptt spool dir on neon [22:42:18] yea [22:42:26] Logged the message, Master [22:42:43] so why is snmptt letting millions of files stack up [22:42:48] this has been happening a lot the last few weeks [22:42:51] but not before.. [22:43:18] i have this quick fix pending: https://gerrit.wikimedia.org/r/#/c/71149/2/manifests/misc/icinga.pp but not the root cause [22:44:14] maybe it crashes, restarts but then forgot about old files it would usually clean up? [22:44:24] because it also doesn't seem to do this consistently [22:44:37] like when ariel and i took random looks at it, sometimes it was fine after a restart [22:44:50] also note the existing older FIXME about it crashing [22:45:01] oh [22:45:03] i wonder [22:45:21] /var/spool/snmptt is root:root [22:45:30] but snmptt is both root and user snmptt [22:46:32] oh [22:46:37] yeah, it spawns a child [22:46:37] did you delete the whole directory before ? [22:46:38] RECOVERY - Puppet freshness on mw35 is OK: puppet ran at Mon Jul 1 22:46:32 UTC 2013 [22:46:38] RECOVERY - Puppet freshness on mw98 is OK: puppet ran at Mon Jul 1 22:46:32 UTC 2013 [22:46:38] RECOVERY - Puppet freshness on mw1046 is OK: puppet ran at Mon Jul 1 22:46:32 UTC 2013 [22:46:38] RECOVERY - Puppet freshness on mw106 is OK: puppet ran at Mon Jul 1 22:46:32 UTC 2013 [22:46:38] RECOVERY - Puppet freshness on db1034 is OK: puppet ran at Mon Jul 1 22:46:33 UTC 2013 [22:46:39] ├─snmptt /usr/sbin/snmptt --daemon │ └─snmptt,snmptt /usr/sbin/snmptt --daemon [22:46:39] and recreate it ? [22:47:15] because that could totally be it.. it freaked out once, then was unable to clean out the directory afterwards [22:47:17] hmm..
at least not the last 2 times or so [22:47:34] but still could be it [22:47:35] well, at least it's happy now [22:47:48] RECOVERY - Puppet freshness on srv262 is OK: puppet ran at Mon Jul 1 22:47:42 UTC 2013 [22:47:48] RECOVERY - Puppet freshness on analytics1009 is OK: puppet ran at Mon Jul 1 22:47:42 UTC 2013 [22:47:48] RECOVERY - Puppet freshness on db77 is OK: puppet ran at Mon Jul 1 22:47:42 UTC 2013 [22:47:48] RECOVERY - Puppet freshness on mc4 is OK: puppet ran at Mon Jul 1 22:47:42 UTC 2013 [22:47:48] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Mon Jul 1 22:47:47 UTC 2013 [22:47:49] RECOVERY - Puppet freshness on ms-fe2 is OK: puppet ran at Mon Jul 1 22:47:47 UTC 2013 [22:47:49] RECOVERY - Puppet freshness on mw13 is OK: puppet ran at Mon Jul 1 22:47:47 UTC 2013 [22:47:50] RECOVERY - Puppet freshness on sq68 is OK: puppet ran at Mon Jul 1 22:47:47 UTC 2013 [22:47:58] RECOVERY - Puppet freshness on db1058 is OK: puppet ran at Mon Jul 1 22:47:52 UTC 2013 [22:47:58] RECOVERY - Puppet freshness on ms-be1007 is OK: puppet ran at Mon Jul 1 22:47:52 UTC 2013 [22:47:58] RECOVERY - Puppet freshness on db65 is OK: puppet ran at Mon Jul 1 22:47:52 UTC 2013 [22:47:58] RECOVERY - Puppet freshness on analytics1004 is OK: puppet ran at Mon Jul 1 22:47:52 UTC 2013 [22:47:58] did you change permissions ? 
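[The spool-dir triage above boils down to: check who owns the directory (snmptt drops privileges to a child process, so root:root ownership can leave that child unable to clean up), count the backlog, and clear it. A generic sketch; the real path on neon was /var/spool/snmptt, but this demo uses a throwaway directory so it is safe to run anywhere:]

```shell
#!/bin/sh
# Stand-in for /var/spool/snmptt: a throwaway directory with fake trap files.
spool=$(mktemp -d)
touch "$spool/trap.1" "$spool/trap.2" "$spool/trap.3"
# Triage: directory ownership (owner:group) and backlog size.
ls -ld "$spool" | awk '{print $3 ":" $4}'
echo "queued traps: $(find "$spool" -type f | wc -l)"
# Manual cleanup, as was done on neon, then restart the daemons.
rm -f "$spool"/*
echo "queued traps after cleanup: $(find "$spool" -type f | wc -l)"
rmdir "$spool"
```

[Whether the root:root ownership was actually the root cause is left open in the log; the sketch only mirrors the checks performed.]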
[22:47:59] RECOVERY - Puppet freshness on wtp1002 is OK: puppet ran at Mon Jul 1 22:47:57 UTC 2013 [22:47:59] RECOVERY - Puppet freshness on cp1038 is OK: puppet ran at Mon Jul 1 22:47:57 UTC 2013 [22:48:08] RECOVERY - Puppet freshness on mw1020 is OK: puppet ran at Mon Jul 1 22:48:02 UTC 2013 [22:48:08] RECOVERY - Puppet freshness on mw1179 is OK: puppet ran at Mon Jul 1 22:48:02 UTC 2013 [22:48:08] RECOVERY - Puppet freshness on search1015 is OK: puppet ran at Mon Jul 1 22:48:02 UTC 2013 [22:48:13] i restarted snmptt and snmptrapd after the spool dir had been cleaned [22:48:18] RECOVERY - Puppet freshness on srv272 is OK: puppet ran at Mon Jul 1 22:48:13 UTC 2013 [22:48:18] RECOVERY - Puppet freshness on mw99 is OK: puppet ran at Mon Jul 1 22:48:13 UTC 2013 [22:48:18] RECOVERY - Puppet freshness on srv264 is OK: puppet ran at Mon Jul 1 22:48:13 UTC 2013 [22:48:18] RECOVERY - Puppet freshness on mw39 is OK: puppet ran at Mon Jul 1 22:48:13 UTC 2013 [22:48:18] RECOVERY - Puppet freshness on mw1169 is OK: puppet ran at Mon Jul 1 22:48:13 UTC 2013 [22:48:28] RECOVERY - Puppet freshness on mw1053 is OK: puppet ran at Mon Jul 1 22:48:18 UTC 2013 [22:48:28] RECOVERY - Puppet freshness on search1023 is OK: puppet ran at Mon Jul 1 22:48:18 UTC 2013 [22:48:28] RECOVERY - Puppet freshness on mw95 is OK: puppet ran at Mon Jul 1 22:48:18 UTC 2013 [22:48:28] RECOVERY - Puppet freshness on ms-be1009 is OK: puppet ran at Mon Jul 1 22:48:18 UTC 2013 [22:48:28] RECOVERY - Puppet freshness on search1021 is OK: puppet ran at Mon Jul 1 22:48:18 UTC 2013 [22:48:29] RECOVERY - Puppet freshness on mw1198 is OK: puppet ran at Mon Jul 1 22:48:18 UTC 2013 [22:48:29] RECOVERY - Puppet freshness on mw1083 is OK: puppet ran at Mon Jul 1 22:48:18 UTC 2013 [22:48:30] RECOVERY - Puppet freshness on mw1136 is OK: puppet ran at Mon Jul 1 22:48:18 UTC 2013 [22:48:30] RECOVERY - Puppet freshness on mw1094 is OK: puppet ran at Mon Jul 1 22:48:18 UTC 2013 [22:48:31] RECOVERY - Puppet freshness on 
mw1116 is OK: puppet ran at Mon Jul 1 22:48:18 UTC 2013 [22:48:31] RECOVERY - Puppet freshness on mw1074 is OK: puppet ran at Mon Jul 1 22:48:23 UTC 2013 [22:48:32] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Mon Jul 1 22:48:23 UTC 2013 [22:48:36] !log restarted snmptt and snmptrapd on neon [22:48:38] RECOVERY - Puppet freshness on analytics1008 is OK: puppet ran at Mon Jul 1 22:48:28 UTC 2013 [22:48:38] RECOVERY - Puppet freshness on sq50 is OK: puppet ran at Mon Jul 1 22:48:28 UTC 2013 [22:48:38] RECOVERY - Puppet freshness on db57 is OK: puppet ran at Mon Jul 1 22:48:28 UTC 2013 [22:48:38] RECOVERY - Puppet freshness on db1024 is OK: puppet ran at Mon Jul 1 22:48:33 UTC 2013 [22:48:48] Logged the message, Master [22:48:48] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Mon Jul 1 22:48:38 UTC 2013 [22:48:48] RECOVERY - Puppet freshness on es5 is OK: puppet ran at Mon Jul 1 22:48:38 UTC 2013 [22:48:48] RECOVERY - Puppet freshness on cp1065 is OK: puppet ran at Mon Jul 1 22:48:38 UTC 2013 [22:48:48] RECOVERY - Puppet freshness on mc1008 is OK: puppet ran at Mon Jul 1 22:48:38 UTC 2013 [22:48:48] RECOVERY - Puppet freshness on db51 is OK: puppet ran at Mon Jul 1 22:48:43 UTC 2013 [22:48:49] RECOVERY - Puppet freshness on wtp1013 is OK: puppet ran at Mon Jul 1 22:48:43 UTC 2013 [22:48:58] RECOVERY - Puppet freshness on stat1002 is OK: puppet ran at Mon Jul 1 22:48:48 UTC 2013 [22:48:58] RECOVERY - Puppet freshness on mw10 is OK: puppet ran at Mon Jul 1 22:48:49 UTC 2013 [22:48:58] RECOVERY - Puppet freshness on mw117 is OK: puppet ran at Mon Jul 1 22:48:49 UTC 2013 [22:48:58] RECOVERY - Puppet freshness on mw1066 is OK: puppet ran at Mon Jul 1 22:48:54 UTC 2013 [22:48:58] RECOVERY - Puppet freshness on mw33 is OK: puppet ran at Mon Jul 1 22:48:54 UTC 2013 [22:48:59] RECOVERY - Puppet freshness on mw26 is OK: puppet ran at Mon Jul 1 22:48:54 UTC 2013 [22:48:59] RECOVERY - Puppet freshness on mw36 is OK: puppet ran at Mon Jul 1 22:48:54 
UTC 2013 [22:49:08] RECOVERY - Puppet freshness on mw1143 is OK: puppet ran at Mon Jul 1 22:48:59 UTC 2013 [22:49:08] RECOVERY - Puppet freshness on mw64 is OK: puppet ran at Mon Jul 1 22:48:59 UTC 2013 [22:49:08] RECOVERY - Puppet freshness on analytics1011 is OK: puppet ran at Mon Jul 1 22:49:04 UTC 2013 [22:49:08] RECOVERY - Puppet freshness on analytics1019 is OK: puppet ran at Mon Jul 1 22:49:04 UTC 2013 [22:49:28] RECOVERY - Puppet freshness on tarin is OK: puppet ran at Mon Jul 1 22:49:24 UTC 2013 [22:49:38] RECOVERY - Puppet freshness on db46 is OK: puppet ran at Mon Jul 1 22:49:29 UTC 2013 [22:49:38] RECOVERY - Puppet freshness on mc16 is OK: puppet ran at Mon Jul 1 22:49:29 UTC 2013 [22:49:38] RECOVERY - Puppet freshness on mw1208 is OK: puppet ran at Mon Jul 1 22:49:29 UTC 2013 [22:49:38] RECOVERY - Puppet freshness on db55 is OK: puppet ran at Mon Jul 1 22:49:34 UTC 2013 [22:49:38] RECOVERY - Puppet freshness on labsdb1001 is OK: puppet ran at Mon Jul 1 22:49:34 UTC 2013 [22:49:48] RECOVERY - Puppet freshness on snapshot4 is OK: puppet ran at Mon Jul 1 22:49:39 UTC 2013 [22:49:48] RECOVERY - Puppet freshness on analytics1022 is OK: puppet ran at Mon Jul 1 22:49:39 UTC 2013 [22:49:48] RECOVERY - Puppet freshness on solr1001 is OK: puppet ran at Mon Jul 1 22:49:40 UTC 2013 [22:49:48] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Mon Jul 1 22:49:40 UTC 2013 [22:49:48] RECOVERY - Puppet freshness on ssl4 is OK: puppet ran at Mon Jul 1 22:49:45 UTC 2013 [22:49:49] RECOVERY - Puppet freshness on cp1023 is OK: puppet ran at Mon Jul 1 22:49:45 UTC 2013 [22:49:49] RECOVERY - Puppet freshness on wtp1021 is OK: puppet ran at Mon Jul 1 22:49:45 UTC 2013 [22:49:50] RECOVERY - Puppet freshness on es1009 is OK: puppet ran at Mon Jul 1 22:49:45 UTC 2013 [22:49:50] RECOVERY - Puppet freshness on db67 is OK: puppet ran at Mon Jul 1 22:49:45 UTC 2013 [22:49:55] New patchset: Se4598; "Disable AFTv5 feedback submission on dewiki" [operations/mediawiki-config] 
(master) - https://gerrit.wikimedia.org/r/71546 [22:49:58] RECOVERY - Puppet freshness on mw109 is OK: puppet ran at Mon Jul 1 22:49:50 UTC 2013 [22:49:58] RECOVERY - Puppet freshness on search1005 is OK: puppet ran at Mon Jul 1 22:49:50 UTC 2013 [22:49:58] RECOVERY - Puppet freshness on mw70 is OK: puppet ran at Mon Jul 1 22:49:50 UTC 2013 [22:49:58] RECOVERY - Puppet freshness on srv296 is OK: puppet ran at Mon Jul 1 22:49:50 UTC 2013 [22:49:58] RECOVERY - Puppet freshness on mw53 is OK: puppet ran at Mon Jul 1 22:49:55 UTC 2013 [22:49:59] RECOVERY - Puppet freshness on mw1185 is OK: puppet ran at Mon Jul 1 22:49:55 UTC 2013 [22:49:59] RECOVERY - Puppet freshness on mw1218 is OK: puppet ran at Mon Jul 1 22:49:55 UTC 2013 [22:50:00] RECOVERY - Puppet freshness on mw1001 is OK: puppet ran at Mon Jul 1 22:49:55 UTC 2013 [22:50:00] RECOVERY - Puppet freshness on mw1062 is OK: puppet ran at Mon Jul 1 22:49:55 UTC 2013 [22:50:08] RECOVERY - Puppet freshness on mw1022 is OK: puppet ran at Mon Jul 1 22:50:00 UTC 2013 [22:50:08] RECOVERY - Puppet freshness on mw1132 is OK: puppet ran at Mon Jul 1 22:50:00 UTC 2013 [22:50:08] RECOVERY - Puppet freshness on mw1040 is OK: puppet ran at Mon Jul 1 22:50:00 UTC 2013 [22:50:11] New review: Se4598; "DO NO SUBMIT/MERGE RIGHT NOW!" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/71546 [22:50:15] LeslieCarr: could have also been snmptrapd restart now.. 
too many daemons != KISS :p [22:50:33] owner [22:50:38] RECOVERY - Puppet freshness on sq55 is OK: puppet ran at Mon Jul 1 22:50:30 UTC 2013 [22:50:38] RECOVERY - Puppet freshness on sq81 is OK: puppet ran at Mon Jul 1 22:50:30 UTC 2013 [22:50:38] RECOVERY - Puppet freshness on sq64 is OK: puppet ran at Mon Jul 1 22:50:30 UTC 2013 [22:50:38] RECOVERY - Puppet freshness on mc1011 is OK: puppet ran at Mon Jul 1 22:50:30 UTC 2013 [22:50:38] RECOVERY - Puppet freshness on amssq59 is OK: puppet ran at Mon Jul 1 22:50:30 UTC 2013 [22:50:39] RECOVERY - Puppet freshness on aluminium is OK: puppet ran at Mon Jul 1 22:50:30 UTC 2013 [22:50:39] RECOVERY - Puppet freshness on ms-fe1003 is OK: puppet ran at Mon Jul 1 22:50:35 UTC 2013 [22:50:48] RECOVERY - Puppet freshness on amssq40 is OK: puppet ran at Mon Jul 1 22:50:41 UTC 2013 [22:50:48] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Mon Jul 1 22:50:41 UTC 2013 [22:50:48] RECOVERY - Puppet freshness on mw1138 is OK: puppet ran at Mon Jul 1 22:50:46 UTC 2013 [22:50:48] RECOVERY - Puppet freshness on srv285 is OK: puppet ran at Mon Jul 1 22:50:46 UTC 2013 [22:50:48] RECOVERY - Puppet freshness on cp1025 is OK: puppet ran at Mon Jul 1 22:50:46 UTC 2013 [22:50:58] RECOVERY - Puppet freshness on mw111 is OK: puppet ran at Mon Jul 1 22:50:51 UTC 2013 [22:50:58] RECOVERY - Puppet freshness on mw1080 is OK: puppet ran at Mon Jul 1 22:50:51 UTC 2013 [22:50:58] RECOVERY - Puppet freshness on search1011 is OK: puppet ran at Mon Jul 1 22:50:51 UTC 2013 [22:50:58] RECOVERY - Puppet freshness on mw1219 is OK: puppet ran at Mon Jul 1 22:50:51 UTC 2013 [22:50:58] RECOVERY - Puppet freshness on cp3006 is OK: puppet ran at Mon Jul 1 22:50:51 UTC 2013 [22:50:59] RECOVERY - Puppet freshness on mw1072 is OK: puppet ran at Mon Jul 1 22:50:56 UTC 2013 [22:50:59] RECOVERY - Puppet freshness on search1006 is OK: puppet ran at Mon Jul 1 22:50:56 UTC 2013 [22:51:00] RECOVERY - Puppet freshness on mw1031 is OK: puppet ran at Mon 
Jul 1 22:50:56 UTC 2013 [22:51:08] RECOVERY - Puppet freshness on mw87 is OK: puppet ran at Mon Jul 1 22:51:01 UTC 2013 [22:51:08] RECOVERY - Puppet freshness on db31 is OK: puppet ran at Mon Jul 1 22:51:01 UTC 2013 [22:51:08] RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Mon Jul 1 22:51:01 UTC 2013 [22:51:08] RECOVERY - Puppet freshness on ms-be10 is OK: puppet ran at Mon Jul 1 22:51:01 UTC 2013 [22:51:08] RECOVERY - Puppet freshness on professor is OK: puppet ran at Mon Jul 1 22:51:01 UTC 2013 [22:51:09] RECOVERY - Puppet freshness on mc2 is OK: puppet ran at Mon Jul 1 22:51:06 UTC 2013 [22:51:48] RECOVERY - Puppet freshness on mw1093 is OK: puppet ran at Mon Jul 1 22:51:46 UTC 2013 [22:51:58] RECOVERY - Puppet freshness on sq78 is OK: puppet ran at Mon Jul 1 22:51:51 UTC 2013 [22:51:58] RECOVERY - Puppet freshness on search27 is OK: puppet ran at Mon Jul 1 22:51:51 UTC 2013 [22:51:58] RECOVERY - Puppet freshness on sq82 is OK: puppet ran at Mon Jul 1 22:51:51 UTC 2013 [22:51:58] RECOVERY - Puppet freshness on cp1021 is OK: puppet ran at Mon Jul 1 22:51:51 UTC 2013 [22:51:58] RECOVERY - Puppet freshness on magnesium is OK: puppet ran at Mon Jul 1 22:51:51 UTC 2013 [22:52:22] New patchset: Aude; "update Wikibase setting variables" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71547 [22:52:34] LeslieCarr: gotcha, and the dir is created by the deb package, ack [22:53:08] RECOVERY - Puppet freshness on sq59 is OK: puppet ran at Mon Jul 1 22:52:57 UTC 2013 [22:53:08] RECOVERY - Puppet freshness on db59 is OK: puppet ran at Mon Jul 1 22:52:57 UTC 2013 [22:53:08] RECOVERY - Puppet freshness on db44 is OK: puppet ran at Mon Jul 1 22:52:57 UTC 2013 [22:53:08] RECOVERY - Puppet freshness on db1045 is OK: puppet ran at Mon Jul 1 22:52:57 UTC 2013 [22:53:08] RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Mon Jul 1 22:52:57 UTC 2013 [22:53:08] RECOVERY - Puppet freshness on sq37 is OK: puppet ran at Mon Jul 1 22:52:57 UTC 
2013 [22:53:09] RECOVERY - Puppet freshness on db56 is OK: puppet ran at Mon Jul 1 22:52:58 UTC 2013 [22:53:09] RECOVERY - Puppet freshness on mw1203 is OK: puppet ran at Mon Jul 1 22:52:58 UTC 2013 [22:53:10] RECOVERY - Puppet freshness on cp1054 is OK: puppet ran at Mon Jul 1 22:52:58 UTC 2013 [22:53:10] RECOVERY - Puppet freshness on cp1053 is OK: puppet ran at Mon Jul 1 22:52:58 UTC 2013 [22:53:11] RECOVERY - Puppet freshness on tmh1 is OK: puppet ran at Mon Jul 1 22:52:58 UTC 2013 [22:53:11] RECOVERY - Puppet freshness on mc1006 is OK: puppet ran at Mon Jul 1 22:52:58 UTC 2013 [22:53:12] RECOVERY - Puppet freshness on stat1001 is OK: puppet ran at Mon Jul 1 22:52:58 UTC 2013 [22:53:12] RECOVERY - Puppet freshness on ssl3001 is OK: puppet ran at Mon Jul 1 22:52:58 UTC 2013 [22:53:13] RECOVERY - Puppet freshness on lvs6 is OK: puppet ran at Mon Jul 1 22:52:58 UTC 2013 [22:53:13] RECOVERY - Puppet freshness on es1008 is OK: puppet ran at Mon Jul 1 22:52:58 UTC 2013 [22:53:14] RECOVERY - Puppet freshness on mw1220 is OK: puppet ran at Mon Jul 1 22:53:03 UTC 2013 [22:53:14] RECOVERY - Puppet freshness on arsenic is OK: puppet ran at Mon Jul 1 22:53:03 UTC 2013 [22:53:15] RECOVERY - Puppet freshness on mw1145 is OK: puppet ran at Mon Jul 1 22:53:03 UTC 2013 [22:53:15] RECOVERY - Puppet freshness on mw1112 is OK: puppet ran at Mon Jul 1 22:53:03 UTC 2013 [22:53:16] RECOVERY - Puppet freshness on db1057 is OK: puppet ran at Mon Jul 1 22:53:03 UTC 2013 [22:53:16] RECOVERY - Puppet freshness on db1056 is OK: puppet ran at Mon Jul 1 22:53:03 UTC 2013 [22:53:17] RECOVERY - Puppet freshness on db36 is OK: puppet ran at Mon Jul 1 22:53:03 UTC 2013 [22:53:18] RECOVERY - Puppet freshness on ssl1002 is OK: puppet ran at Mon Jul 1 22:53:08 UTC 2013 [22:53:18] RECOVERY - Puppet freshness on linne is OK: puppet ran at Mon Jul 1 22:53:08 UTC 2013 [22:53:28] RECOVERY - Puppet freshness on mw1135 is OK: puppet ran at Mon Jul 1 22:53:18 UTC 2013 [22:53:28] RECOVERY - Puppet freshness 
on chromium is OK: puppet ran at Mon Jul 1 22:53:18 UTC 2013 [22:53:28] RECOVERY - Puppet freshness on search35 is OK: puppet ran at Mon Jul 1 22:53:18 UTC 2013 [22:53:28] RECOVERY - Puppet freshness on mw1 is OK: puppet ran at Mon Jul 1 22:53:18 UTC 2013 [22:53:28] RECOVERY - Puppet freshness on analytics1017 is OK: puppet ran at Mon Jul 1 22:53:18 UTC 2013 [22:53:28] RECOVERY - Puppet freshness on cp1068 is OK: puppet ran at Mon Jul 1 22:53:23 UTC 2013 [22:53:29] RECOVERY - Puppet freshness on srv271 is OK: puppet ran at Mon Jul 1 22:53:23 UTC 2013 [22:53:29] RECOVERY - Puppet freshness on mw1026 is OK: puppet ran at Mon Jul 1 22:53:23 UTC 2013 [22:53:30] RECOVERY - Puppet freshness on mw1174 is OK: puppet ran at Mon Jul 1 22:53:23 UTC 2013 [22:53:30] RECOVERY - Puppet freshness on mw59 is OK: puppet ran at Mon Jul 1 22:53:23 UTC 2013 [22:53:31] RECOVERY - Puppet freshness on mw15 is OK: puppet ran at Mon Jul 1 22:53:23 UTC 2013 [22:53:31] RECOVERY - Puppet freshness on mw1110 is OK: puppet ran at Mon Jul 1 22:53:23 UTC 2013 [22:53:32] RECOVERY - Puppet freshness on mw1082 is OK: puppet ran at Mon Jul 1 22:53:23 UTC 2013 [22:53:38] RECOVERY - Puppet freshness on mw1059 is OK: puppet ran at Mon Jul 1 22:53:28 UTC 2013 [22:53:38] RECOVERY - Puppet freshness on mw52 is OK: puppet ran at Mon Jul 1 22:53:28 UTC 2013 [22:53:38] RECOVERY - Puppet freshness on mw1121 is OK: puppet ran at Mon Jul 1 22:53:28 UTC 2013 [22:53:38] RECOVERY - Puppet freshness on wtp1006 is OK: puppet ran at Mon Jul 1 22:53:33 UTC 2013 [22:53:38] RECOVERY - Puppet freshness on analytics1018 is OK: puppet ran at Mon Jul 1 22:53:33 UTC 2013 [22:53:38] RECOVERY - Puppet freshness on mw1086 is OK: puppet ran at Mon Jul 1 22:53:33 UTC 2013 [22:53:39] RECOVERY - Puppet freshness on sq77 is OK: puppet ran at Mon Jul 1 22:53:33 UTC 2013 [22:53:48] RECOVERY - Puppet freshness on ms1004 is OK: puppet ran at Mon Jul 1 22:53:39 UTC 2013 [22:53:48] RECOVERY - Puppet freshness on sq49 is OK: puppet ran at 
Mon Jul 1 22:53:39 UTC 2013 [22:53:48] RECOVERY - Puppet freshness on mc13 is OK: puppet ran at Mon Jul 1 22:53:44 UTC 2013 [22:53:48] RECOVERY - Puppet freshness on wtp1020 is OK: puppet ran at Mon Jul 1 22:53:44 UTC 2013 [22:53:58] RECOVERY - Puppet freshness on cp1039 is OK: puppet ran at Mon Jul 1 22:53:49 UTC 2013 [22:53:58] RECOVERY - Puppet freshness on cp3019 is OK: puppet ran at Mon Jul 1 22:53:49 UTC 2013 [22:53:58] RECOVERY - Puppet freshness on cp3021 is OK: puppet ran at Mon Jul 1 22:53:49 UTC 2013 [22:53:58] RECOVERY - Puppet freshness on virt8 is OK: puppet ran at Mon Jul 1 22:53:54 UTC 2013 [22:54:08] RECOVERY - Puppet freshness on wtp1016 is OK: puppet ran at Mon Jul 1 22:53:59 UTC 2013 [22:54:08] RECOVERY - Puppet freshness on mw91 is OK: puppet ran at Mon Jul 1 22:53:59 UTC 2013 [22:54:08] RECOVERY - Puppet freshness on emery is OK: puppet ran at Mon Jul 1 22:53:59 UTC 2013 [22:54:08] RECOVERY - Puppet freshness on analytics1020 is OK: puppet ran at Mon Jul 1 22:54:04 UTC 2013 [22:54:08] RECOVERY - Puppet freshness on mw1158 is OK: puppet ran at Mon Jul 1 22:54:04 UTC 2013 [22:54:08] RECOVERY - Puppet freshness on srv268 is OK: puppet ran at Mon Jul 1 22:54:04 UTC 2013 [22:54:09] RECOVERY - Puppet freshness on mw113 is OK: puppet ran at Mon Jul 1 22:54:04 UTC 2013 [22:54:09] RECOVERY - Puppet freshness on cp1001 is OK: puppet ran at Mon Jul 1 22:54:04 UTC 2013 [22:54:10] RECOVERY - Puppet freshness on srv243 is OK: puppet ran at Mon Jul 1 22:54:04 UTC 2013 [22:54:18] RECOVERY - Puppet freshness on mw92 is OK: puppet ran at Mon Jul 1 22:54:09 UTC 2013 [22:54:18] RECOVERY - Puppet freshness on es2 is OK: puppet ran at Mon Jul 1 22:54:09 UTC 2013 [22:54:18] RECOVERY - Puppet freshness on search23 is OK: puppet ran at Mon Jul 1 22:54:09 UTC 2013 [22:54:18] RECOVERY - Puppet freshness on mw1120 is OK: puppet ran at Mon Jul 1 22:54:14 UTC 2013 [22:54:18] RECOVERY - Puppet freshness on mw1009 is OK: puppet ran at Mon Jul 1 22:54:14 UTC 2013 [22:54:18] 
PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:54:25] greg-g, we're running late on the gender deploy.
[22:54:28] RECOVERY - Puppet freshness on mw7 is OK: puppet ran at Mon Jul 1 22:54:19 UTC 2013
[22:54:28] RECOVERY - Puppet freshness on mc1002 is OK: puppet ran at Mon Jul 1 22:54:19 UTC 2013
[22:54:28] RECOVERY - Puppet freshness on mw1199 is OK: puppet ran at Mon Jul 1 22:54:19 UTC 2013
[22:54:28] RECOVERY - Puppet freshness on srv293 is OK: puppet ran at Mon Jul 1 22:54:19 UTC 2013
[22:54:28] RECOVERY - Puppet freshness on virt1005 is OK: puppet ran at Mon Jul 1 22:54:19 UTC 2013
[22:54:28] RECOVERY - Puppet freshness on ms-be1010 is OK: puppet ran at Mon Jul 1 22:54:19 UTC 2013
[22:54:29] RECOVERY - Puppet freshness on mw1148 is OK: puppet ran at Mon Jul 1 22:54:19 UTC 2013
[22:54:29] RECOVERY - Puppet freshness on mw1060 is OK: puppet ran at Mon Jul 1 22:54:19 UTC 2013
[22:54:30] RECOVERY - Puppet freshness on mw1008 is OK: puppet ran at Mon Jul 1 22:54:24 UTC 2013
[22:54:30] RECOVERY - Puppet freshness on mw1108 is OK: puppet ran at Mon Jul 1 22:54:24 UTC 2013
[22:54:38] RECOVERY - Puppet freshness on sq66 is OK: puppet ran at Mon Jul 1 22:54:29 UTC 2013
[22:54:38] RECOVERY - Puppet freshness on sq75 is OK: puppet ran at Mon Jul 1 22:54:29 UTC 2013
[22:54:38] RECOVERY - Puppet freshness on srv251 is OK: puppet ran at Mon Jul 1 22:54:29 UTC 2013
[22:54:38] RECOVERY - Puppet freshness on mw3 is OK: puppet ran at Mon Jul 1 22:54:34 UTC 2013
[22:54:38] RECOVERY - Puppet freshness on db1046 is OK: puppet ran at Mon Jul 1 22:54:35 UTC 2013
[22:54:41] greg-g the main commit is merged, but we have a trivial commit to review, then the actual scap.
[22:54:45] Is it okay if we run over?
[22:54:48] RECOVERY - Puppet freshness on analytics1021 is OK: puppet ran at Mon Jul 1 22:54:40 UTC 2013
[22:54:48] RECOVERY - Puppet freshness on nitrogen is OK: puppet ran at Mon Jul 1 22:54:40 UTC 2013
[22:54:48] RECOVERY - Puppet freshness on cp1009 is OK: puppet ran at Mon Jul 1 22:54:40 UTC 2013
[22:54:48] RECOVERY - Puppet freshness on sq36 is OK: puppet ran at Mon Jul 1 22:54:40 UTC 2013
[22:54:48] RECOVERY - Puppet freshness on mw25 is OK: puppet ran at Mon Jul 1 22:54:40 UTC 2013
[22:54:48] RECOVERY - Puppet freshness on cp1008 is OK: puppet ran at Mon Jul 1 22:54:40 UTC 2013
[22:54:49] RECOVERY - Puppet freshness on hydrogen is OK: puppet ran at Mon Jul 1 22:54:45 UTC 2013
[22:54:49] RECOVERY - Puppet freshness on db1018 is OK: puppet ran at Mon Jul 1 22:54:45 UTC 2013
[22:54:50] RECOVERY - Puppet freshness on db1015 is OK: puppet ran at Mon Jul 1 22:54:45 UTC 2013
[22:54:50] RECOVERY - Puppet freshness on ms-fe1004 is OK: puppet ran at Mon Jul 1 22:54:45 UTC 2013
[22:54:58] RECOVERY - Puppet freshness on pdf2 is OK: puppet ran at Mon Jul 1 22:54:50 UTC 2013
[22:54:58] RECOVERY - Puppet freshness on db64 is OK: puppet ran at Mon Jul 1 22:54:50 UTC 2013
[22:54:58] RECOVERY - Puppet freshness on snapshot3 is OK: puppet ran at Mon Jul 1 22:54:55 UTC 2013
[22:54:58] RECOVERY - Puppet freshness on db1049 is OK: puppet ran at Mon Jul 1 22:54:55 UTC 2013
[22:54:58] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 1 22:54:55 UTC 2013
[22:55:08] RECOVERY - Puppet freshness on cp1064 is OK: puppet ran at Mon Jul 1 22:55:00 UTC 2013
[22:55:08] RECOVERY - Puppet freshness on holmium is OK: puppet ran at Mon Jul 1 22:55:01 UTC 2013
[22:55:08] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Mon Jul 1 22:55:01 UTC 2013
[22:55:08] RECOVERY - Puppet freshness on srv235 is OK: puppet ran at Mon Jul 1 22:55:01 UTC 2013
[22:55:13] icinga, shush, I'm trying to talk to superm401
[22:55:20] New patchset: Aude; "set propagateChangesToRepo Wikibase setting to false for test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71549
[22:55:32] superm401: yeah, you can go over, no one else is scheduled
[22:55:38] RECOVERY - Puppet freshness on srv257 is OK: puppet ran at Mon Jul 1 22:55:31 UTC 2013
[22:55:38] RECOVERY - Puppet freshness on srv274 is OK: puppet ran at Mon Jul 1 22:55:36 UTC 2013
[22:55:38] RECOVERY - Puppet freshness on mw1170 is OK: puppet ran at Mon Jul 1 22:55:36 UTC 2013
[22:55:38] RECOVERY - Puppet freshness on mw12 is OK: puppet ran at Mon Jul 1 22:55:36 UTC 2013
[22:55:48] RECOVERY - Puppet freshness on mw48 is OK: puppet ran at Mon Jul 1 22:55:46 UTC 2013
[22:55:48] RECOVERY - Puppet freshness on mw1085 is OK: puppet ran at Mon Jul 1 22:55:46 UTC 2013
[22:55:58] RECOVERY - Puppet freshness on mw1061 is OK: puppet ran at Mon Jul 1 22:55:51 UTC 2013
[22:55:58] RECOVERY - Puppet freshness on mw1070 is OK: puppet ran at Mon Jul 1 22:55:51 UTC 2013
[22:55:58] RECOVERY - Puppet freshness on sq56 is OK: puppet ran at Mon Jul 1 22:55:56 UTC 2013
[22:55:59] RECOVERY - Puppet freshness on db1028 is OK: puppet ran at Mon Jul 1 22:55:56 UTC 2013
[22:55:59] RECOVERY - Puppet freshness on ersch is OK: puppet ran at Mon Jul 1 22:55:56 UTC 2013
[22:55:59] RECOVERY - Puppet freshness on amssq60 is OK: puppet ran at Mon Jul 1 22:55:56 UTC 2013
[22:55:59] RECOVERY - Puppet freshness on mw1102 is OK: puppet ran at Mon Jul 1 22:55:56 UTC 2013
[22:55:59] RECOVERY - Puppet freshness on mw1058 is OK: puppet ran at Mon Jul 1 22:55:56 UTC 2013
[22:56:08] RECOVERY - Puppet freshness on db68 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013
[22:56:08] RECOVERY - Puppet freshness on amssq41 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013
[22:56:08] RECOVERY - Puppet freshness on db38 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013
[22:56:08] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013
[22:56:08] RECOVERY - Puppet freshness on mw1019 is OK: puppet
ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:08] RECOVERY - Puppet freshness on amssq36 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:09] RECOVERY - Puppet freshness on cp1057 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:09] RECOVERY - Puppet freshness on ms-be7 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:10] RECOVERY - Puppet freshness on mw1157 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:10] RECOVERY - Puppet freshness on mw115 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:11] RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:11] RECOVERY - Puppet freshness on mw1119 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:12] RECOVERY - Puppet freshness on mw56 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:12] RECOVERY - Puppet freshness on mc5 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:13] RECOVERY - Puppet freshness on spence is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:13] RECOVERY - Puppet freshness on wtp1014 is OK: puppet ran at Mon Jul 1 22:56:01 UTC 2013 [22:56:13] RECOVERY - Puppet freshness on mw114 is OK: puppet ran at Mon Jul 1 22:56:06 UTC 2013 [22:56:14] RECOVERY - Puppet freshness on ssl1004 is OK: puppet ran at Mon Jul 1 22:56:06 UTC 2013 [22:56:15] RECOVERY - Puppet freshness on analytics1006 is OK: puppet ran at Mon Jul 1 22:56:06 UTC 2013 [22:56:15] RECOVERY - Puppet freshness on mw120 is OK: puppet ran at Mon Jul 1 22:56:06 UTC 2013 [22:56:18] RECOVERY - Puppet freshness on search20 is OK: puppet ran at Mon Jul 1 22:56:12 UTC 2013 [22:56:18] RECOVERY - Puppet freshness on srv239 is OK: puppet ran at Mon Jul 1 22:56:12 UTC 2013 [22:56:18] RECOVERY - Puppet freshness on srv252 is OK: puppet ran at Mon Jul 1 22:56:12 UTC 2013 [22:56:18] RECOVERY - Puppet freshness on mw67 is OK: puppet ran at Mon Jul 1 22:56:12 UTC 2013 [22:56:18] RECOVERY - Puppet freshness on search1009 is OK: puppet ran at Mon Jul 1 22:56:12 
UTC 2013 [22:56:18] RECOVERY - Puppet freshness on mw96 is OK: puppet ran at Mon Jul 1 22:56:17 UTC 2013 [22:56:19] RECOVERY - Puppet freshness on srv247 is OK: puppet ran at Mon Jul 1 22:56:17 UTC 2013 [22:56:19] RECOVERY - Puppet freshness on mw78 is OK: puppet ran at Mon Jul 1 22:56:17 UTC 2013 [22:56:20] RECOVERY - Puppet freshness on search1013 is OK: puppet ran at Mon Jul 1 22:56:17 UTC 2013 [22:56:28] RECOVERY - Puppet freshness on mw1184 is OK: puppet ran at Mon Jul 1 22:56:22 UTC 2013 [22:56:28] RECOVERY - Puppet freshness on mw1177 is OK: puppet ran at Mon Jul 1 22:56:22 UTC 2013 [22:56:28] RECOVERY - Puppet freshness on mw1172 is OK: puppet ran at Mon Jul 1 22:56:22 UTC 2013 [22:56:28] RECOVERY - Puppet freshness on mw1191 is OK: puppet ran at Mon Jul 1 22:56:22 UTC 2013 [22:56:28] RECOVERY - Puppet freshness on amssq50 is OK: puppet ran at Mon Jul 1 22:56:22 UTC 2013 [22:56:28] RECOVERY - Puppet freshness on sq61 is OK: puppet ran at Mon Jul 1 22:56:27 UTC 2013 [22:56:29] RECOVERY - Puppet freshness on amssq58 is OK: puppet ran at Mon Jul 1 22:56:27 UTC 2013 [22:56:29] RECOVERY - Puppet freshness on tmh1002 is OK: puppet ran at Mon Jul 1 22:56:27 UTC 2013 [22:56:30] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Mon Jul 1 22:56:27 UTC 2013 [22:56:38] RECOVERY - Puppet freshness on mw1039 is OK: puppet ran at Mon Jul 1 22:56:37 UTC 2013 [22:56:38] RECOVERY - Puppet freshness on nfs2 is OK: puppet ran at Mon Jul 1 22:56:37 UTC 2013 [22:56:38] RECOVERY - Puppet freshness on wtp1008 is OK: puppet ran at Mon Jul 1 22:56:37 UTC 2013 [22:56:38] RECOVERY - Puppet freshness on hooper is OK: puppet ran at Mon Jul 1 22:56:37 UTC 2013 [22:56:48] RECOVERY - Puppet freshness on cp1019 is OK: puppet ran at Mon Jul 1 22:56:37 UTC 2013 [22:56:48] RECOVERY - Puppet freshness on db1003 is OK: puppet ran at Mon Jul 1 22:56:42 UTC 2013 [22:56:58] RECOVERY - Puppet freshness on srv298 is OK: puppet ran at Mon Jul 1 22:56:47 UTC 2013 [22:56:58] RECOVERY - Puppet 
freshness on db37 is OK: puppet ran at Mon Jul 1 22:56:48 UTC 2013 [22:56:58] RECOVERY - Puppet freshness on ms-fe4 is OK: puppet ran at Mon Jul 1 22:56:48 UTC 2013 [22:56:58] RECOVERY - Puppet freshness on es7 is OK: puppet ran at Mon Jul 1 22:56:53 UTC 2013 [22:56:58] RECOVERY - Puppet freshness on db33 is OK: puppet ran at Mon Jul 1 22:56:53 UTC 2013 [22:56:58] RECOVERY - Puppet freshness on virt10 is OK: puppet ran at Mon Jul 1 22:56:53 UTC 2013 [22:56:59] RECOVERY - Puppet freshness on es1007 is OK: puppet ran at Mon Jul 1 22:56:53 UTC 2013 [22:56:59] RECOVERY - Puppet freshness on db1043 is OK: puppet ran at Mon Jul 1 22:56:53 UTC 2013 [22:57:00] RECOVERY - Puppet freshness on silver is OK: puppet ran at Mon Jul 1 22:56:53 UTC 2013 [22:57:08] RECOVERY - Puppet freshness on wtp1005 is OK: puppet ran at Mon Jul 1 22:56:58 UTC 2013 [22:57:08] RECOVERY - Puppet freshness on sq57 is OK: puppet ran at Mon Jul 1 22:56:58 UTC 2013 [22:57:08] RECOVERY - Puppet freshness on mw1073 is OK: puppet ran at Mon Jul 1 22:57:03 UTC 2013 [22:57:08] RECOVERY - Puppet freshness on db1041 is OK: puppet ran at Mon Jul 1 22:57:03 UTC 2013 [22:57:08] RECOVERY - Puppet freshness on mw63 is OK: puppet ran at Mon Jul 1 22:57:03 UTC 2013 [22:57:08] RECOVERY - Puppet freshness on mw84 is OK: puppet ran at Mon Jul 1 22:57:03 UTC 2013 [22:57:18] RECOVERY - Puppet freshness on search26 is OK: puppet ran at Mon Jul 1 22:57:08 UTC 2013 [22:57:18] RECOVERY - Puppet freshness on mw17 is OK: puppet ran at Mon Jul 1 22:57:13 UTC 2013 [22:57:38] RECOVERY - Puppet freshness on harmon is OK: puppet ran at Mon Jul 1 22:57:33 UTC 2013 [22:57:48] RECOVERY - Puppet freshness on searchidx2 is OK: puppet ran at Mon Jul 1 22:57:38 UTC 2013 [22:57:48] RECOVERY - Puppet freshness on snapshot1001 is OK: puppet ran at Mon Jul 1 22:57:43 UTC 2013 [22:57:48] RECOVERY - Puppet freshness on wtp1012 is OK: puppet ran at Mon Jul 1 22:57:43 UTC 2013 [22:57:58] RECOVERY - Puppet freshness on mc1010 is OK: puppet ran at 
Mon Jul 1 22:57:48 UTC 2013
[22:57:58] RECOVERY - Puppet freshness on lvs4 is OK: puppet ran at Mon Jul 1 22:57:48 UTC 2013
[22:57:58] RECOVERY - Puppet freshness on mc1014 is OK: puppet ran at Mon Jul 1 22:57:48 UTC 2013
[22:57:58] RECOVERY - Puppet freshness on sq44 is OK: puppet ran at Mon Jul 1 22:57:48 UTC 2013
[22:57:58] RECOVERY - Puppet freshness on db1039 is OK: puppet ran at Mon Jul 1 22:57:48 UTC 2013
[22:58:31] greg-g: there you go, it flooded out
[22:58:41] aaaaaaaahhhhhhhh, silence, it is golden
[22:59:08] RECOVERY - Puppet freshness on rdb1001 is OK: puppet ran at Mon Jul 1 22:59:00 UTC 2013
[22:59:08] RECOVERY - Puppet freshness on mw4 is OK: puppet ran at Mon Jul 1 22:59:00 UTC 2013
[22:59:08] RECOVERY - Puppet freshness on search32 is OK: puppet ran at Mon Jul 1 22:59:00 UTC 2013
[22:59:08] RECOVERY - Puppet freshness on hooft is OK: puppet ran at Mon Jul 1 22:59:00 UTC 2013
[22:59:08] RECOVERY - Puppet freshness on virt1007 is OK: puppet ran at Mon Jul 1 22:59:01 UTC 2013
[22:59:08] RECOVERY - Puppet freshness on formey is OK: puppet ran at Mon Jul 1 22:59:01 UTC 2013
[22:59:08] RECOVERY - Puppet freshness on oxygen is OK: puppet ran at Mon Jul 1 22:59:06 UTC 2013
[22:59:09] RECOVERY - Puppet freshness on mw1034 is OK: puppet ran at Mon Jul 1 22:59:06 UTC 2013
[22:59:10] RECOVERY - Puppet freshness on db1030 is OK: puppet ran at Mon Jul 1 22:59:06 UTC 2013
[22:59:10] RECOVERY - Puppet freshness on mw1156 is OK: puppet ran at Mon Jul 1 22:59:06 UTC 2013
[22:59:48] RECOVERY - Puppet freshness on mw123 is OK: puppet ran at Mon Jul 1 22:59:41 UTC 2013
[22:59:48] RECOVERY - Puppet freshness on srv263 is OK: puppet ran at Mon Jul 1 22:59:41 UTC 2013
[22:59:48] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 1 22:59:41 UTC 2013
[22:59:48] RECOVERY - Puppet freshness on db1053 is OK: puppet ran at Mon Jul 1 22:59:41 UTC 2013
[22:59:48] RECOVERY - Puppet freshness on cp1055 is OK: puppet ran at Mon Jul 1 22:59:41 UTC 2013
[22:59:48]
RECOVERY - Puppet freshness on mw1183 is OK: puppet ran at Mon Jul 1 22:59:41 UTC 2013 [22:59:49] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Mon Jul 1 22:59:46 UTC 2013 [22:59:49] RECOVERY - Puppet freshness on db78 is OK: puppet ran at Mon Jul 1 22:59:46 UTC 2013 [22:59:49] RECOVERY - Puppet freshness on amssq62 is OK: puppet ran at Mon Jul 1 22:59:46 UTC 2013 [22:59:50] RECOVERY - Puppet freshness on amssq46 is OK: puppet ran at Mon Jul 1 22:59:46 UTC 2013 [22:59:51] RECOVERY - Puppet freshness on db1059 is OK: puppet ran at Mon Jul 1 22:59:46 UTC 2013 [22:59:51] RECOVERY - Puppet freshness on wtp1022 is OK: puppet ran at Mon Jul 1 22:59:46 UTC 2013 [22:59:52] RECOVERY - Puppet freshness on mw1050 is OK: puppet ran at Mon Jul 1 22:59:46 UTC 2013 [22:59:52] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Jul 1 22:59:46 UTC 2013 [22:59:53] RECOVERY - Puppet freshness on wtp1023 is OK: puppet ran at Mon Jul 1 22:59:46 UTC 2013 [22:59:58] RECOVERY - Puppet freshness on cp1033 is OK: puppet ran at Mon Jul 1 22:59:51 UTC 2013 [22:59:58] RECOVERY - Puppet freshness on mw102 is OK: puppet ran at Mon Jul 1 22:59:56 UTC 2013 [22:59:58] RECOVERY - Puppet freshness on srv279 is OK: puppet ran at Mon Jul 1 22:59:56 UTC 2013 [22:59:59] RECOVERY - Puppet freshness on mw1212 is OK: puppet ran at Mon Jul 1 22:59:56 UTC 2013 [22:59:59] RECOVERY - Puppet freshness on srv241 is OK: puppet ran at Mon Jul 1 22:59:56 UTC 2013 [23:00:08] RECOVERY - Puppet freshness on srv288 is OK: puppet ran at Mon Jul 1 23:00:01 UTC 2013 [23:00:08] RECOVERY - Puppet freshness on mw44 is OK: puppet ran at Mon Jul 1 23:00:01 UTC 2013 [23:00:08] RECOVERY - Puppet freshness on snapshot1004 is OK: puppet ran at Mon Jul 1 23:00:02 UTC 2013 [23:00:08] RECOVERY - Puppet freshness on search1017 is OK: puppet ran at Mon Jul 1 23:00:02 UTC 2013 [23:00:08] RECOVERY - Puppet freshness on search1016 is OK: puppet ran at Mon Jul 1 23:00:02 UTC 2013 [23:00:08] RECOVERY - Puppet 
freshness on mw27 is OK: puppet ran at Mon Jul 1 23:00:02 UTC 2013 [23:00:09] RECOVERY - Puppet freshness on mw1077 is OK: puppet ran at Mon Jul 1 23:00:02 UTC 2013 [23:00:09] RECOVERY - Puppet freshness on mw1163 is OK: puppet ran at Mon Jul 1 23:00:02 UTC 2013 [23:00:10] RECOVERY - Puppet freshness on mw1029 is OK: puppet ran at Mon Jul 1 23:00:02 UTC 2013 [23:00:10] RECOVERY - Puppet freshness on mw1097 is OK: puppet ran at Mon Jul 1 23:00:07 UTC 2013 [23:00:10] RECOVERY - Puppet freshness on mw1002 is OK: puppet ran at Mon Jul 1 23:00:07 UTC 2013 [23:00:38] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:38] RECOVERY - Puppet freshness on mw1023 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:38] RECOVERY - Puppet freshness on mw1096 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:38] RECOVERY - Puppet freshness on cp1052 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:38] RECOVERY - Puppet freshness on es8 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:39] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:39] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:39] RECOVERY - Puppet freshness on db1008 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:40] RECOVERY - Puppet freshness on ms1001 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:40] RECOVERY - Puppet freshness on db47 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:40] RECOVERY - Puppet freshness on virt5 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:41] RECOVERY - Puppet freshness on rdb1004 is OK: puppet ran at Mon Jul 1 23:00:37 UTC 2013 [23:00:48] RECOVERY - Puppet freshness on mw6 is OK: puppet ran at Mon Jul 1 23:00:42 UTC 2013 [23:00:48] RECOVERY - Puppet freshness on db62 is OK: puppet ran at Mon Jul 1 23:00:42 UTC 2013 [23:00:58] RECOVERY - Puppet freshness on db1047 is OK: puppet ran at Mon Jul 1 
23:00:47 UTC 2013 [23:00:58] RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Mon Jul 1 23:00:47 UTC 2013 [23:00:58] RECOVERY - Puppet freshness on locke is OK: puppet ran at Mon Jul 1 23:00:47 UTC 2013 [23:00:58] RECOVERY - Puppet freshness on pc1001 is OK: puppet ran at Mon Jul 1 23:00:47 UTC 2013 [23:00:58] RECOVERY - Puppet freshness on cp1059 is OK: puppet ran at Mon Jul 1 23:00:47 UTC 2013 [23:00:58] RECOVERY - Puppet freshness on es1010 is OK: puppet ran at Mon Jul 1 23:00:47 UTC 2013 [23:00:59] RECOVERY - Puppet freshness on cp1029 is OK: puppet ran at Mon Jul 1 23:00:53 UTC 2013 [23:00:59] RECOVERY - Puppet freshness on cp1031 is OK: puppet ran at Mon Jul 1 23:00:53 UTC 2013 [23:01:00] RECOVERY - Puppet freshness on amssq38 is OK: puppet ran at Mon Jul 1 23:00:53 UTC 2013 [23:01:00] RECOVERY - Puppet freshness on srv281 is OK: puppet ran at Mon Jul 1 23:00:53 UTC 2013 [23:01:08] RECOVERY - Puppet freshness on db1038 is OK: puppet ran at Mon Jul 1 23:00:58 UTC 2013 [23:01:08] RECOVERY - Puppet freshness on mw1067 is OK: puppet ran at Mon Jul 1 23:01:03 UTC 2013 [23:01:08] RECOVERY - Puppet freshness on mw1152 is OK: puppet ran at Mon Jul 1 23:01:03 UTC 2013 [23:01:08] RECOVERY - Puppet freshness on wtp1017 is OK: puppet ran at Mon Jul 1 23:01:03 UTC 2013 [23:01:08] RECOVERY - Puppet freshness on nescio is OK: puppet ran at Mon Jul 1 23:01:03 UTC 2013 [23:01:08] RECOVERY - Puppet freshness on pc3 is OK: puppet ran at Mon Jul 1 23:01:03 UTC 2013 [23:01:18] RECOVERY - Puppet freshness on db35 is OK: puppet ran at Mon Jul 1 23:01:08 UTC 2013 [23:01:18] RECOVERY - Puppet freshness on search1020 is OK: puppet ran at Mon Jul 1 23:01:08 UTC 2013 [23:01:18] RECOVERY - Puppet freshness on mw1005 is OK: puppet ran at Mon Jul 1 23:01:08 UTC 2013 [23:01:28] RECOVERY - Puppet freshness on mw1028 is OK: puppet ran at Mon Jul 1 23:01:18 UTC 2013 [23:01:58] RECOVERY - Puppet freshness on wtp1019 is OK: puppet ran at Mon Jul 1 23:01:48 UTC 2013 [23:01:58] 
RECOVERY - Puppet freshness on cp1022 is OK: puppet ran at Mon Jul 1 23:01:53 UTC 2013 [23:02:08] RECOVERY - Puppet freshness on cp1014 is OK: puppet ran at Mon Jul 1 23:01:58 UTC 2013 [23:02:08] RECOVERY - Puppet freshness on db71 is OK: puppet ran at Mon Jul 1 23:01:58 UTC 2013 [23:02:08] RECOVERY - Puppet freshness on ms-fe1 is OK: puppet ran at Mon Jul 1 23:01:58 UTC 2013 [23:02:08] RECOVERY - Puppet freshness on mc9 is OK: puppet ran at Mon Jul 1 23:01:58 UTC 2013 [23:02:08] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Mon Jul 1 23:01:58 UTC 2013 [23:03:18] RECOVERY - Puppet freshness on srv295 is OK: puppet ran at Mon Jul 1 23:03:10 UTC 2013 [23:03:18] RECOVERY - Puppet freshness on cp3004 is OK: puppet ran at Mon Jul 1 23:03:10 UTC 2013 [23:03:18] RECOVERY - Puppet freshness on mw1113 is OK: puppet ran at Mon Jul 1 23:03:15 UTC 2013 [23:03:18] RECOVERY - Puppet freshness on cp1045 is OK: puppet ran at Mon Jul 1 23:03:15 UTC 2013 [23:03:28] RECOVERY - Puppet freshness on mw1160 is OK: puppet ran at Mon Jul 1 23:03:20 UTC 2013 [23:03:28] RECOVERY - Puppet freshness on cp1047 is OK: puppet ran at Mon Jul 1 23:03:20 UTC 2013 [23:03:28] RECOVERY - Puppet freshness on maerlant is OK: puppet ran at Mon Jul 1 23:03:20 UTC 2013 [23:03:28] RECOVERY - Puppet freshness on search28 is OK: puppet ran at Mon Jul 1 23:03:25 UTC 2013 [23:03:28] RECOVERY - Puppet freshness on mw94 is OK: puppet ran at Mon Jul 1 23:03:25 UTC 2013 [23:03:29] RECOVERY - Puppet freshness on mw105 is OK: puppet ran at Mon Jul 1 23:03:25 UTC 2013 [23:03:38] RECOVERY - Puppet freshness on mw101 is OK: puppet ran at Mon Jul 1 23:03:30 UTC 2013 [23:03:38] RECOVERY - Puppet freshness on mw108 is OK: puppet ran at Mon Jul 1 23:03:30 UTC 2013 [23:03:38] RECOVERY - Puppet freshness on virt6 is OK: puppet ran at Mon Jul 1 23:03:30 UTC 2013 [23:03:38] RECOVERY - Puppet freshness on mw1187 is OK: puppet ran at Mon Jul 1 23:03:30 UTC 2013 [23:03:38] RECOVERY - Puppet freshness on mw1207 
is OK: puppet ran at Mon Jul 1 23:03:30 UTC 2013 [23:03:39] RECOVERY - Puppet freshness on mw1037 is OK: puppet ran at Mon Jul 1 23:03:35 UTC 2013 [23:03:48] RECOVERY - Puppet freshness on mc1015 is OK: puppet ran at Mon Jul 1 23:03:40 UTC 2013 [23:03:48] RECOVERY - Puppet freshness on pc2 is OK: puppet ran at Mon Jul 1 23:03:40 UTC 2013 [23:03:48] RECOVERY - Puppet freshness on sq67 is OK: puppet ran at Mon Jul 1 23:03:45 UTC 2013 [23:03:48] RECOVERY - Puppet freshness on cp3003 is OK: puppet ran at Mon Jul 1 23:03:45 UTC 2013 [23:03:48] RECOVERY - Puppet freshness on db52 is OK: puppet ran at Mon Jul 1 23:03:45 UTC 2013 [23:03:49] RECOVERY - Puppet freshness on mc1003 is OK: puppet ran at Mon Jul 1 23:03:45 UTC 2013 [23:03:58] RECOVERY - Puppet freshness on stafford is OK: puppet ran at Mon Jul 1 23:03:55 UTC 2013 [23:04:08] RECOVERY - Puppet freshness on srv287 is OK: puppet ran at Mon Jul 1 23:04:00 UTC 2013 [23:04:08] RECOVERY - Puppet freshness on mw16 is OK: puppet ran at Mon Jul 1 23:04:00 UTC 2013 [23:04:08] RECOVERY - Puppet freshness on srv300 is OK: puppet ran at Mon Jul 1 23:04:05 UTC 2013 [23:04:08] RECOVERY - Puppet freshness on mw51 is OK: puppet ran at Mon Jul 1 23:04:05 UTC 2013 [23:04:08] RECOVERY - Puppet freshness on mw1217 is OK: puppet ran at Mon Jul 1 23:04:05 UTC 2013 [23:04:09] RECOVERY - Puppet freshness on search13 is OK: puppet ran at Mon Jul 1 23:04:05 UTC 2013 [23:04:09] RECOVERY - Puppet freshness on analytics1012 is OK: puppet ran at Mon Jul 1 23:04:05 UTC 2013 [23:04:10] RECOVERY - Puppet freshness on tmh2 is OK: puppet ran at Mon Jul 1 23:04:05 UTC 2013 [23:04:10] RECOVERY - Puppet freshness on mw1164 is OK: puppet ran at Mon Jul 1 23:04:05 UTC 2013 [23:04:58] RECOVERY - Puppet freshness on mw1176 is OK: puppet ran at Mon Jul 1 23:04:50 UTC 2013 [23:04:58] RECOVERY - Puppet freshness on labsdb1002 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:04:58] RECOVERY - Puppet freshness on mw1099 is OK: puppet ran at Mon Jul 1 
23:04:56 UTC 2013 [23:04:58] RECOVERY - Puppet freshness on mw1088 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:04:58] RECOVERY - Puppet freshness on sq86 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:04:59] RECOVERY - Puppet freshness on sq83 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:04:59] RECOVERY - Puppet freshness on db1054 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:05:00] RECOVERY - Puppet freshness on yvon is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:05:00] RECOVERY - Puppet freshness on cerium is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:05:01] RECOVERY - Puppet freshness on es1 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:05:01] RECOVERY - Puppet freshness on labsdb1003 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:05:02] RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:05:02] RECOVERY - Puppet freshness on mw1100 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:05:03] RECOVERY - Puppet freshness on mw1117 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:05:03] RECOVERY - Puppet freshness on ssl2 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:05:04] RECOVERY - Puppet freshness on search15 is OK: puppet ran at Mon Jul 1 23:04:56 UTC 2013 [23:05:08] RECOVERY - Puppet freshness on mw34 is OK: puppet ran at Mon Jul 1 23:05:01 UTC 2013 [23:05:08] RECOVERY - Puppet freshness on srv238 is OK: puppet ran at Mon Jul 1 23:05:01 UTC 2013 [23:05:08] RECOVERY - Puppet freshness on mw122 is OK: puppet ran at Mon Jul 1 23:05:01 UTC 2013 [23:05:08] RECOVERY - Puppet freshness on mw66 is OK: puppet ran at Mon Jul 1 23:05:01 UTC 2013 [23:05:08] RECOVERY - Puppet freshness on srv297 is OK: puppet ran at Mon Jul 1 23:05:01 UTC 2013 [23:05:09] RECOVERY - Puppet freshness on mw82 is OK: puppet ran at Mon Jul 1 23:05:01 UTC 2013 [23:05:09] RECOVERY - Puppet freshness on mw46 is OK: puppet ran at Mon Jul 1 23:05:01 UTC 2013 [23:05:17] RECOVERY - Puppet 
freshness on search1008 is OK: puppet ran at Mon Jul 1 23:05:01 UTC 2013
[23:05:17] RECOVERY - Puppet freshness on es1001 is OK: puppet ran at Mon Jul 1 23:05:06 UTC 2013
[23:05:17] RECOVERY - Puppet freshness on search1019 is OK: puppet ran at Mon Jul 1 23:05:06 UTC 2013
[23:05:17] RECOVERY - Puppet freshness on mw1015 is OK: puppet ran at Mon Jul 1 23:05:06 UTC 2013
[23:05:18] RECOVERY - Puppet freshness on cp1061 is OK: puppet ran at Mon Jul 1 23:05:11 UTC 2013
[23:05:18] RECOVERY - Puppet freshness on cp1056 is OK: puppet ran at Mon Jul 1 23:05:16 UTC 2013
[23:05:18] RECOVERY - Puppet freshness on db40 is OK: puppet ran at Mon Jul 1 23:05:16 UTC 2013
[23:05:28] RECOVERY - Puppet freshness on search1018 is OK: puppet ran at Mon Jul 1 23:05:21 UTC 2013
[23:05:28] RECOVERY - Puppet freshness on williams is OK: puppet ran at Mon Jul 1 23:05:26 UTC 2013
[23:05:38] RECOVERY - Puppet freshness on mw1052 is OK: puppet ran at Mon Jul 1 23:05:32 UTC 2013
[23:05:38] RECOVERY - Puppet freshness on sq65 is OK: puppet ran at Mon Jul 1 23:05:32 UTC 2013
[23:05:48] RECOVERY - Puppet freshness on cp1020 is OK: puppet ran at Mon Jul 1 23:05:42 UTC 2013
[23:05:48] RECOVERY - Puppet freshness on neon is OK: puppet ran at Mon Jul 1 23:05:47 UTC 2013
[23:05:48] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Mon Jul 1 23:05:47 UTC 2013
[23:05:58] RECOVERY - Puppet freshness on es1005 is OK: puppet ran at Mon Jul 1 23:05:52 UTC 2013
[23:06:08] RECOVERY - Puppet freshness on mw1101 is OK: puppet ran at Mon Jul 1 23:06:02 UTC 2013
[23:06:08] RECOVERY - Puppet freshness on mw1123 is OK: puppet ran at Mon Jul 1 23:06:02 UTC 2013
[23:06:08] RECOVERY - Puppet freshness on cp1028 is OK: puppet ran at Mon Jul 1 23:06:02 UTC 2013
[23:06:08] RECOVERY - Puppet freshness on ms-be1 is OK: puppet ran at Mon Jul 1 23:06:02 UTC 2013
[23:06:08] RECOVERY - Puppet freshness on cp3007 is OK: puppet ran at Mon Jul 1 23:06:02 UTC 2013
[23:06:09] RECOVERY - Puppet freshness on mw1042 is OK: puppet ran at Mon Jul 1 23:06:07 UTC 2013
[23:06:28] RECOVERY - Puppet freshness on pdf3 is OK: puppet ran at Mon Jul 1 23:06:22 UTC 2013
[23:06:38] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Mon Jul 1 23:06:32 UTC 2013
[23:06:48] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[23:06:58] RECOVERY - Puppet freshness on amssq57 is OK: puppet ran at Mon Jul 1 23:06:52 UTC 2013
[23:06:58] RECOVERY - Puppet freshness on db1052 is OK: puppet ran at Mon Jul 1 23:06:52 UTC 2013
[23:06:58] RECOVERY - Puppet freshness on gurvin is OK: puppet ran at Mon Jul 1 23:06:52 UTC 2013
[23:06:58] RECOVERY - Puppet freshness on sq72 is OK: puppet ran at Mon Jul 1 23:06:52 UTC 2013
[23:06:59] RECOVERY - Puppet freshness on search22 is OK: puppet ran at Mon Jul 1 23:06:57 UTC 2013
[23:06:59] RECOVERY - Puppet freshness on search17 is OK: puppet ran at Mon Jul 1 23:06:57 UTC 2013
[23:06:59] RECOVERY - Puppet freshness on db1019 is OK: puppet ran at Mon Jul 1 23:06:57 UTC 2013
[23:07:08] RECOVERY - Puppet freshness on es4 is OK: puppet ran at Mon Jul 1 23:06:57 UTC 2013
[23:07:08] RECOVERY - Puppet freshness on cp1012 is OK: puppet ran at Mon Jul 1 23:06:57 UTC 2013
[23:07:08] RECOVERY - Puppet freshness on tin is OK: puppet ran at Mon Jul 1 23:06:57 UTC 2013
[23:07:08] RECOVERY - Puppet freshness on cp1003 is OK: puppet ran at Mon Jul 1 23:06:57 UTC 2013
[23:07:08] RECOVERY - Puppet freshness on es1003 is OK: puppet ran at Mon Jul 1 23:06:57 UTC 2013
[23:08:18] RECOVERY - Puppet freshness on cp1046 is OK: puppet ran at Mon Jul 1 23:08:10 UTC 2013
[23:08:18] RECOVERY - Puppet freshness on cp1050 is OK: puppet ran at Mon Jul 1 23:08:10 UTC 2013
[23:08:18] RECOVERY - Puppet freshness on mw1018 is OK: puppet ran at Mon Jul 1 23:08:10 UTC 2013
[23:08:18] RECOVERY - Puppet freshness on mw5 is OK: puppet ran at Mon Jul 1 23:08:10 UTC 2013
[23:08:18] RECOVERY - Puppet freshness on mw1111 is OK: puppet ran at Mon Jul 1 23:08:15 UTC 2013
[23:08:18] RECOVERY - Puppet freshness on mw1044 is OK: puppet ran at Mon Jul 1 23:08:15 UTC 2013
[23:08:28] RECOVERY - Puppet freshness on mw118 is OK: puppet ran at Mon Jul 1 23:08:20 UTC 2013
[23:08:28] RECOVERY - Puppet freshness on mw1124 is OK: puppet ran at Mon Jul 1 23:08:20 UTC 2013
[23:08:28] RECOVERY - Puppet freshness on gadolinium is OK: puppet ran at Mon Jul 1 23:08:20 UTC 2013
[23:08:28] RECOVERY - Puppet freshness on analytics1015 is OK: puppet ran at Mon Jul 1 23:08:20 UTC 2013
[23:08:28] RECOVERY - Puppet freshness on mw1147 is OK: puppet ran at Mon Jul 1 23:08:25 UTC 2013
[23:08:28] RECOVERY - Puppet freshness on srv301 is OK: puppet ran at Mon Jul 1 23:08:25 UTC 2013
[23:08:28] RECOVERY - Puppet freshness on mw1190 is OK: puppet ran at Mon Jul 1 23:08:25 UTC 2013
[23:08:29] RECOVERY - Puppet freshness on mw1125 is OK: puppet ran at Mon Jul 1 23:08:25 UTC 2013
[23:08:29] RECOVERY - Puppet freshness on sq51 is OK: puppet ran at Mon Jul 1 23:08:25 UTC 2013
[23:08:30] RECOVERY - Puppet freshness on mw1130 is OK: puppet ran at Mon Jul 1 23:08:25 UTC 2013
[23:08:38] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[23:08:38] RECOVERY - Puppet freshness on cp1007 is OK: puppet ran at Mon Jul 1 23:08:30 UTC 2013
[23:08:38] RECOVERY - Puppet freshness on amssq44 is OK: puppet ran at Mon Jul 1 23:08:35 UTC 2013
[23:08:38] RECOVERY - Puppet freshness on ms-fe3001 is OK: puppet ran at Mon Jul 1 23:08:35 UTC 2013
[23:08:48] RECOVERY - Puppet freshness on db1023 is OK: puppet ran at Mon Jul 1 23:08:45 UTC 2013
[23:08:48] RECOVERY - Puppet freshness on solr2 is OK: puppet ran at Mon Jul 1 23:08:45 UTC 2013
[23:08:48] RECOVERY - Puppet freshness on ms-fe3 is OK: puppet ran at Mon Jul 1 23:08:45 UTC 2013
[23:08:48] RECOVERY - Puppet freshness on db45 is OK: puppet ran at Mon Jul 1 23:08:45 UTC 2013
[23:08:48] RECOVERY - Puppet freshness on ms-be2 is OK: puppet ran at Mon Jul 1 23:08:45 UTC 2013
[23:08:58] RECOVERY - Puppet freshness on db43 is OK: puppet ran at Mon Jul 1 23:08:51 UTC 2013
[23:08:58] RECOVERY - Puppet freshness on ms-be1012 is OK: puppet ran at Mon Jul 1 23:08:51 UTC 2013
[23:08:58] RECOVERY - Puppet freshness on db1020 is OK: puppet ran at Mon Jul 1 23:08:51 UTC 2013
[23:08:58] RECOVERY - Puppet freshness on wtp1003 is OK: puppet ran at Mon Jul 1 23:08:51 UTC 2013
[23:08:58] RECOVERY - Puppet freshness on vanadium is OK: puppet ran at Mon Jul 1 23:08:51 UTC 2013
[23:08:58] RECOVERY - Puppet freshness on wtp1018 is OK: puppet ran at Mon Jul 1 23:08:51 UTC 2013
[23:08:59] RECOVERY - Puppet freshness on cp1063 is OK: puppet ran at Mon Jul 1 23:08:56 UTC 2013
[23:08:59] RECOVERY - Puppet freshness on db1004 is OK: puppet ran at Mon Jul 1 23:08:56 UTC 2013
[23:08:59] RECOVERY - Puppet freshness on srv249 is OK: puppet ran at Mon Jul 1 23:08:56 UTC 2013
[23:09:00] RECOVERY - Puppet freshness on wtp1011 is OK: puppet ran at Mon Jul 1 23:08:56 UTC 2013
[23:09:01] RECOVERY - Puppet freshness on analytics1023 is OK: puppet ran at Mon Jul 1 23:08:56 UTC 2013
[23:09:01] RECOVERY - Puppet freshness on srv244 is OK: puppet ran at Mon Jul 1 23:08:56 UTC 2013
[23:09:08] RECOVERY - Puppet freshness on srv253 is OK: puppet ran at Mon Jul 1 23:09:01 UTC 2013
[23:09:08] RECOVERY - Puppet freshness on search19 is OK: puppet ran at Mon Jul 1 23:09:01 UTC 2013
[23:09:08] RECOVERY - Puppet freshness on srv246 is OK: puppet ran at Mon Jul 1 23:09:01 UTC 2013
[23:09:08] RECOVERY - Puppet freshness on srv289 is OK: puppet ran at Mon Jul 1 23:09:01 UTC 2013
[23:09:08] RECOVERY - Puppet freshness on mw30 is OK: puppet ran at Mon Jul 1 23:09:01 UTC 2013
[23:09:08] RECOVERY - Puppet freshness on mw1146 is OK: puppet ran at Mon Jul 1 23:09:06 UTC 2013
[23:09:18] RECOVERY - Puppet freshness on mw1057 is OK: puppet ran at Mon Jul 1 23:09:16 UTC 2013
[23:09:38] RECOVERY - Puppet freshness on mw1004 is OK: puppet ran at Mon Jul 1 23:09:36 UTC 2013
[23:09:38] RECOVERY - Puppet freshness on mw1038 is OK: puppet ran at Mon Jul 1 23:09:36 UTC 2013
[23:09:48] RECOVERY - Puppet freshness on mw1115 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:48] RECOVERY - Puppet freshness on mw76 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:48] RECOVERY - Puppet freshness on sq74 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:48] RECOVERY - Puppet freshness on es6 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:48] RECOVERY - Puppet freshness on sq48 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:48] RECOVERY - Puppet freshness on sq43 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:48] RECOVERY - Puppet freshness on sq85 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:49] RECOVERY - Puppet freshness on mw1056 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:50] RECOVERY - Puppet freshness on solr1003 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:51] RECOVERY - Puppet freshness on mw1089 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:51] RECOVERY - Puppet freshness on solr3 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:51] RECOVERY - Puppet freshness on db58 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:51] RECOVERY - Puppet freshness on sq42 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:52] RECOVERY - Puppet freshness on mw1081 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:53] RECOVERY - Puppet freshness on mw1159 is OK: puppet ran at Mon Jul 1 23:09:41 UTC 2013
[23:09:53] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Mon Jul 1 23:09:42 UTC 2013
[23:09:54] RECOVERY - Puppet freshness on snapshot1 is OK: puppet ran at Mon Jul 1 23:09:42 UTC 2013
[23:09:54] RECOVERY - Puppet freshness on ms-be9 is OK: puppet ran at Mon Jul 1 23:09:47 UTC 2013
[23:09:55] RECOVERY - Puppet freshness on wtp1004 is OK: puppet ran at Mon Jul 1 23:09:47 UTC 2013
[23:09:58] RECOVERY - Puppet freshness on cp1036 is OK: puppet ran at Mon Jul 1 23:09:52 UTC 2013
[23:09:58] RECOVERY - Puppet freshness on db1029 is OK: puppet ran at Mon Jul 1 23:09:52 UTC 2013
[23:09:58] RECOVERY - Puppet freshness on mw1188 is OK: puppet ran at Mon Jul 1 23:09:57 UTC 2013
[23:09:59] RECOVERY - Puppet freshness on srv292 is OK: puppet ran at Mon Jul 1 23:09:57 UTC 2013
[23:09:59] RECOVERY - Puppet freshness on srv294 is OK: puppet ran at Mon Jul 1 23:09:57 UTC 2013
[23:10:08] RECOVERY - Puppet freshness on mw1210 is OK: puppet ran at Mon Jul 1 23:09:57 UTC 2013
[23:10:08] RECOVERY - Puppet freshness on mw1171 is OK: puppet ran at Mon Jul 1 23:09:57 UTC 2013
[23:10:08] RECOVERY - Puppet freshness on mw1197 is OK: puppet ran at Mon Jul 1 23:09:57 UTC 2013
[23:10:08] RECOVERY - Puppet freshness on mw1140 is OK: puppet ran at Mon Jul 1 23:09:57 UTC 2013
[23:10:08] RECOVERY - Puppet freshness on search1024 is OK: puppet ran at Mon Jul 1 23:09:57 UTC 2013
[23:10:08] RECOVERY - Puppet freshness on mw1041 is OK: puppet ran at Mon Jul 1 23:10:02 UTC 2013
[23:10:08] RECOVERY - Puppet freshness on mw121 is OK: puppet ran at Mon Jul 1 23:10:02 UTC 2013
[23:10:09] RECOVERY - Puppet freshness on mw1186 is OK: puppet ran at Mon Jul 1 23:10:02 UTC 2013
[23:10:10] RECOVERY - Puppet freshness on cp1049 is OK: puppet ran at Mon Jul 1 23:10:02 UTC 2013
[23:10:38] RECOVERY - Puppet freshness on mc1001 is OK: puppet ran at Mon Jul 1 23:10:28 UTC 2013
[23:10:38] RECOVERY - Puppet freshness on searchidx1001 is OK: puppet ran at Mon Jul 1 23:10:28 UTC 2013
[23:10:38] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Mon Jul 1 23:10:33 UTC 2013
[23:10:38] RECOVERY - Puppet freshness on sq76 is OK: puppet ran at Mon Jul 1 23:10:33 UTC 2013
[23:10:38] RECOVERY - Puppet freshness on ekrem is OK: puppet ran at Mon Jul 1 23:10:33 UTC 2013
[23:10:38] RECOVERY - Puppet freshness on mw1043 is OK: puppet ran at Mon Jul 1 23:10:33 UTC 2013
[23:10:39] RECOVERY - Puppet freshness on cp1005 is OK: puppet ran at Mon Jul 1 23:10:33 UTC 2013
[23:10:39] RECOVERY - Puppet freshness on db34 is OK: puppet ran at Mon Jul 1 23:10:33 UTC 2013
[23:10:48] RECOVERY - Puppet freshness on ms10 is OK: puppet ran at Mon Jul 1 23:10:38 UTC 2013
[23:10:48] RECOVERY - Puppet freshness on cp1002 is OK: puppet ran at Mon Jul 1 23:10:38 UTC 2013
[23:10:48] RECOVERY - Puppet freshness on search18 is OK: puppet ran at Mon Jul 1 23:10:38 UTC 2013
[23:10:48] RECOVERY - Puppet freshness on cp3009 is OK: puppet ran at Mon Jul 1 23:10:38 UTC 2013
[23:10:48] RECOVERY - Puppet freshness on mw58 is OK: puppet ran at Mon Jul 1 23:10:43 UTC 2013
[23:10:48] RECOVERY - Puppet freshness on rdb1002 is OK: puppet ran at Mon Jul 1 23:10:43 UTC 2013
[23:10:49] RECOVERY - Puppet freshness on praseodymium is OK: puppet ran at Mon Jul 1 23:10:43 UTC 2013
[23:10:49] RECOVERY - Puppet freshness on potassium is OK: puppet ran at Mon Jul 1 23:10:43 UTC 2013
[23:10:50] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Mon Jul 1 23:10:43 UTC 2013
[23:10:50] RECOVERY - Puppet freshness on db1001 is OK: puppet ran at Mon Jul 1 23:10:43 UTC 2013
[23:10:50] RECOVERY - Puppet freshness on mw1007 is OK: puppet ran at Mon Jul 1 23:10:43 UTC 2013
[23:10:58] RECOVERY - Puppet freshness on mc1009 is OK: puppet ran at Mon Jul 1 23:10:48 UTC 2013
[23:10:58] RECOVERY - Puppet freshness on ms-be1006 is OK: puppet ran at Mon Jul 1 23:10:48 UTC 2013
[23:10:58] RECOVERY - Puppet freshness on mw1106 is OK: puppet ran at Mon Jul 1 23:10:48 UTC 2013
[23:10:58] RECOVERY - Puppet freshness on mw1087 is OK: puppet ran at Mon Jul 1 23:10:48 UTC 2013
[23:10:58] RECOVERY - Puppet freshness on amssq53 is OK: puppet ran at Mon Jul 1 23:10:48 UTC 2013
[23:10:58] RECOVERY - Puppet freshness on cp1024 is OK: puppet ran at Mon Jul 1 23:10:48 UTC 2013
[23:10:59] RECOVERY - Puppet freshness on mw1063 is OK: puppet ran at Mon Jul 1 23:10:53 UTC 2013
[23:10:59] RECOVERY - Puppet freshness on srv273 is OK: puppet ran at Mon Jul 1 23:10:53 UTC 2013
[23:11:08] RECOVERY - Puppet freshness on cp3012 is OK: puppet ran at Mon Jul 1 23:10:58 UTC 2013
[23:11:08] RECOVERY - Puppet freshness on mw43 is OK: puppet ran at Mon Jul 1 23:10:58 UTC 2013
[23:11:08] RECOVERY - Puppet freshness on mc7 is OK: puppet ran at Mon Jul 1 23:10:58 UTC 2013
[23:11:08] RECOVERY - Puppet freshness on mw124 is OK: puppet ran at Mon Jul 1 23:10:58 UTC 2013
[23:11:08] RECOVERY - Puppet freshness on mw57 is OK: puppet ran at Mon Jul 1 23:10:58 UTC 2013
[23:11:08] RECOVERY - Puppet freshness on helium is OK: puppet ran at Mon Jul 1 23:11:03 UTC 2013
[23:11:09] RECOVERY - Puppet freshness on mw1032 is OK: puppet ran at Mon Jul 1 23:11:03 UTC 2013
[23:11:09] RECOVERY - Puppet freshness on mc1007 is OK: puppet ran at Mon Jul 1 23:11:03 UTC 2013
[23:11:10] RECOVERY - Puppet freshness on db1031 is OK: puppet ran at Mon Jul 1 23:11:03 UTC 2013
[23:11:10] RECOVERY - Puppet freshness on wtp1015 is OK: puppet ran at Mon Jul 1 23:11:04 UTC 2013
[23:11:18] RECOVERY - Puppet freshness on db39 is OK: puppet ran at Mon Jul 1 23:11:09 UTC 2013
[23:11:28] RECOVERY - Puppet freshness on srv255 is OK: puppet ran at Mon Jul 1 23:11:24 UTC 2013
[23:11:38] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[23:11:38] RECOVERY - Puppet freshness on sq58 is OK: puppet ran at Mon Jul 1 23:11:29 UTC 2013
[23:11:38] RECOVERY - Puppet freshness on sq54 is OK: puppet ran at Mon Jul 1 23:11:29 UTC 2013
[23:11:38] RECOVERY - Puppet freshness on mc10 is OK: puppet ran at Mon Jul 1 23:11:34 UTC 2013
[23:11:38] RECOVERY - Puppet freshness on analytics1014 is OK: puppet ran at Mon Jul 1 23:11:34 UTC 2013
[23:11:38] RECOVERY - Puppet freshness on rdb1003 is OK: puppet ran at Mon Jul 1 23:11:34 UTC 2013
[23:11:38] RECOVERY - Puppet freshness on mw1048 is OK: puppet ran at Mon Jul 1 23:11:34 UTC 2013
[23:11:39] RECOVERY - Puppet freshness on srv245 is OK: puppet ran at Mon Jul 1 23:11:34 UTC 2013
[23:11:48] RECOVERY - Puppet freshness on mc1 is OK: puppet ran at Mon Jul 1 23:11:39 UTC 2013
[23:11:48] RECOVERY - Puppet freshness on db63 is OK: puppet ran at Mon Jul 1 23:11:44 UTC 2013
[23:11:48] RECOVERY - Puppet freshness on wtp1007 is OK: puppet ran at Mon Jul 1 23:11:44 UTC 2013
[23:11:48] RECOVERY - Puppet freshness on ms-fe1001 is OK: puppet ran at Mon Jul 1 23:11:44 UTC 2013
[23:11:58] RECOVERY - Puppet freshness on db1022 is OK: puppet ran at Mon Jul 1 23:11:54 UTC 2013
[23:12:08] RECOVERY - Puppet freshness on srv193 is OK: puppet ran at Mon Jul 1 23:11:59 UTC 2013
[23:12:08] RECOVERY - Puppet freshness on mw79 is OK: puppet ran at Mon Jul 1 23:12:04 UTC 2013
[23:12:08] RECOVERY - Puppet freshness on mw1122 is OK: puppet ran at Mon Jul 1 23:12:04 UTC 2013
[23:12:18] RECOVERY - Puppet freshness on search33 is OK: puppet ran at Mon Jul 1 23:12:10 UTC 2013
[23:12:48] RECOVERY - Puppet freshness on pc1 is OK: puppet ran at Mon Jul 1 23:12:45 UTC 2013
[23:12:58] RECOVERY - Puppet freshness on mw8 is OK: puppet ran at Mon Jul 1 23:12:50 UTC 2013
[23:15:08] RECOVERY - Puppet freshness on cp1030 is OK: puppet ran at Mon Jul 1 23:14:57 UTC 2013
[23:41:17] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/71540
[23:49:52] !log logsmsgbot's cidr-filtering was broken by https://gerrit.wikimedia.org/r/#/c/69624/, manifested now because of neon restart earlier
[23:50:01] Logged the message, Master
[23:57:34] LeslieCarr: could you possibly pastebin /srv/tcpircbot/logsmsgbot.json on neon?
[23:58:07] er, in private
[23:58:49] LeslieCarr: i actually just want to verify the value of 'cidr', which is not private
[23:59:25] /srv/tcpircbot/logsmsgbot.json: No such file or directory
[23:59:33] yep
[23:59:40] oh wait it's there
[23:59:46] i misspelled it
[23:59:48] logmsgbot.json
[23:59:52] logmsgbot (one less s)
[23:59:56] yep, sorry :)
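[Editor's note] The exchange above revolves around tcpircbot's cidr-filtering: the bot reads a `cidr` key from its JSON config (`/srv/tcpircbot/logmsgbot.json`) and accepts log messages only from source addresses inside those blocks. The log does not show the actual config contents, so the sketch below is a hypothetical illustration of that kind of check using Python's standard `ipaddress` module; the config values and the exact schema are assumptions, not the real file.

```python
import ipaddress
import json

def is_allowed(peer_ip, cidr_blocks):
    """Return True if peer_ip falls inside any of the allowed CIDR blocks."""
    addr = ipaddress.ip_address(peer_ip)
    return any(addr in ipaddress.ip_network(block) for block in cidr_blocks)

# Hypothetical config shaped like the logmsgbot.json discussed above;
# the real file's 'cidr' values are not shown in the log.
config = json.loads('{"cidr": ["10.0.0.0/8", "208.80.152.0/22"]}')

print(is_allowed("10.64.0.12", config["cidr"]))  # address inside 10.0.0.0/8
print(is_allowed("192.0.2.7", config["cidr"]))   # address outside all blocks
```

A filter like this fails closed: if the `cidr` list in the deployed config is wrong or empty, every incoming message is rejected, which matches the symptom reported at 23:49:52 (filtering "broken" and only noticed after the neon restart reloaded the config).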