[00:03:37] (03Abandoned) 10Cmjohnson: Revert "removing mw1125 from dsh files- new hard drive has been installed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88674 (owner: 10Cmjohnson) [00:05:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 00:05:33 UTC 2013 [00:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [00:35:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 00:35:23 UTC 2013 [00:35:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [00:49:22] !log krinkle synchronized php-1.22wmf10/extensions/VisualEditor 'I85bce4d40e430318' [00:49:37] Logged the message, Master [00:51:41] AaronSchulz: wmf20, WikimediaMessages is dirty (commit that isn't in the origin repo, though a similar commit did get merged but has a different hash) [00:51:49] 2 commits ahead, ~ 20 commits behind [00:52:01] a commit from you, just fyi [00:52:10] wmf19, not wmf20 sorry [00:52:51] !log krinkle synchronized php-1.22wmf19/extensions/VisualEditor 'I03d68280ddd9506' [00:53:05] Logged the message, Master [00:57:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [00:58:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.996 second response time [01:00:22] Krinkle: I think that's for csteipp to sort out; it was part of the response to bug 54847 [01:01:02] it was a security patch initially [01:04:00] ori-l: they're both in gerrit (under a different commit hash) [01:04:09] not sure why Ib3e32cac1426f0dbeb55f872961fc8c87380c180 was uncomitted there, seems a trivial fix [01:04:18] mostly by association [01:05:22] I'm heading home but I'll tidy it up in a few hours if Chris doesn't beat me to the punch. It shouldn't fall on Aaron's head just because he was nice to submit the patch in the first place. [01:06:15] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 01:06:06 UTC 2013 [01:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [01:07:33] ori-l: sure, no problem. 
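(A note on the "2 commits ahead, ~ 20 commits behind" state Krinkle describes: the quickest way to see how a deployed checkout has diverged from Gerrit is to compare it against its upstream branch. A minimal sketch in shell; the path and branch name below are illustrative, not the actual deployment layout.)

    cd php-1.22wmf19/extensions/WikimediaMessages   # hypothetical working-copy path
    git fetch origin
    # count commits only in the local checkout vs. only on the remote branch
    git rev-list --left-right --count HEAD...origin/master
    # list the local-only commits so they can be matched against what was merged
    # in Gerrit (by subject rather than hash, since the hashes differ)
    git log --oneline origin/master..HEAD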
Upon further investigation it should all make sense, just annoying to see [01:35:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 01:35:25 UTC 2013 [01:35:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [02:05:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 02:05:45 UTC 2013 [02:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [02:16:09] !log LocalisationUpdate completed (1.22wmf20) at Wed Oct 9 02:16:08 UTC 2013 [02:16:26] Logged the message, Master [02:30:07] !log LocalisationUpdate completed (1.22wmf19) at Wed Oct 9 02:30:06 UTC 2013 [02:30:21] Logged the message, Master [02:35:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 02:35:31 UTC 2013 [02:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [02:45:00] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Oct 9 02:44:59 UTC 2013 [02:45:15] Logged the message, Master [03:05:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 03:05:21 UTC 2013 [03:05:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [03:35:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 03:35:46 UTC 2013 [03:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [04:05:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 04:05:25 UTC 2013 [04:05:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [04:08:15] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on bast4001 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4001 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4002 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4003 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4004 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4005 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4006 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:16] PROBLEM - Puppet freshness on cp4007 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:16] PROBLEM - Puppet freshness on cp4008 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:17] PROBLEM - Puppet freshness on cp4009 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:17] PROBLEM - Puppet freshness on cp4010 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:18] PROBLEM - Puppet freshness on cp4011 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:18] PROBLEM - Puppet freshness on cp4012 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:19] PROBLEM - Puppet freshness on cp4013 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:19] PROBLEM - Puppet freshness on cp4014 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:20] PROBLEM - Puppet freshness on cp4015 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:20] 
PROBLEM - Puppet freshness on cp4016 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:21] PROBLEM - Puppet freshness on cp4017 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:21] PROBLEM - Puppet freshness on cp4018 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:22] PROBLEM - Puppet freshness on cp4019 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:22] PROBLEM - Puppet freshness on cp4020 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:23] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:23] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:24] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:24] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: No successful Puppet run in the last 10 hours [04:26:15] PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 10 hours [04:38:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 04:38:52 UTC 2013 [04:39:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [05:05:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 05:05:51 UTC 2013 [05:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [05:16:15] PROBLEM - Puppet freshness on db1051 is CRITICAL: No successful Puppet run in the last 10 hours [05:16:15] PROBLEM - Puppet freshness on db54 is CRITICAL: No successful Puppet run in the last 10 hours [05:17:15] PROBLEM - Puppet freshness on mw1054 is CRITICAL: No successful Puppet run in the last 10 hours [05:17:15] PROBLEM - Puppet freshness on mw1154 is CRITICAL: No successful Puppet run in the last 10 hours [05:17:15] PROBLEM - Puppet freshness on mw1155 is CRITICAL: No successful Puppet run in the last 10 hours [05:17:15] PROBLEM - Puppet freshness on mw55 is CRITICAL: No successful Puppet run in the last 10 hours [05:17:15] PROBLEM - Puppet freshness on sq52 is CRITICAL: No successful Puppet run in the last 10 hours [05:18:55] RECOVERY - Puppet freshness on osm-cp1001 is OK: puppet ran at Wed Oct 9 05:18:53 UTC 2013 [05:20:15] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [05:20:15] PROBLEM - Puppet freshness on db57 is CRITICAL: No successful Puppet run in the last 10 hours [05:20:15] PROBLEM - Puppet freshness on mw1053 is CRITICAL: No successful Puppet run in the last 10 hours [05:20:15] PROBLEM - Puppet freshness on sq50 is CRITICAL: No successful Puppet run in the last 10 hours [05:21:15] PROBLEM - Puppet freshness on db55 is CRITICAL: No successful Puppet run in the last 10 hours [05:21:15] PROBLEM - Puppet freshness on db51 is CRITICAL: No successful Puppet run in the last 10 hours [05:21:15] PROBLEM - Puppet freshness on mw53 is CRITICAL: No successful Puppet run in the last 10 hours [05:24:15] PROBLEM - Puppet freshness on db56 is CRITICAL: No successful Puppet run in the last 10 hours [05:24:15] PROBLEM - Puppet freshness on mw1059 is CRITICAL: No successful Puppet run in the last 10 hours [05:24:16] PROBLEM - Puppet freshness on sq59 is CRITICAL: No successful Puppet run in the last 10 hours [05:26:15] PROBLEM - Puppet freshness on db1056 is CRITICAL: No successful Puppet run in the last 10 hours [05:27:15] PROBLEM - Puppet freshness on sq56 is CRITICAL: No 
successful Puppet run in the last 10 hours [05:27:15] PROBLEM - Puppet freshness on srv257 is CRITICAL: No successful Puppet run in the last 10 hours [05:28:15] PROBLEM - Puppet freshness on mw56 is CRITICAL: No successful Puppet run in the last 10 hours [05:28:15] PROBLEM - Puppet freshness on srv252 is CRITICAL: No successful Puppet run in the last 10 hours [05:29:15] PROBLEM - Puppet freshness on sq57 is CRITICAL: No successful Puppet run in the last 10 hours [05:30:15] PROBLEM - Puppet freshness on sq53 is CRITICAL: No successful Puppet run in the last 10 hours [05:31:15] PROBLEM - Puppet freshness on db50 is CRITICAL: No successful Puppet run in the last 10 hours [05:31:15] PROBLEM - Puppet freshness on mw1151 is CRITICAL: No successful Puppet run in the last 10 hours [05:32:15] PROBLEM - Puppet freshness on db1059 is CRITICAL: No successful Puppet run in the last 10 hours [05:34:15] PROBLEM - Puppet freshness on mw1152 is CRITICAL: No successful Puppet run in the last 10 hours [05:35:15] PROBLEM - Puppet freshness on srv256 is CRITICAL: No successful Puppet run in the last 10 hours [05:35:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 05:35:30 UTC 2013 [05:36:15] PROBLEM - Puppet freshness on mw54 is CRITICAL: No successful Puppet run in the last 10 hours [05:36:15] PROBLEM - Puppet freshness on db52 is CRITICAL: No successful Puppet run in the last 10 hours [05:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [05:38:15] PROBLEM - Puppet freshness on mw1158 is CRITICAL: No successful Puppet run in the last 10 hours [05:39:15] PROBLEM - Puppet freshness on srv258 is CRITICAL: No successful Puppet run in the last 10 hours [05:40:15] PROBLEM - Puppet freshness on db1052 is CRITICAL: No successful Puppet run in the last 10 hours [05:41:15] PROBLEM - Puppet freshness on cp1051 is CRITICAL: No successful Puppet run in the last 10 hours [05:42:15] PROBLEM - Puppet freshness on cp1050 is CRITICAL: No successful Puppet run in the last 10 hours [05:42:15] PROBLEM - Puppet freshness on sq51 is CRITICAL: No successful Puppet run in the last 10 hours [05:43:15] PROBLEM - Puppet freshness on mw1057 is CRITICAL: No successful Puppet run in the last 10 hours [05:43:15] PROBLEM - Puppet freshness on srv253 is CRITICAL: No successful Puppet run in the last 10 hours [05:45:15] PROBLEM - Puppet freshness on sq54 is CRITICAL: No successful Puppet run in the last 10 hours [05:45:15] PROBLEM - Puppet freshness on sq58 is CRITICAL: No successful Puppet run in the last 10 hours [05:48:15] PROBLEM - Puppet freshness on mw51 is CRITICAL: No successful Puppet run in the last 10 hours [05:51:15] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [05:54:15] PROBLEM - Puppet freshness on sq55 is CRITICAL: No successful Puppet run in the last 10 hours [05:55:15] PROBLEM - Puppet freshness on mw52 is CRITICAL: No successful Puppet run in the last 10 hours [05:55:15] PROBLEM - Puppet freshness on srv259 is CRITICAL: No successful Puppet run in the last 10 hours [05:57:15] PROBLEM - Puppet freshness on db59 is CRITICAL: No successful Puppet run in the last 10 hours [05:57:15] PROBLEM - Puppet freshness on mw59 is CRITICAL: No successful Puppet run in the last 10 hours [05:57:15] PROBLEM - Puppet freshness on srv251 is CRITICAL: No successful Puppet run in the last 10 hours [05:58:15] PROBLEM - Puppet freshness on mw1157 is CRITICAL: No successful Puppet run in the last 10 hours [05:59:15] 
PROBLEM - Puppet freshness on mw1058 is CRITICAL: No successful Puppet run in the last 10 hours [06:01:15] PROBLEM - Puppet freshness on mw1055 is CRITICAL: No successful Puppet run in the last 10 hours [06:01:15] PROBLEM - Puppet freshness on srv254 is CRITICAL: No successful Puppet run in the last 10 hours [06:02:15] PROBLEM - Puppet freshness on mw1051 is CRITICAL: No successful Puppet run in the last 10 hours [06:02:15] PROBLEM - Puppet freshness on mw1156 is CRITICAL: No successful Puppet run in the last 10 hours [06:03:15] PROBLEM - Puppet freshness on mw1050 is CRITICAL: No successful Puppet run in the last 10 hours [06:05:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 06:05:18 UTC 2013 [06:05:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [06:09:15] PROBLEM - Puppet freshness on mw1052 is CRITICAL: No successful Puppet run in the last 10 hours [06:09:15] PROBLEM - Puppet freshness on srv250 is CRITICAL: No successful Puppet run in the last 10 hours [06:11:15] PROBLEM - Puppet freshness on db53 is CRITICAL: No successful Puppet run in the last 10 hours [06:12:15] PROBLEM - Puppet freshness on mw1056 is CRITICAL: No successful Puppet run in the last 10 hours [06:12:15] PROBLEM - Puppet freshness on mw1159 is CRITICAL: No successful Puppet run in the last 10 hours [06:12:43] !log aaron synchronized php-1.22wmf19/includes 'c6d64f5488c124636ab46712e1d104e5c7076325' [06:13:00] Logged the message, Master [06:13:15] PROBLEM - Puppet freshness on db58 is CRITICAL: No successful Puppet run in the last 10 hours [06:13:15] PROBLEM - Puppet freshness on mw57 is CRITICAL: No successful Puppet run in the last 10 hours [06:13:15] PROBLEM - Puppet freshness on mw58 is CRITICAL: No successful Puppet run in the last 10 hours [06:13:15] PROBLEM - Puppet freshness on srv255 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:15] PROBLEM - Puppet freshness on db1050 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:15] PROBLEM - Puppet freshness on mw1150 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:15] PROBLEM - Puppet freshness on mw1153 is CRITICAL: No successful Puppet run in the last 10 hours [06:35:45] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 06:35:44 UTC 2013 [06:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [07:05:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 07:05:20 UTC 2013 [07:05:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [07:07:56] (03PS1) 10Springle: repool db1022 in S6. move db1039 to assist with upgrades in S7. depool db1007 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88691 [07:08:46] (03CR) 10Springle: [C: 032] repool db1022 in S6. move db1039 to assist with upgrades in S7. depool db1007 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88691 (owner: 10Springle) [07:18:47] !log springle synchronized wmf-config/db-eqiad.php 'repool db1022 in S6. move db1039 to assist with upgrades in S7. 
depool db1007 for upgrade' [07:18:58] Logged the message, Master [07:35:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 07:35:53 UTC 2013 [07:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [07:51:33] (03PS1) 10Ori.livneh: Navigation Timing: differentiate by auth status rather than wiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/88692 [07:53:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [07:54:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 12.396 second response time [07:57:23] mutante: I am awake :-] [07:57:35] hashar: pong [07:58:11] mutante: rebooting -:] [07:58:18] enjoy your breakfast [07:58:21] I am getting my coffee [07:58:22] connecting to mgmt [07:58:35] ohh [07:58:39] there is another kernel :-) [07:59:00] ah, let's get that too [08:00:18] okk [08:01:06] bah it is merely "Bump ABI" [08:02:16] ah yea, so no new features but can still be fixes [08:02:32] !log rebooting gallium [08:02:43] Logged the message, Master [08:03:15] PROBLEM - Puppet freshness on amssq49 is CRITICAL: No successful Puppet run in the last 10 hours [08:04:29] now we "just" have to wait for fsck :D [08:05:45] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 08:05:42 UTC 2013 [08:06:25] PROBLEM - zuul_service_running on gallium is CRITICAL: Connection refused by host [08:06:25] PROBLEM - SSH on gallium is CRITICAL: Connection refused [08:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [08:06:35] PROBLEM - HTTP on gallium is CRITICAL: Connection refused [08:06:55] PROBLEM - jenkins_service_running on gallium is CRITICAL: Connection refused by host [08:07:19] mutante: ^^^^^ it does :-] [08:07:33] though I have no idea whether that emits pages [08:08:14] the zuul service just notify by irc https://icinga.wikimedia.org/cgi-bin/icinga/notifications.cgi?host=gallium&service=zuul_service_running [08:08:19] hashar: the jekins_service_running does not [08:08:32] monitor_service would have to have critical => true for that [08:08:55] paging is probably unneeded [08:08:59] it is not that critical [08:09:05] nod [08:09:12] at worth devs will complain for a couple hours until some root / eng folk get to restart the services [08:09:23] hmm, i dont really see output on console [08:09:39] maybe reconnect? [08:09:52] apparently NTP is back up [08:10:05] did and gets the garbled output [08:12:40] wth.. mgmt refuses connection after reset [08:12:44] bahh [08:12:48] narf [08:13:05] ah wait, slooow [08:13:43] 583.913308] SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) [08:14:15] PROBLEM - Puppet freshness on amssq48 is CRITICAL: No successful Puppet run in the last 10 hours [08:14:21] it looks stuck,i'd had to show you a screenshot .. [08:14:32] -1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. [08:15:17] if you stare closely at it, it becomes 3D [08:15:41] hashar: let me powercycle it .. 
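(An aside on hashar's "now we 'just' have to wait for fsck": ext filesystems record when they were last checked and how often they have been mounted, and when either threshold is exceeded the boot-time fsck is forced — which is where the "470 days without being checked" message a few lines below comes from. A sketch for inspecting, and if wanted controlling, that behaviour; the device name is only an example.)

    sudo tune2fs -l /dev/md0 | grep -Ei 'last checked|mount count|check interval'
    # disable the age/mount-count based checks so fsck only runs when you schedule it:
    #   sudo tune2fs -c 0 -i 0 /dev/md0
    # or, conversely, request a full check on the next reboot:
    #   sudo touch /forcefsck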
sigh
[08:16:02] sorry :(
[08:16:06] morning hashar / mutante btw
[08:16:14] no reason to be sorry
[08:16:47] actually sees BIOS and booting normally now
[08:16:54] ori-l: hello :-)
[08:16:57] hi ori
[08:17:15] PROBLEM - Puppet freshness on amssq51 is CRITICAL: No successful Puppet run in the last 10 hours
[08:17:27] /dev/md0 has gone 470 days without being checked, check forced.
[08:17:27] Checking disk drives for errors. This may take several minutes.
[08:17:30] there we go
[08:17:41] yeah several hours would be a more accurate message
[08:17:45] danke!
[08:18:46] 470 days ..
[08:18:51] ori-l: I sent a bunch of changes to carbon aggregation and added you as a reviewer
[08:18:58] but we rebooted earlier than that
[08:19:03] ori-l: not really sure what the impacts are though :(
[08:19:16] mutante: yeah a few days ago it was 143~ days uptime
[08:19:31] hashar: i only saw the comment change
[08:19:42] sounds like we gave up/canceled fsck another time
[08:19:50] ori-l: and it is not really a priority :-]
[08:20:18] ori-l: also yesterday we had a release/QA weekly check-in, we talked a bit about your exception/Fatal to json system. Greg is supposed to reach out to you eventually
[08:20:29] ori-l: a good plan would be to enable it on beta cluster for people to play with :]
[08:20:45] PROBLEM - DPKG on vanadium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:20:59] mutante: I guess you can context switch to some other small task :]
[08:21:12] mutante: I am tailing syslog for gallium ip so will get aware whenever it is back
[08:21:15] heh, ok, task, make coffee
[08:21:24] cool
[08:21:45] RECOVERY - DPKG on vanadium is OK: All packages OK
[08:22:09] discovery: the icinga dpkg check fails if it happens to run during apt-get upgrade
[08:23:17] ahh good catch
[08:24:06] !log Stopping EventLogging, then rebooting vanadium for kernel upgrade
[08:24:15] PROBLEM - Puppet freshness on amssq50 is CRITICAL: No successful Puppet run in the last 10 hours
[08:24:17] Logged the message, Master
[08:24:31] it basically just does dpkg -l | grep -v ^ii
[08:25:18] eh, anything that is not ii or rc it doesnt like afair
[08:25:46] must be kind of lucky to catch it during an upgrade though
[08:27:55] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/vanadium consumer/server-side-events-log consumer/mysql-db1047 consumer/client-side-events-log consumer/all-events-log multiplexer/all-events processor/server-side-events processor/client-side-events forwarder/8422 forwarder/8421
[08:30:18] ori-l: event logging jobs dead ^^^
[08:30:31] hashar: see !log above
[08:30:55] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning.
[08:31:12] ori-l: sorry for INT
[08:31:26] no problem at all
[08:36:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 08:35:56 UTC 2013
[08:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[08:38:35] graphite metric represented as a "stock ticker" : http://square.github.io/cubism/ :D
[08:39:02] i actually had that set up on labs at one point for edits
[08:39:14] * hashar feels useless
[08:39:19] it's not bad but horizon charts a bit hard to read
[08:39:41] anyone want to help looking at puppet freshness issue?
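(An aside on the DPKG check race ori-l and mutante noted above: the check is described as essentially `dpkg -l | grep -v ^ii`, disliking anything not in state ii or rc. A rough approximation in shell — not the actual Icinga plugin — showing why a host mid-upgrade trips it.)

    # list packages whose state is neither "ii" (installed) nor "rc" (removed, conffiles kept)
    dpkg -l | tail -n +6 | grep -Ev '^(ii|rc) ' || echo "all packages in a clean state"
    # while apt-get upgrade is running, packages pass through transient states such as
    # "iU" (unpacked) or "iF" (half-configured), so a check that happens to fire at that
    # moment reports broken packages even though nothing is actually wrong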
db1051 (and various other hosts) as of about 19:45 last night,
[08:39:43] i find it nice to represent a ton of data at the same time
[08:39:47] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to parse template squid/squid-disk-permissions.erb: Could not find value for 'squid_coss_disks' at 5:/etc/puppet/templates/squid/squid-disk-permissions.erb at /etc/puppet/manifests/squid.pp:67 on node db1051.eqiad.wmnet
[08:39:58] apergos: i can look
[08:40:05] apergos: why would we have squid on a db server?
[08:40:12] saw nothing suspicious in git log or on puppetmaster, another pair of eyes would help
[08:40:34] well what's better is that some of the nodes in the same node defn run fine (unless I am reading it wrong)
[08:40:51] there's a number of other servers all with the same error, see icinga, some squids and some are dbs
[08:41:28] there's no reason on earth we would have hashar and afaict we didn't have (nor do we on the dbs that run successfully)
[08:41:56] i.e. I did root@db1017:~# grep squid /var/lib/puppet/state/classes.txt and got nothing back
[08:42:05] but now I need to shut up and let someone fresh look at it
[08:42:16] PROBLEM - Puppet freshness on amssq52 is CRITICAL: No successful Puppet run in the last 10 hours
[08:42:16] PROBLEM - Puppet freshness on amssq53 is CRITICAL: No successful Puppet run in the last 10 hours
[08:42:32] apergos: I found that sometimes using "puppetd -tv --evaltrace" helps
[08:42:44] that shows the class being processed by puppet
[08:42:52] first thing that stands out (but is probably unrelated) is $squid_coss_disks = split(get_var('squid_coss_disks'), ',')
[08:42:57] i was like, 'what the hell is get_var'
[08:43:09] so i googled it and what do i find if not a bug report from hashar :P
[08:43:11] https://bugzilla.wikimedia.org/show_bug.cgi?id=38524
[08:43:25] hehe
[08:43:32] yeah I came across that on beta
[08:43:44] but that should be labs related
[08:43:46] it appears to be something from a media temple puppet module that was carelessly imported
[08:43:58] "Unknown function get_var "
[08:44:00] yeah, not causing the issue apergos is seeing, but still worth flagging since it's broken code
[08:44:02] so if you look at a squid node, you see how it goes, and (most of) those run fine
[08:44:45] there is one sure thing which is that db1051 should not have the squid manifest applied
[08:44:59] I have no clue how to simulate a run of puppet given a node named 'db1051'
[08:46:24] * apergos tries evaltrace
[08:46:32] didn't know about that option
[08:47:03] nope, no class listed before the error, too bad
[08:48:52] apergos: I came across evaltrace a while back and listed it on the [[Puppet]] wikitech article https://wikitech.wikimedia.org/wiki/Puppet#Debugging
[08:49:18] well we aren't into that I guess, still waiting for the server to complete parsing
[08:49:22] so problematic :-(
[08:50:44] apergos: it happens repeatedly and consistently on db1051?
[08:50:54] on a pile of hosts
[08:51:14] oh yeah, i see the list in the scrollback
[08:51:17] https://icinga.wikimedia.org/icinga/ check the passive checks
[08:51:31] under critical
[08:52:13] notice something interesting?
[08:52:38] all the fails are hosts with 5*, 105*, 115*, 25*
[08:52:41] apergos: icinga..
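(The root cause the channel converges on just below is the node regex in site.pp: without grouping, the alternation makes `5[0-9]` a branch of its own, and since puppet node regexes are unanchored, any certname containing "5" plus a digit — db1051, mw1054, sq52, cp1050, ... — falls into the squid node definition. A quick way to see it, using example hostnames from the log.)

    printf '%s\n' db1051.eqiad.wmnet mw1054.eqiad.wmnet amssq48.esams.wikimedia.org |
      grep -E 'amssq4[7-9]|5[0-9]|6[0-2]\.esams\.wikimedia\.org$'
    # -> all three hostnames match; the first two only via the stray '5[0-9]' branch
    printf '%s\n' db1051.eqiad.wmnet mw1054.eqiad.wmnet amssq48.esams.wikimedia.org |
      grep -E '^amssq(4[7-9]|5[0-9]|6[0-2])\.esams\.wikimedia\.org$'
    # -> only amssq48.esams.wikimedia.org: the grouped (and, per ori-l's suggestion,
    #    ^-anchored) form catches only the esams squids it was meant for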
of all those Puppet freshness checks, many are disabled
[08:52:47] and 20 are new ULSFO hosts
[08:53:12] talking about mw, srv, db, sq
[08:53:24] could be wrong regex in site.pp
[08:53:55] (03PS1) 10Addshore: Various user rights config changes on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88694
[08:53:56] node /amssq4[7-9]|5[0-9]|6[0-2]\.esams\.wikimedia\.org$/ {
[08:54:26] 724df4d8d55750b6018bf49424762e577efef4ec
[08:54:35] ouch
[08:54:36] needs a paren
[08:55:12] yep
[08:55:16] nice catch
[08:55:24] no jenkins right now..cough
[08:55:28] let me get this in
[08:56:29] \O/
[08:56:37] (03PS1) 10ArielGlenn: fix up amssq node expr (caught non squid hosts) [operations/puppet] - 10https://gerrit.wikimedia.org/r/88695
[08:56:44] you will get to force merge it
[08:56:47] jenkins busy rebooting
[08:56:48] good catch!
[08:56:58] someone else wanna double check me please?
[08:57:25] I've been staring at this for a long time now so my eyes are tired
[08:57:37] but you said the magic words regexp and then it was obvious
[08:58:15] PROBLEM - Puppet freshness on amssq58 is CRITICAL: No successful Puppet run in the last 10 hours
[08:58:30] I have no clue how to simulate a puppet run for a given node though :(
[08:58:35] it looks right. i'd add a ^ at the beginning but it's not critical, mostly a style thing.
[08:58:57] looks good, like the example above it
[08:59:17] just that it ends in $ and the other doesn't, but same as what ori said
[09:00:14] (03CR) 10ArielGlenn: [C: 032 V: 032] fix up amssq node expr (caught non squid hosts) [operations/puppet] - 10https://gerrit.wikimedia.org/r/88695 (owner: 10ArielGlenn)
[09:00:17] meh, merging anyways, worst that happens we break some more hosts :-P
[09:03:15] PROBLEM - Puppet freshness on amssq54 is CRITICAL: No successful Puppet run in the last 10 hours
[09:03:26] yeah yeah hush
[09:03:35] RECOVERY - Puppet freshness on db1051 is OK: puppet ran at Wed Oct 9 09:03:32 UTC 2013
[09:04:15] PROBLEM - Puppet freshness on amssq61 is CRITICAL: No successful Puppet run in the last 10 hours
[09:04:35] RECOVERY - Puppet freshness on db52 is OK: puppet ran at Wed Oct 9 09:04:32 UTC 2013
[09:04:52] guess I'll wait for the dust to settle and see what's left
[09:05:05] RECOVERY - Puppet freshness on mw54 is OK: puppet ran at Wed Oct 9 09:04:57 UTC 2013
[09:05:05] RECOVERY - Puppet freshness on mw51 is OK: puppet ran at Wed Oct 9 09:05:02 UTC 2013
[09:05:17] time for breakfast
[09:06:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 09:05:57 UTC 2013
[09:06:15] PROBLEM - Puppet freshness on amssq55 is CRITICAL: No successful Puppet run in the last 10 hours
[09:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[09:06:55] RECOVERY - Puppet freshness on srv258 is OK: puppet ran at Wed Oct 9 09:06:48 UTC 2013
[09:06:55] RECOVERY - Puppet freshness on mw1052 is OK: puppet ran at Wed Oct 9 09:06:53 UTC 2013
[09:07:05] RECOVERY - Puppet freshness on srv250 is OK: puppet ran at Wed Oct 9 09:06:58 UTC 2013
[09:07:35] RECOVERY - Puppet freshness on db53 is OK: puppet ran at Wed Oct 9 09:07:28 UTC 2013
[09:07:43] :)
[09:08:16] PROBLEM - Puppet freshness on amssq57 is CRITICAL: No successful Puppet run in the last 10 hours
[09:08:16] RECOVERY - Puppet freshness on cp1051 is OK: puppet ran at Wed Oct 9 09:08:14 UTC 2013
[09:08:20] apergos: nice fix, enjoy
[09:08:25] hashar: it's coming back:)
[09:08:26] RECOVERY - SSH on gallium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1
(protocol 2.0) [09:08:30] ^ [09:08:34] mutante: hurrah [09:08:36] RECOVERY - Puppet freshness on cp1050 is OK: puppet ran at Wed Oct 9 09:08:34 UTC 2013 [09:08:36] RECOVERY - Puppet freshness on db1052 is OK: puppet ran at Wed Oct 9 09:08:34 UTC 2013 [09:08:47] gallium login: [09:08:54] I am on [09:08:55] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [09:09:25] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [09:09:35] RECOVERY - HTTP on gallium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 563 bytes in 0.001 second response time [09:09:47] HTTPError: HTTP Error 503: Service Temporarily Unavailable [09:09:48] hehe [09:09:52] jenkins busy restarting [09:09:52] ha! [09:10:25] RECOVERY - Puppet freshness on mw1057 is OK: puppet ran at Wed Oct 9 09:10:15 UTC 2013 [09:10:34] !log gallium : zuul backup, waiting for jenkins to complete start up [09:10:45] RECOVERY - Puppet freshness on mw1159 is OK: puppet ran at Wed Oct 9 09:10:35 UTC 2013 [09:10:45] RECOVERY - Puppet freshness on db58 is OK: puppet ran at Wed Oct 9 09:10:35 UTC 2013 [09:10:45] RECOVERY - Puppet freshness on mw1056 is OK: puppet ran at Wed Oct 9 09:10:40 UTC 2013 [09:10:45] Logged the message, Master [09:10:55] RECOVERY - Puppet freshness on srv253 is OK: puppet ran at Wed Oct 9 09:10:50 UTC 2013 [09:12:05] RECOVERY - Puppet freshness on srv255 is OK: puppet ran at Wed Oct 9 09:11:55 UTC 2013 [09:12:15] RECOVERY - Puppet freshness on sq58 is OK: puppet ran at Wed Oct 9 09:12:05 UTC 2013 [09:12:15] RECOVERY - Puppet freshness on mw57 is OK: puppet ran at Wed Oct 9 09:12:05 UTC 2013 [09:12:25] RECOVERY - Puppet freshness on mw58 is OK: puppet ran at Wed Oct 9 09:12:21 UTC 2013 [09:13:05] RECOVERY - Puppet freshness on db1050 is OK: puppet ran at Wed Oct 9 09:13:01 UTC 2013 [09:13:55] RECOVERY - Puppet freshness on sq54 is OK: puppet ran at Wed Oct 9 09:13:46 UTC 2013 [09:13:55] RECOVERY - Puppet freshness on srv256 is OK: puppet ran at Wed Oct 9 09:13:51 UTC 2013 [09:14:55] RECOVERY - Puppet freshness on db54 is OK: puppet ran at Wed Oct 9 09:14:51 UTC 2013 [09:15:05] RECOVERY - Puppet freshness on mw1150 is OK: puppet ran at Wed Oct 9 09:14:56 UTC 2013 [09:15:05] RECOVERY - Puppet freshness on sq52 is OK: puppet ran at Wed Oct 9 09:15:01 UTC 2013 [09:15:15] RECOVERY - Puppet freshness on mw1153 is OK: puppet ran at Wed Oct 9 09:15:11 UTC 2013 [09:15:55] RECOVERY - Puppet freshness on mw1054 is OK: puppet ran at Wed Oct 9 09:15:52 UTC 2013 [09:15:55] RECOVERY - Puppet freshness on mw55 is OK: puppet ran at Wed Oct 9 09:15:52 UTC 2013 [09:16:25] RECOVERY - Puppet freshness on mw1155 is OK: puppet ran at Wed Oct 9 09:16:22 UTC 2013 [09:16:26] RECOVERY - Puppet freshness on mw1154 is OK: puppet ran at Wed Oct 9 09:16:22 UTC 2013 [09:17:55] RECOVERY - Puppet freshness on sq51 is OK: puppet ran at Wed Oct 9 09:17:52 UTC 2013 [09:19:05] RECOVERY - Puppet freshness on sq50 is OK: puppet ran at Wed Oct 9 09:18:58 UTC 2013 [09:19:15] RECOVERY - Puppet freshness on db51 is OK: puppet ran at Wed Oct 9 09:19:08 UTC 2013 [09:19:25] RECOVERY - Puppet freshness on db1058 is OK: puppet ran at Wed Oct 9 09:19:23 UTC 2013 [09:19:55] RECOVERY - Puppet freshness on mw1053 is OK: puppet ran at Wed Oct 9 09:19:48 UTC 2013 [09:20:05] RECOVERY - Puppet freshness on db55 is OK: puppet ran at Wed Oct 9 09:19:59 UTC 2013 [09:20:05] RECOVERY - Puppet freshness on db57 is OK: 
puppet ran at Wed Oct 9 09:20:04 UTC 2013 [09:20:15] PROBLEM - Puppet freshness on amssq56 is CRITICAL: No successful Puppet run in the last 10 hours [09:21:05] RECOVERY - Puppet freshness on mw53 is OK: puppet ran at Wed Oct 9 09:21:04 UTC 2013 [09:21:05] RECOVERY - Puppet freshness on sq55 is OK: puppet ran at Wed Oct 9 09:21:04 UTC 2013 [09:21:15] PROBLEM - Puppet freshness on amssq59 is CRITICAL: No successful Puppet run in the last 10 hours [09:22:05] RECOVERY - Puppet freshness on db59 is OK: puppet ran at Wed Oct 9 09:21:59 UTC 2013 [09:22:55] RECOVERY - Puppet freshness on srv259 is OK: puppet ran at Wed Oct 9 09:22:45 UTC 2013 [09:22:55] RECOVERY - Puppet freshness on mw52 is OK: puppet ran at Wed Oct 9 09:22:45 UTC 2013 [09:23:15] RECOVERY - Puppet freshness on sq59 is OK: puppet ran at Wed Oct 9 09:23:10 UTC 2013 [09:23:45] RECOVERY - Puppet freshness on mw1059 is OK: puppet ran at Wed Oct 9 09:23:40 UTC 2013 [09:23:45] RECOVERY - Puppet freshness on db56 is OK: puppet ran at Wed Oct 9 09:23:40 UTC 2013 [09:24:05] RECOVERY - Puppet freshness on db1056 is OK: puppet ran at Wed Oct 9 09:23:55 UTC 2013 [09:24:45] RECOVERY - Puppet freshness on mw59 is OK: puppet ran at Wed Oct 9 09:24:40 UTC 2013 [09:25:05] RECOVERY - Puppet freshness on srv251 is OK: puppet ran at Wed Oct 9 09:24:55 UTC 2013 [09:25:45] RECOVERY - Puppet freshness on mw56 is OK: puppet ran at Wed Oct 9 09:25:36 UTC 2013 [09:25:45] RECOVERY - Puppet freshness on srv257 is OK: puppet ran at Wed Oct 9 09:25:41 UTC 2013 [09:25:45] RECOVERY - Puppet freshness on mw1158 is OK: puppet ran at Wed Oct 9 09:25:41 UTC 2013 [09:25:46] hehe [09:25:55] and because apergos now fixed that regex I guess we need to reinstall those servers [09:25:55] RECOVERY - Puppet freshness on mw1058 is OK: puppet ran at Wed Oct 9 09:25:46 UTC 2013 [09:26:02] because they're not supposed to have squid on them [09:26:05] RECOVERY - Puppet freshness on mw1157 is OK: puppet ran at Wed Oct 9 09:25:56 UTC 2013 [09:26:16] PROBLEM - Puppet freshness on amssq60 is CRITICAL: No successful Puppet run in the last 10 hours [09:26:16] RECOVERY - Puppet freshness on sq56 is OK: puppet ran at Wed Oct 9 09:26:06 UTC 2013 [09:26:23] mark: they don't have squid on them because puppet was unable to apply it [09:26:32] but they probably do now [09:26:41] not apergos fault [09:26:45] RECOVERY - Puppet freshness on srv252 is OK: puppet ran at Wed Oct 9 09:26:41 UTC 2013 [09:27:10] oh I mean, the amssq* ones, that the node regex was meant for [09:27:15] RECOVERY - Puppet freshness on sq57 is OK: puppet ran at Wed Oct 9 09:27:06 UTC 2013 [09:27:47] right, not all those other hosts; that would have been awful [09:27:55] RECOVERY - Puppet freshness on sq53 is OK: puppet ran at Wed Oct 9 09:27:51 UTC 2013 [09:27:56] indeed [09:28:05] RECOVERY - Puppet freshness on srv254 is OK: puppet ran at Wed Oct 9 09:28:02 UTC 2013 [09:28:35] RECOVERY - Puppet freshness on mw1055 is OK: puppet ran at Wed Oct 9 09:28:27 UTC 2013 [09:29:05] RECOVERY - Puppet freshness on db50 is OK: puppet ran at Wed Oct 9 09:28:57 UTC 2013 [09:29:05] RECOVERY - Puppet freshness on mw1151 is OK: puppet ran at Wed Oct 9 09:29:02 UTC 2013 [09:29:55] RECOVERY - Puppet freshness on mw1156 is OK: puppet ran at Wed Oct 9 09:29:52 UTC 2013 [09:29:55] RECOVERY - Puppet freshness on mw1050 is OK: puppet ran at Wed Oct 9 09:29:52 UTC 2013 [09:30:05] RECOVERY - Puppet freshness on mw1051 is OK: puppet ran at Wed Oct 9 09:29:57 UTC 2013 [09:30:08] (03PS1) 10Ori.livneh: Remove unused (and broken) labs config from 
squid manifest [operations/puppet] - 10https://gerrit.wikimedia.org/r/88698 [09:30:15] PROBLEM - Puppet freshness on amssq62 is CRITICAL: No successful Puppet run in the last 10 hours [09:31:05] RECOVERY - Puppet freshness on db1059 is OK: puppet ran at Wed Oct 9 09:30:58 UTC 2013 [09:31:19] (03CR) 10jenkins-bot: [V: 04-1] Remove unused (and broken) labs config from squid manifest [operations/puppet] - 10https://gerrit.wikimedia.org/r/88698 (owner: 10Ori.livneh) [09:31:45] RECOVERY - Puppet freshness on mw1152 is OK: puppet ran at Wed Oct 9 09:31:43 UTC 2013 [09:32:48] (03CR) 10Ori.livneh: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88698 (owner: 10Ori.livneh) [09:33:16] ori-l: jenkins still busy realoading [09:33:37] the '-1' feature appears to have loaded before some of the others :P [09:34:17] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=gallium.wikimedia.org&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Miscellaneous+eqiad [09:34:22] mutante: could you merge https://gerrit.wikimedia.org/r/#/c/88692/ possibly? [09:34:42] !log jenkins back up [09:34:53] Logged the message, Master [09:35:40] ori-l: no. [09:35:46] :> [09:35:50] ori-l: bed time [09:35:51] (03PS2) 10Hashar: Navigation Timing: differentiate by auth status rather than wiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/88692 (owner: 10Ori.livneh) [09:36:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 09:36:00 UTC 2013 [09:36:06] hijacked the change to validate jenkins [09:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [09:36:40] you can't make me [09:41:55] RECOVERY - Puppet freshness on mw1126 is OK: puppet ran at Wed Oct 9 09:41:47 UTC 2013 [09:43:25] (03CR) 10Dzahn: [C: 032] Navigation Timing: differentiate by auth status rather than wiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/88692 (owner: 10Ori.livneh) [09:43:46] thanks [09:46:00] mutante: Stop enabling him! [09:46:25] (03CR) 10Dzahn: [C: 032] delete dsh group 'nagios' [operations/puppet] - 10https://gerrit.wikimedia.org/r/88069 (owner: 10Dzahn) [09:47:30] I wonder if it's worth the effort of puppetising the machines I use [09:48:51] Might be useful even if just for an exercise to get more familiar with it [09:48:52] ori-l: np, made sense to me to compare that [09:48:58] Reedy: :P heh [09:49:04] how many nodes would it be? [09:50:07] They're all sort of 1 offs... 6 or 7 [09:50:25] Would enable me to rebuild them quicker if needed [09:50:29] Rather than playing guess the package [09:50:37] yea [09:50:51] my vps setup is mostly puppetized, https://github.com/atdt/2501 [09:51:19] (03PS2) 10Addshore: Various user rights config changes on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88694 [09:51:30] mostly i got tired of reading the znc man page for the 1000th time [09:51:44] Reedy: start with just the package definitions, sounds good [09:51:51] That's what I was thinking [09:52:06] Then there's a few config files, scripts I wrote and other crap that'd be nice to have a copy of [09:52:53] test-box-for-transitional-dummy-font-package-foo :P [09:53:25] RECOVERY - Puppet freshness on terbium is OK: puppet ran at Wed Oct 9 09:53:23 UTC 2013 [09:57:58] ACKNOWLEDGEMENT - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours daniel_zahn shut it. removed from puppet.RT #5908 [10:01:30] mutante: Jenkins is fine now. 
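(For the "start with just the package definitions" idea Reedy and ori-l discuss above, a minimal sketch of what that looks like for a one-off box; the package names are placeholders. The point is that one small manifest plus `puppet apply` already beats playing guess-the-package after a rebuild.)

    cat > onebox.pp <<'EOF'
    # everything this box needs, written down once
    $pkgs = ['znc', 'git', 'screen', 'nginx']
    package { $pkgs:
      ensure => installed,
    }
    EOF
    sudo puppet apply onebox.pp
    # hand-edited config files and home-grown scripts can follow later as file {} resources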
Thank you very much :-] [10:02:04] ori-l: if you are still around, bottom of https://integration.wikimedia.org/zuul/ has a surprise for you :-] [10:03:26] hashar: great:) yw [10:03:38] ah, neat [10:05:01] whenever I get zuul upgraded we will get an idea of how long changes have been waiting [10:05:10] and even provide an estimation of the completion [10:06:32] (03PS2) 10Dzahn: redirect pk.wikimedia.org to meta community page [operations/apache-config] - 10https://gerrit.wikimedia.org/r/86652 [10:08:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 10:07:56 UTC 2013 [10:08:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [10:12:01] (03CR) 10Ladsgroup: [C: 031] "as another native speaker I confirm the translation, thank you Ebrahim for doing this :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88108 (owner: 10Ebrahim) [10:17:40] (03CR) 10Dzahn: [C: 04-2] "oh, wait, all of a sudden we own this. Record last updated on..: 2013-10-08. just need to make MarkMonitor switch NS over to us and we'll " [operations/dns] - 10https://gerrit.wikimedia.org/r/86658 (owner: 10Dzahn) [10:18:18] (03Abandoned) 10Dzahn: remove wikipaedia.net , not in list per RT #5681 [operations/dns] - 10https://gerrit.wikimedia.org/r/86658 (owner: 10Dzahn) [10:34:15] (03PS1) 10Dzahn: redirect vikipedi[a].com.tr to tr.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/88705 [10:36:05] (03CR) 10Mark Bergsma: [C: 032] Move the remaining non-wikipedia projects to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/88509 (owner: 10Mark Bergsma) [10:38:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [10:40:23] (03PS1) 10Dzahn: add vikipedia.com.tr and vikipedia.com.tr RT #5928 [operations/dns] - 10https://gerrit.wikimedia.org/r/88706 [10:44:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.181 second response time [10:54:59] (03CR) 10Dzahn: [C: 031] "using this for some maintenance and didn't want to keep unpuppetized files. makes more sense though if we'd keep it in sync on changes" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88126 (owner: 10Dzahn) [10:57:37] !log Non-Wikipedia projects are now on Varnish for eqiad non-ipv6, non-https traffic [10:57:52] Logged the message, Master [11:02:23] (03CR) 10Dzahn: "this is _really_ not used for OTRS mail anymore, right?" [operations/dns] - 10https://gerrit.wikimedia.org/r/88147 (owner: 10Dzahn) [11:03:38] (03CR) 10Dzahn: "Jeff, fwiw, there's also jgreens_otrs_cgi_experiments on there (before it gets wiped)" [operations/dns] - 10https://gerrit.wikimedia.org/r/88147 (owner: 10Dzahn) [11:06:52] ERROR: tab character found on line 2 [11:06:56] wtf puppet-lint [11:07:41] Reedy: welcome to retabbing ,hehe [11:08:19] wait, here's the .vimrc [11:08:28] I don't use vim :p [11:08:32] Don't we prefer tabs in the puppet repo? 
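(On the retabbing itself — the repo's manifests are moving to four-space indentation, as the next messages note — a blunt shell approach for a single manifest. The file name is hypothetical, and the last pipeline just summarizes which puppet-lint checks fire most often, assuming its usual "file - WARNING: ... on line N" output format.)

    sed -i 's/\t/    /g' manifests/mymodule.pp     # naive: also retabs tabs inside quoted strings
    puppet-lint --with-filename manifests/mymodule.pp
    # rough survey of the most common complaints across the whole tree:
    puppet-lint --with-filename manifests/ 2>/dev/null \
      | sed -e 's/^.* - //' -e 's/ on line [0-9]*$//' | sort | uniq -c | sort -rn | head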
[11:09:02] it's 4-space now [11:09:03] not anymore [11:09:08] yes, that [11:09:15] for puppet manifests [11:09:20] we're not gonna retab all configs etc [11:10:29] yeah, wasn't suggesting we did [11:10:41] Slightly scary that tabs == Error [11:10:51] Reedy: https://wikitech.wikimedia.org/wiki/Puppet_coding#common_errors [11:10:57] yes [11:11:12] replaces the 2-space comment there with 4-space [11:11:43] ah, but the .vimrc is already good [11:11:47] what's that 80 space nonsense on that page [11:11:50] 80 char [11:12:15] well, I guess puppet lint would complain about it ;) [11:12:17] they are all just errors from puppet-lint [11:12:47] yea, just a list of the most common ones you get [11:13:55] $ puppet-lint --with-filename manifests/|wc -l [11:13:56] 41808 [11:13:57] :( [11:15:10] hashar: but i'm pretty sure that's well below what it was , i once had that number too quite a while ago [11:15:59] eh, and you'd have to compare ratio, errors/line [11:17:46] hashar: on March 22, "we get 35344 warnings or errors on 27982 lines of code. that's 1.26 per line" [11:18:28] 20362 of them were "tab character" [11:19:48] yeah slowly improving :-] [11:21:55] hashar: pasting your better example to apply 4-space to puppet files only [11:22:40] (03CR) 10Akosiaris: [C: 04-2] "It was right before." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [11:24:02] templatedir=$confdir/templates [11:24:08] $confdir isn't defined in the config :/ [11:27:49] mutante: https://gerrit.wikimedia.org/r/#/c/88706 this looks ok to me. Want to to submit and update DNS ? [11:30:22] akosiaris: sure, thanks, i think in this case the exact order doesnt matter (NS change, add to us, Apache redir) [11:31:05] mutante: yes I think the same. Doing it now [11:31:14] (03CR) 10Akosiaris: [C: 032] add vikipedia.com.tr and vikipedia.com.tr RT #5928 [operations/dns] - 10https://gerrit.wikimedia.org/r/88706 (owner: 10Dzahn) [11:33:10] i just had to mention the address they used;), San Francisco, Out of Turkey, Aruba [11:34:09] looooooooool [11:34:29] Aruba Jamaica.. oooh i wanna take ya [11:34:34] hehee [11:34:50] btw... the guys mounting that attack on spamhaus last spring ? [11:35:08] RIPE had an entry for them being based in Antartica [11:35:25] :o heh [11:35:37] and a phone in netherlads :P [11:36:13] once read about passport stamp collectors who want to "collect them all" and they talked about that one station in Antarctica that stamps passports of visitors [11:40:10] andre__: you must have loved this bug report message , so specific :) Bug 55498 - Webserver is down [11:40:30] (and then it's a tool labs project) [11:47:46] do need components for each project ?:P /me hides [11:58:15] RECOVERY - Puppet freshness on cp4001 is OK: manual debugging test [11:58:21] apergos: ^ [11:58:45] uh huh [11:58:49] so it arrives at the host and the snmptt command triggers icinga, must be in between [11:59:10] yep [11:59:24] so does the file get written? can't tell, if not why not? [11:59:35] if so, why doesn't it get processed right? 
where "the file" is the file in /var/spool/snmptt or whatever
[12:02:17] if it wasn't just ulsfo i would think snmptt needs killing and if you would have already checked tcpdump i would have thought iptables or networking
[12:02:38] but you saw the packet..so why just ulsfo
[12:03:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[12:04:02] don't know
[12:07:06] snmptt doesn't get it, snmptt.log has nothing from ulsfo
[12:07:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.462 second response time
[12:10:45] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:11:28] wha
[12:11:38] zillion pages
[12:11:40] damn
[12:11:43] ?
[12:11:56] esams lbs down
[12:12:14] arg, wait sorry
[12:12:16] probably just monitoring
[12:12:17] i think it was me
[12:12:42] i wanted to log dropped iptables rules
[12:12:44] on neon
[12:12:49] maybe the other way around... i can not connect to icinga
[12:12:57] mutante: ok ok... makes sense then
[12:13:18] sorry, i flushed the run and reapplying it
[12:13:43] the iptables rules, i just intended to log if anything gets dropped from ulsfo
[12:14:03] 70% packet loss now
[12:14:07] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100%
[12:14:07] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100%
[12:14:07] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100%
[12:14:07] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100%
[12:14:07] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[12:15:10] can't help with that pkt loss, sorry
[12:15:30] paravoid: there is no problem, don't worry
[12:15:45] monitoring issues
[12:15:47] i just made a mistake on neon itself
[12:15:55] sorry for pages
[12:16:39] we were debugging why puppet freshness doesnt work for ulsfo hosts
[12:16:57]
[12:17:24] apergos: yes [12:17:40] RECOVERY - Puppet freshness on cp4014 is OK: puppet ran at Wed Oct 9 12:17:38 UTC 2013 [12:17:49] there they are [12:17:50] RECOVERY - Host upload.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 90.27 ms [12:18:00] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 90.50 ms [12:18:03] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 91.85 ms [12:18:04] apergos: it is iptables then [12:18:10] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 93.81 ms [12:18:50] RECOVERY - Puppet freshness on cp4019 is OK: puppet ran at Wed Oct 9 12:18:44 UTC 2013 [12:19:40] RECOVERY - Puppet freshness on cp4005 is OK: puppet ran at Wed Oct 9 12:19:30 UTC 2013 [12:20:00] RECOVERY - Puppet freshness on cp4015 is OK: puppet ran at Wed Oct 9 12:19:55 UTC 2013 [12:21:20] RECOVERY - Puppet freshness on cp4017 is OK: puppet ran at Wed Oct 9 12:21:15 UTC 2013 [12:21:20] RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Wed Oct 9 12:21:15 UTC 2013 [12:21:26] yes, that at least is cleared up [12:25:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [12:26:31] rules restored [12:27:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 29.059 second response time [12:30:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [12:40:34] (03PS1) 10Springle: db1039 to s7 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88723 [12:43:21] (03CR) 10Springle: [C: 032] db1039 to s7 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88723 (owner: 10Springle) [12:43:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 24.456 second response time [12:46:20] (03PS1) 10Dzahn: add iptables accept from ulsfo for monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/88724 [12:46:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection timed out [12:49:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.604 second response time [12:55:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [12:55:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [12:56:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [12:57:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [12:58:13] RECOVERY - Puppet freshness on db1049 is OK: puppet ran at Wed Oct 9 12:58:07 UTC 2013 [12:58:31] time to replace that with Ferm [12:58:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [12:59:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:00:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:01:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:01:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.700 second response time [13:02:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 
hours [13:03:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:04:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:04:56] apergos: ^ now we know why it had notifications disabled ?:P [13:05:21] well next is fixing that instead of disabling it [13:05:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:06:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:07:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:08:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [13:08:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 16.632 second response time [13:08:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:09:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:10:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:11:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:12:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:13:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:14:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:15:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [13:15:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 11.920 second response time [13:15:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:19:25] !log start dump db1007 to db1039 for S7 file per table [13:19:37] Logged the message, Master [13:42:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [13:42:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 19.811 second response time [13:45:25] (03CR) 10Dereckson: [C: 031] "Groups exist. Rights has been checked. Ok." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88694 (owner: 10Addshore) [13:46:27] (03PS1) 10Cmjohnson: adding mw1125 back to dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/88731 [13:47:00] (03PS3) 10Dereckson: Various user rights config changes on Wikidata. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88694 (owner: 10Addshore) [13:47:11] (03CR) 10Dereckson: [C: 031] User group rights configuration on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88694 (owner: 10Addshore) [13:48:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [13:48:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 13.036 second response time [13:48:57] we don't use flap_detection in icinga it seems. we set it in generic-service and generic-host "flap_detection_enabled 1" but in global icinga.cfg, we enable_flap_detection=0. 
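(On the flap-detection settings just quoted: in Nagios/Icinga the per-object `flap_detection_enabled 1` only takes effect when the global `enable_flap_detection` switch is on, so with `enable_flap_detection=0` in icinga.cfg the service-level flags are inert. A quick way to confirm both layers; the paths are the Debian package defaults.)

    grep -n '^enable_flap_detection' /etc/icinga/icinga.cfg
    #   enable_flap_detection=0       <- global switch off
    grep -rn 'flap_detection_enabled' /etc/icinga/ | head
    #   flap_detection_enabled  1     <- per-host/service flags, ignored while the global is 0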
wondered if flap_detect going wrong could have anything to do with it [13:49:01] apergos: ^ [13:50:54] (03CR) 10Cmjohnson: [C: 032] adding mw1125 back to dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/88731 (owner: 10Cmjohnson) [13:51:33] hmmm http://docs.icinga.org/latest/en/flapping.html [13:52:11] interesting [14:06:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [14:06:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.568 second response time [14:08:23] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [14:10:23] PROBLEM - Puppet freshness on bast4001 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:23] PROBLEM - Puppet freshness on cp4002 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:23] PROBLEM - Puppet freshness on cp4003 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:23] PROBLEM - Puppet freshness on cp4004 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:23] PROBLEM - Puppet freshness on cp4006 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:24] PROBLEM - Puppet freshness on cp4008 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:24] PROBLEM - Puppet freshness on cp4007 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:25] PROBLEM - Puppet freshness on cp4009 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:25] PROBLEM - Puppet freshness on cp4010 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:26] PROBLEM - Puppet freshness on cp4011 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:26] PROBLEM - Puppet freshness on cp4012 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:27] PROBLEM - Puppet freshness on cp4013 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:27] PROBLEM - Puppet freshness on cp4016 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:28] PROBLEM - Puppet freshness on cp4018 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:28] PROBLEM - Puppet freshness on cp4020 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:29] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:29] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:30] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: No successful Puppet run in the last 10 hours [14:12:07] mutante: would you like to have a go at replacing those iptables rules with ferm? [14:12:24] because this just fixes it until the next dc/ip change ;) [14:13:17] (03PS1) 10Cmjohnson: isolating amssq47, changing role for amssq47-62 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88742 [14:13:47] (03CR) 10jenkins-bot: [V: 04-1] isolating amssq47, changing role for amssq47-62 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88742 (owner: 10Cmjohnson) [14:14:06] that... 
not what I mean [14:14:24] cmjohnson1: I'd like the amssq47 node definition to be the same as how it was before you changed it yesterday [14:14:29] separate completely, not in the regex [14:14:33] and with the ssl role in it [14:15:02] ok [14:15:25] (03Abandoned) 10Cmjohnson: isolating amssq47, changing role for amssq47-62 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88742 (owner: 10Cmjohnson) [14:16:17] mark: yea, actually i have already made a note [14:17:02] saw example used in gitblit role [14:17:39] yes [14:17:56] we shouldn't have these ip address ranges again and again everywhere [14:18:06] ferm will help with that [14:18:51] nod, need to check out what base::firewall does as a default [14:26:30] (03PS2) 10Dereckson: Logo configuration for *.wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86660 [14:27:03] PROBLEM - MySQL Processlist on db1020 is CRITICAL: CRIT 119 unauthenticated, 0 locked, 0 copy to table, 0 statistics [14:27:03] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:03] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:03] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - 3495 bytes in 0.184 second response time [14:27:04] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:04] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:04] PROBLEM - Apache HTTP on mw1212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:05] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:05] PROBLEM - Apache HTTP on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:11] (03CR) 10Dereckson: "PS2: Rebased" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86660 (owner: 10Dereckson) [14:27:13] PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:13] PROBLEM - Apache HTTP on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:13] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:13] PROBLEM - Apache HTTP on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:13] PROBLEM - Apache HTTP on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:21] Request: POST http://www.wikidata.org/w/index.php?title=Wikidata_talk:Bots&action=submit, from 10.64.0.137 via cp1006.eqiad.wmnet (squid/2.7.STABLE9) to () [14:27:21] Error: ERR_CANNOT_FORWARD :< [14:27:39] (03CR) 10Dereckson: "(logos)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86660 (owner: 10Dereckson) [14:28:07] uh oh [14:28:21] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:21] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:21] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.628 second response time [14:28:21] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.376 second response time [14:28:23] RECOVERY - LVS HTTP IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 78098 bytes in 2.083 second response time [14:28:23] RECOVERY - Apache HTTP on mw1176 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.958 second response time [14:28:23] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.205 second response time [14:28:23] RECOVERY - Apache HTTP on mw1175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.849 second response time [14:28:23] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.020 second response time [14:28:24] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.497 second response time [14:28:24] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.768 second response time [14:28:25] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.793 second response time [14:28:25] RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.202 second response time [14:28:26] RECOVERY - Apache HTTP on mw1215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.257 second response time [14:28:26] RECOVERY - Apache HTTP on mw1186 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.519 second response time [14:28:27] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.622 second response time [14:28:27] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.296 second response time [14:28:28] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.424 second response time [14:28:28] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.459 second response time [14:28:29] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.755 second response time [14:28:29] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [14:28:30] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.044 second response time [14:28:30] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [14:28:31] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [14:28:31] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [14:28:33] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.530 second response time [14:28:33] RECOVERY - LVS HTTP IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 593 bytes in 3.476 second response time [14:28:35] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.045 second response time [14:28:36] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [14:28:36] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [14:28:36] RECOVERY - Backend Squid HTTP on amssq31 is OK: HTTP OK: HTTP/1.0 200 OK - 1423 bytes in 0.187 second response time [14:28:36] RECOVERY - Frontend Squid HTTP on amssq45 is OK: HTTP OK: HTTP/1.0 200 OK - 1419 bytes in 0.464 second 
response time [14:28:36] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.969 second response time [14:28:36] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.111 second response time [14:28:37] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.595 second response time [14:28:37] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.814 second response time [14:28:38] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.014 second response time [14:28:38] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.749 second response time [14:28:39] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.863 second response time [14:28:39] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.563 second response time [14:28:40] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.374 second response time [14:28:40] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.686 second response time [14:28:41] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.394 second response time [14:28:43] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.189 second response time [14:28:43] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.479 second response time [14:28:43] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.657 second response time [14:28:43] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.501 second response time [14:28:43] PROBLEM - Frontend Squid HTTP on amssq33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:49] :P [14:28:53] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [14:28:53] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.092 second response time [14:28:53] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [14:28:53] RECOVERY - Apache HTTP on mw1212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [14:28:53] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [14:28:54] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.062 second response time [14:28:54] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.073 second response time [14:29:03] RECOVERY - LVS HTTPS IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 594 bytes in 0.130 second response time [14:29:06] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.046 second response time [14:29:06] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [14:29:06] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: 
HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [14:29:06] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [14:29:06] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [14:29:25] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.204 second response time [14:29:25] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.500 second response time [14:29:25] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.580 second response time [14:29:25] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.349 second response time [14:29:25] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.400 second response time [14:29:26] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.711 second response time [14:29:33] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.445 second response time [14:29:33] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.073 second response time [14:29:33] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.072 second response time [14:29:33] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.079 second response time [14:29:33] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.082 second response time [14:29:33] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.079 second response time [14:29:34] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.089 second response time [14:29:34] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.105 second response time [14:29:35] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.765 second response time [14:29:35] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.403 second response time [14:29:43] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.008 second response time [14:29:43] RECOVERY - Frontend Squid HTTP on amssq33 is OK: HTTP OK: HTTP/1.0 200 OK - 1417 bytes in 7.920 second response time [14:29:43] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.087 second response time [14:29:53] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [14:29:54] RECOVERY - Backend Squid HTTP on amssq33 is OK: HTTP OK: HTTP/1.0 200 OK - 1423 bytes in 0.180 second response time [14:30:03] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [14:30:03] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.106 second response time [14:30:14] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.090 second response time [14:30:14] RECOVERY - Apache HTTP 
on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.089 second response time [14:30:15] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.096 second response time [14:30:18] (03PS1) 10Cmjohnson: Fixing role for amssq48-62, reverting amssq47 change [operations/puppet] - 10https://gerrit.wikimedia.org/r/88744 [14:30:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.602 second response time [14:31:40] (03CR) 10Mark Bergsma: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88744 (owner: 10Cmjohnson) [14:34:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [14:34:52] mark: wondering where $INTERNAL used in the ferm rule becomes 10.0.0.0/8 when it's applied (as on antimony) [14:34:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 16.956 second response time [14:35:26] mutante: some config file, perhaps in the module? [14:35:28] iirc [14:35:50] (03PS2) 10Cmjohnson: Fixing role for amssq48-62, reverting amssq47 change [operations/puppet] - 10https://gerrit.wikimedia.org/r/88744 [14:36:45] (03CR) 10Mark Bergsma: [C: 031] Fixing role for amssq48-62, reverting amssq47 change [operations/puppet] - 10https://gerrit.wikimedia.org/r/88744 (owner: 10Cmjohnson) [14:38:05] (03CR) 10Cmjohnson: [C: 032] Fixing role for amssq48-62, reverting amssq47 change [operations/puppet] - 10https://gerrit.wikimedia.org/r/88744 (owner: 10Cmjohnson) [14:38:33] hmmm seems like some pages from 10 mins ago just arrived... [14:38:48] weird [14:38:56] :) [14:38:57] mark: ah, modules/base/files/firewall as opposed to modules/ferm/ [14:41:28] (03PS3) 10Akosiaris: elasticsearch plugins in git-deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/88455 [14:42:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [14:42:18] (03CR) 10Akosiaris: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88455 (owner: 10Akosiaris) [14:42:53] akosiaris: ah git-deploy !!! [14:43:10] hasharCall: yeeees ? [14:43:14] akosiaris: do we have any basic doc to add a project to git-deploy ? I am too lazy to reverse engineer the puppet manifests [14:43:28] akosiaris: andI hate how you have to change several hash all over the class :( [14:43:44] that makes two of us [14:43:54] ok so that is bug 1 [14:44:10] (our doc sucks) [14:44:16] https://wikitech.wikimedia.org/wiki/Git-deploy [14:44:36] this needs some updates... i am more of less testing it right now [14:44:50] yeah I have seen that one, not that helpful :( [14:45:07] really ? it is not that bad [14:45:21] I should bribe Ryan Lane to write us a noob step-by-step tutorial [14:45:45] he is going to just send you the same link [14:46:11] please .. 
send me an email with the obscure parts so i can fix them [14:46:15] (03PS1) 10Mark Bergsma: Remove upstream definition as we only connect to localhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/88747 [14:46:21] mutante: mw1125...before I enable pybal plz review make sure I didn't overlook something bringing it back to life with the new disk [14:46:43] (03CR) 10Akosiaris: [C: 032] elasticsearch plugins in git-deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/88455 (owner: 10Akosiaris) [14:47:28] akosiaris: greg-g is taking of it already :-] [14:47:42] akosiaris: we evoked the issue yesterday during a release/QA weekly checkin meeting. [14:48:03] cool. looking forward to it then [14:48:14] I might eventually have to dig in that myself [14:48:21] gotta need it to deploy some shell scripts to all jenkins slaves [14:48:35] I can't find a way to do it in Jenkins itself (aka run the same job on all slaves) [14:51:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 16.040 second response time [14:52:53] RECOVERY - HTTPS on amssq47 is OK: OK - Certificate will expire on 01/20/2016 12:00. [14:55:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [14:59:25] akosiaris: mailed you / ryan / greg regarding salt & git-deploy. Thanks for the suggestion! [14:59:34] :-) [15:01:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 27.977 second response time [15:02:30] akosiaris: also while you are context switched there: I got a bunch of debian backports pending. Want me to send you a summary ? [15:02:41] apparently faidon piped all requests to you :-] [15:02:55] i am gonna get him for that... [15:03:00] but yes ... please do [15:06:44] btw cmjohnson1 or mark, not sure who: [15:06:48] Oct 9 15:04:44 neon puppet-agent[19054]: Could not retrieve catalog from remote server: Error 400 on SERVER: Must pass ip_address to Monitor_service_lvs_http[wiktionary-lb.eqiad.wikimedia.org] at /etc/puppet/manifests/lvs.pp:984 on node neon.wikimedia.org [15:07:11] ah right [15:07:13] that would be me [15:10:22] (03PS1) 10Mark Bergsma: Correct text/text-varnish IP hash references [operations/puppet] - 10https://gerrit.wikimedia.org/r/88751 [15:10:35] akosiaris: I've marked you down as working on monitoring refactors here: https://wikitech.wikimedia.org/wiki/Puppet_Todo#Manifests_status [15:10:51] andrewbogott: :-) [15:11:00] akosiaris: But, regarding https://gerrit.wikimedia.org/r/#/c/88507/, there is an actual bug there, which is that it depends on an undefined package. [15:11:09] We can just remove that line, but things may happen out of order in that case. [15:11:32] well, that line may be intentionally there [15:12:15] because this class is supposed to be included only on hosts that also include a class that provide Package['icinga'] [15:12:18] namely icinga hosts... [15:12:34] that being said... I don't like it. [15:12:46] Um… well, a) depending on a sometimes-undefined-package as flow control is bad, and b) I'm pretty sure that package is /never/ defined. [15:12:46] Maybe b) is wrong... [15:13:20] (03CR) 10Mark Bergsma: [C: 032] Remove upstream definition as we only connect to localhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/88747 (owner: 10Mark Bergsma) [15:13:23] b) must be wrong because it works on the icinga host... [15:13:32] for a) i am with you [15:13:33] Yeah, you're right. [15:13:35] So it is flow control. 
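A sketch of the two patterns at issue here and in the exchange that follows: today the plugin file reaches only icinga hosts implicitly, by requiring a Package['icinga'] that is only declared there, while the alternative being suggested is to declare the plugin virtually, tag it, and let the icinga module realize the collection. Resource titles and source paths below are illustrative, not the actual manifests:

    # current pattern: flow control via a package that is only declared
    # on icinga hosts; breaks the run anywhere Package['icinga'] is absent
    file { '/usr/lib/nagios/plugins/check_elasticsearch':
        source  => 'puppet:///modules/elasticsearch/check_elasticsearch',
        require => Package['icinga'],
    }

    # suggested pattern: declare virtually (note the @) in the owning module...
    @file { '/usr/lib/nagios/plugins/check_elasticsearch':
        source => 'puppet:///modules/elasticsearch/check_elasticsearch',
        tag    => 'icingaplugin',
    }
    # ...and realize the whole tagged collection from the icinga module
    File <| tag == 'icingaplugin' |>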
[15:13:39] (03CR) 10Mark Bergsma: [C: 032] Correct text/text-varnish IP hash references [operations/puppet] - 10https://gerrit.wikimedia.org/r/88751 (owner: 10Mark Bergsma) [15:13:47] I'll see if I can fix that. [15:14:26] i have the feeling it won't be easy to fix in-module [15:14:43] akosiaris: regarding monitoring refactors… do you think that e.g. icinga config files should live in the icinga module, or should they live in the respective modules of the components that are getting monitored? [15:15:00] In this patch I move a bunch of mysql monitoring stuff into the mysql module, but I'm not 100% certain that's the right approach. https://gerrit.wikimedia.org/r/#/c/88666/ [15:15:07] ahhhh monitoring [15:15:22] b) as virtual resources realized by the icinga module [15:15:26] some folks gave me a crazy tip: have icinga to send plugins performances data to statsd / graphite. So we can graph them :-] [15:16:19] akosiaris: sorry, I don't quite understand what you mean by virtual-resources-realized. [15:16:39] hashar, that seems reasonable :) [15:17:14] andrewbogott: I think it is the right approach but the manifests should define the file resources populating the plugins virtual (with a single @ before them) [15:17:26] Jeff_Green: the ulsfo fundraising backup server will arrive tomorrow. I'll rack it up and get the mgmt accessible. Did you need anything special for setup or just typical OS? [15:17:38] and tagging them so the icinga module can realize them with a collection [15:18:01] like <| File tag == 'icingaplugin |> [15:18:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:18:16] RobH: thinking... [15:18:24] Oh! OK, I've never done something like that, let me read some docs. [15:18:55] so ... in case the icinga classes are not included in a host, the plugins never get realized, stuff never gets populated... otherwise... we get what we want :-) [15:19:25] RobH: let's do a hardware RAID1 with /boot, swap, and the rest in / [15:19:40] robh and just the usual minimal precise install [15:20:02] I'll hook it up to frack puppet, and set up the add-on drives as software RAID1 later [15:20:03] cool, i tend to put the / into an lvm [15:20:05] that cool? [15:20:16] eh [15:20:20] no lvm? [15:20:25] i don't bother personally [15:20:28] thats fine too, just default is now LVM [15:20:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.123 second response time [15:20:55] i got my car stuff all fixed yesterday, so now i can drive to ulsfo, no more taxis! [15:20:59] woooo [15:21:01] nice [15:21:07] car insurance in CA is insane expensive. [15:21:13] oh yes [15:21:17] I spend more per month now than I did per 3 months in DC [15:21:28] you know to shop around right? it's really varied [15:21:38] akosiaris: OK, reading a bit (but not enough yet, probably…) defining these files as virtual in the icinga code is basically just documentation, right? [15:21:42] in MA it's all regulated [15:21:50] I've been with progressive now for a decade, my discounts are good [15:21:57] but i tried state farm and geico as well [15:22:02] Given that the icinga manifests don't /need/ to know about the files... [15:22:13] RobH: I'll send you a recommendation by pm [15:22:19] cool [15:22:26] cuz can always swap insurance, get refund =] [15:23:17] andrewbogott: no.... 
if you don't declare in the icinga module (or wherever for that matter) something realizing them there will never be any files in the systems in question [15:24:07] the idea is that you transfer the decision whether a resource becomes real [15:24:13] from the actual class to another class [15:24:15] Oh, maybe I had this backwards… you think they should be declared (virtually) by mysql, and realized by the icinga module? [15:24:21] exactly [15:25:08] Ok, clearly I need to read more... [15:25:12] Hey guys. I'm having trouble connecting to db1047 from stat1.wikimedia.org. [15:25:18] "Can't connect to MySQL server on 'db1047.eqiad.wmnet'" [15:25:28] Was working 10 minutes ago. [15:26:04] hashar: akosiaris I was pinged because of git-deploy? Do I need to read scrollback more closely? :) [15:26:23] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-db1047 [15:26:29] greg-g: i can summarize [15:27:48] I have started deploying elastic search plugins with git-deploy. Hashar noticed and asked for some more clarification on how/what git-deploy should be used for and more documentation [15:28:31] plus some possible changes... mostly DRY stuff [15:29:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:29:17] greg-g: na it is your email [15:29:34] greg-g: basically was ranting at git-deploy doc, alexanders hinted to send an email to ryan , him and you ) [15:29:41] cool, perfect :) [15:35:25] greg-g: just so you're aware more projects are on varnish in eqiad as of today [15:35:36] everything non wikipedia that is not accessed over ipv6 or https [15:36:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.077 second response time [15:36:07] Oooh, more text on Varnish? Nice [15:36:12] next steps are doing the same in esams [15:36:14] mark: cool! [15:36:20] and then we're gonna look at moving the big one ;) [15:36:52] mark again: neon, Oct 9 15:34:43 neon puppet-agent[15026]: Could not retrieve catalog from remote server: Error 400 on SERVER: left operand of - is not a number at /etc/puppet/manifests/lvs.pp:1164 on node neon.wikimedia.org [15:36:56] i'll plan that for a few days before my 3 week vacation to thailand I think [15:37:35] apergos: fixing [15:37:37] mark: wow, 3 weeks? 
niiice [15:37:46] thanks [15:38:03] (03PS1) 10Mark Bergsma: Correct typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/88754 [15:39:27] honestly, text varnish traffic jumped by just 10% or so today [15:39:34] with all the remaining non-wikipedia projects [15:39:45] doesn't stack up to commons/meta ;) [15:39:48] (03PS2) 10Hashar: Remove unused (and broken) labs config from squid manifest [operations/puppet] - 10https://gerrit.wikimedia.org/r/88698 (owner: 10Ori.livneh) [15:39:57] (03CR) 10Hashar: [C: 031] Remove unused (and broken) labs config from squid manifest [operations/puppet] - 10https://gerrit.wikimedia.org/r/88698 (owner: 10Ori.livneh) [15:40:20] (03CR) 10Mark Bergsma: [C: 032] Correct typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/88754 (owner: 10Mark Bergsma) [15:42:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:43:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.151 second response time [15:46:12] (03PS1) 10Dzahn: add networks to ferm,convert icinga iptables to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/88755 [15:46:30] (03CR) 10jenkins-bot: [V: 04-1] add networks to ferm,convert icinga iptables to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/88755 (owner: 10Dzahn) [15:47:06] mutante: you should split that in different commits [15:48:00] and translating the existing iptables stuff 1:1 to ferm is probably not the right way to go at it... [15:48:41] mark: ok, splitting it [15:49:51] mutante: coool! [15:49:57] I'd be happy to help with that [15:50:06] I'm currently on a trip with very crappy conference wifi [15:50:22] but I can provide asynchronous help for now [15:50:35] :) [15:51:39] (03PS2) 10Dzahn: add our networks as variables to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/88755 [15:52:31] actually, I have another take on that [15:52:48] we should add bastions (and whatever else is missing) to network.pp [15:52:58] and then make this a template generated from network.pp [15:53:05] yes [15:53:55] so that we have one single source for network info [15:54:08] mutante: ^ [15:54:32] also, there's no reason to put single IPs between parentheses [15:54:56] single ips/networks that is [15:55:10] but it doesn't hurt either [15:58:57] ok, I've got to go [15:59:15] everyone went off for socializing/drinks [15:59:40] paravoid: thanks, enjoy the drinks [16:04:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:05:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.045 second response time [16:06:38] * apergos wonders why mw1017 has packages 5.3.10-1ubuntu3.8+wmf1 for many php packages and now wants to downgrade to 5.3.10-1ubuntu3.6+wmf1 (and fails)  [16:08:03] for comparison, mw1020 is happily on 5.3.10-1ubuntu3.6+wmf1 in the first place [16:09:27] apergos: that would be me [16:10:01] see https://rt.wikimedia.org/Ticket/Display.html?id=5912 [16:14:15] ah all to be completed... tomorrow? or 10-08? 
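Picking up paravoid's suggestion above about the ferm change: rather than copying the iptables address literals 1:1, the ranges would live once in network.pp (bastions and anything else missing added there) and the ferm side would be generated from that single source, whether as a template or a shared variable. A very rough sketch; the class name, variable name and ferm::rule interface are assumptions, and the ranges are only examples:

    # one place that knows our networks (network.pp or similar)
    class network::constants {
        $all_networks = '(208.80.152.0/22 91.198.174.0/24 2620:0:860::/46 10.0.0.0/8)'
    }

    # a converted icinga rule then points at the shared list
    ferm::rule { 'icinga-nsca':
        rule => "proto tcp dport 5667 saddr ${network::constants::all_networks} ACCEPT;",
    }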
anyways, shortly [16:15:28] ok, puppet is unhappy over there but that wil go away at the end of this, thanks for the info [16:20:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:21:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 22.258 second response time [16:23:56] (03PS2) 10Andrew Bogott: Tune up installation behavior for /usr/lib/nagios/plugins/check_elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [16:25:38] akosiaris: First try, does this look right? [16:25:49] looking at it now [16:26:18] (03CR) 10Akosiaris: [C: 032] Tune up installation behavior for /usr/lib/nagios/plugins/check_elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [16:26:22] :-) [16:28:18] cool, thanks. [16:31:17] anybody here know much about git deploy? [16:33:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:34:23] ottomata: that would be me [16:34:36] awesome! [16:34:43] i am at least trying to perform my first deploy today [16:34:46] nice! [16:34:49] what are you deploying? [16:34:56] elasticsearch plugins [16:34:59] aye cool [16:35:06] great, then you are def the man I want to talk to [16:35:18] i'm looking into using it for kraken [16:35:22] which is also mostly java stuff [16:36:31] right now, all I really want to do is deploy versions of the kraken repository to a directory on all of the analytics (hadoop) nodes [16:36:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 15.574 second response time [16:37:07] (03CR) 10Faidon Liambotis: [C: 04-1] "18:52 < paravoid> actually, I have another take on that" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88755 (owner: 10Dzahn) [16:37:42] ok... so at which stage are you ? [16:37:50] the very beginning! [16:37:55] we have a repository :) [16:38:09] https://wikitech.wikimedia.org/wiki/Git-deploy [16:38:20] so this is your first step... reading ... [16:38:35] i am also reading it ... whatever seems unclear ... tell me ... maybe i know [16:38:46] maybe not in which case we fallback to ryan [16:39:16] k reading [16:39:22] also... my first commit for configuring git-deploy for elastic search is https://gerrit.wikimedia.org/r/#/c/88455/ [16:39:33] keep in mind a very simple rule [16:39:44] whatever directory you are deploying from on tin [16:39:53] that is the destination directory on the end servers [16:40:00] !log puppetmaster was overloaded and returning 500s, tried killing open processes and restarting puppetmaster on stafford [16:40:04] Coren: ^ [16:40:16] Logged the message, RobH [16:40:18] puppetmaster is constantly at 100% ... [16:40:24] yep [16:40:28] im not the first person to do this. [16:40:41] yeah i know.. i 've done it too [16:40:44] My own problem is with the puppetmaster on virt0 which just fails to start. [16:40:51] Coren: doesnt it use the same puppetmaster? [16:40:57] oh, maybe not [16:41:03] but i got puppetmaster fails for stafford [16:41:04] heh [16:41:19] akosiaris: So in the past i brought up buying a much heftier machine for puppetmaster [16:41:30] but i think the general consensus was figure out a way to scale horizontally [16:41:40] yeah... 
i already have something in mind [16:41:45] i just liked the stop-gap solution of throwing some more hardware at it in meantime =] [16:42:11] talked about it with Faidon, i think i am gonna have some time to look at it next week [16:42:16] really, so /srv/deployment everywhere akosiaris? [16:42:44] RobH: so I might show up and say gimme two hefty machines in eqiad :-) [16:43:03] ottomata: well... it is like that now.. it doesn't have to be [16:43:05] wouldn't puppetdb help? [16:43:20] but i chose the same directory and will use symlinks to keep it clean [16:43:26] akosiaris: for puppet, you can have whatever you want. [16:43:27] more than one master would be awesomely awesome [16:43:41] i don't feel like deploying from /usr/share/java/blah blah [16:43:53] akosiaris: so the best i have now is Dell PowerEdge R420, Intel Xeon E5-2440, 32GB Memory, Dual 300GB SSD, Dual 500GB Nearline SAS [16:44:02] i have three of those [16:44:16] also have non ssd and 16GB version [16:44:25] you have many choices =] [16:44:32] https://wikitech.wikimedia.org/wiki/Server_Spares [16:44:44] (but you still have to file procurement ticket, you cannot just take items off that page ;) [16:44:53] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time [16:44:54] Memory we probably won't need 32G... it is barely 8G at stafford now... [16:45:12] and i never see the machine in IOwait so ... no SSDs ? [16:45:18] then i imagine the Dell PowerEdge R610, dual Intel Xeon X5647, 16 GB Memory works [16:45:25] RobH: virt0's problem, at least, was fixed with a kick in the apache. [16:45:25] which is good, i have many of those. [16:45:47] Coren: yea no worries, i misunderstood your issue and it happened to occur at same time as stafford error report [16:46:42] so 8 core, 16 (yeah right) with HT (but we do disable HT right?) [16:47:03] akosiaris: is there a reason you dropped the / from your deployment target? [16:47:04] "elasticsearch/plugins" => "elasticsearchplugins", [16:47:10] stafford is gonna be unhappy til we get something to share the load with it [16:47:26] apergos: yes :-/ [16:47:48] akosiaris: actually its case by case bassis [16:47:59] we disable HT for apaches, as we have not seen a reason to keep it on [16:48:07] but recently for other items we have found performance increases [16:48:08] ottomata: yes... did not want to pass a / at deployment::target [16:48:20] why not though? just curious [16:48:25] akosiaris: so HT is off or on depending on what you think puppet will respond well to, there is no policy stating it has to be off. [16:49:20] ottomata: well it is the name of the repository [16:49:47] it *isn't* the name of a directory [16:49:55] a) no false impressions that this is a directory [16:50:08] its just an arbitrary name, right? [16:50:16] b) avoid potential problems that might show up [16:50:21] yes ... it is arbitrary [16:50:32] hm [16:50:38] it is a salt grain [16:50:38] hm [16:50:43] ok i will use a hyphen :[ [16:50:45] :p [16:51:29] RobH: ok.. i 'll keep that in mind thanx [16:57:25] (03PS1) 10QChris: Turn off fetching geowiki-data until it comes back in gerrit [operations/puppet] - 10https://gerrit.wikimedia.org/r/88759 [16:58:53] have you guys seen this? 
https://github.com/jamtur01/puppet-ganglia [16:59:26] it's a puppet reporting plugin that sends stats to ganglia, written by james turnbull [17:00:08] (03PS1) 10Ottomata: Adding deployment target for analytics/kraken repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 [17:00:46] hmm, i'll adapt it for statsd [17:00:49] this is awesome [17:01:51] akosiaris: https://gerrit.wikimedia.org/r/#/c/88760 [17:03:19] (03CR) 10Ottomata: [C: 032 V: 032] Turn off fetching geowiki-data until it comes back in gerrit [operations/puppet] - 10https://gerrit.wikimedia.org/r/88759 (owner: 10QChris) [17:06:27] (03CR) 10Akosiaris: [C: 031] "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 (owner: 10Ottomata) [17:09:05] ori-l: the report api is not very well designed [17:09:10] you can only have one [17:09:30] one what ? [17:09:37] report processor [17:09:42] ah... :-( [17:10:10] damn. I was planning to use that for puppet freshness checks [17:11:10] it's a very simply ruby api, could just implement a super-thin reporter that delegates to multiple other reporters [17:17:35] (03PS1) 10Akosiaris: Add plugins head, paramedic and segmentspy [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/88762 [17:18:51] ori-l: someone has done it already :) [17:19:01] but yeah, it's suboptimal [17:19:58] (03CR) 10Akosiaris: [C: 032] Add plugins head, paramedic and segmentspy [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/88762 (owner: 10Akosiaris) [17:21:03] RECOVERY - Puppet freshness on amssq49 is OK: puppet ran at Wed Oct 9 17:20:56 UTC 2013 [17:26:13] akosiaris: howdy. did your git-deploy deployment go ok? [17:27:31] Ryan_Lane: finishing it now... we well soon now [17:27:52] (03CR) 10Akosiaris: [V: 032] Add plugins head, paramedic and segmentspy [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/88762 (owner: 10Akosiaris) [17:28:13] thanx for the help btw... and the docs ... pretty cool :-) [17:28:19] cool, let me know if you run into any issues [17:28:24] yw [17:32:01] (03CR) 10Awjrichards: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [17:32:37] Ryan_Lane: # USER : You must now hand execute the synchronization process and then execute git-deploy finish... I am lost here .... [17:33:13] where's that? [17:33:34] tin:/srv/deployment/elasticsearch/plugins# [17:33:41] it is after a git-deploy --force sync [17:33:44] oh [17:33:53] akosiaris: did you do the temporary hack? [17:33:58] https://wikitech.wikimedia.org/wiki/Git-deploy#.28Temporary_hack.29_As_root.2C_create_a_top_level_directory_for_the_sync_hook [17:34:04] the /var/lib stuff ? yes [17:34:49] hm. how are the directory permissions messed up for your repo? [17:35:06] :D owner by root/wikidev? [17:35:18] you should be using your own user for creating the repo and such [17:35:21] hmmmmm [17:35:29] that i did not know... [17:35:30] that wouldn't cause this problem, though [17:35:42] one sec [17:35:57] ah [17:36:07] /var/lib/git-deploy/hooks/sync/elasticsearchplugins [17:36:43] that should be /var/lib/git-deploy/hooks/sync/elasticsearch [17:36:53] damn [17:36:57] yeah i just noticed [17:37:01] sorry :-( [17:37:11] no worries. this is a stupid error-prone step [17:37:30] I'm working on a change to automate even the bootstrapping [17:37:57] so that no manual steps will be done [17:39:14] do: git deploy abort [17:39:35] just did ... 
and then start and then --force sync [17:39:37] same error [17:40:42] fatal: Unknown commit none/master [17:40:47] maybe that is the reason ? [17:41:32] (03CR) 10Ryan Lane: [C: 04-1] "Agreed on the inline comments." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 (owner: 10Ottomata) [17:42:19] has puppet run on the targets? [17:42:31] it should have... a long time now [17:42:37] * Ryan_Lane nods [17:43:02] one sec [17:43:13] RECOVERY - NTP on amssq49 is OK: NTP OK: Offset -0.01337885857 secs [17:45:59] hm [17:46:28] maybe a permissions issue [17:48:10] File does not exist: /srv/deployment/elasticsearch/plugins/.git/info/refs [17:48:13] on tin [17:48:38] huh ? [17:50:19] that's what apache says [17:51:34] akosiaris: can you do: git deploy abort [17:51:56] done [17:52:36] hm [17:52:41] it says you're still in a deploy [17:52:56] I'm going to wipe out the state files [17:53:14] Running: sudo salt-call -l quiet --out json pillar.data [17:53:14] Running: sudo salt-call -l quiet publish.runner deploy.fetch 'elasticsearch/plugins' [17:53:19] it is still running these... [17:53:21] ah [17:53:25] no wonder [17:53:37] heh. well you're going to get an error in a second :) [17:53:44] just did [17:53:46] because I wiped out the deployer's state files [17:53:56] ok, so I know what happened [17:53:57] Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): [17:54:02] ctrl-c [17:54:12] *** DON'T PANIC *** [17:54:15] heh [17:54:15] looooooooooool [17:54:26] i like the touch :-) [17:55:09] the sync script (which didn't get run because the directory didn't exist) runs something that makes the repo info available to apache [17:55:51] which means the minions can't find the repo [17:56:19] and it put the deployment into a weird state, because it assumes the script is there ;) [17:56:23] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [17:56:30] akosiaris: so, try the deployment again [17:56:34] it'll work this time [17:56:49] I'll push in some changes to make this impossible :) [17:56:54] icinga-wm is very strangely silent [17:57:00] oh, not it isn't [17:57:17] so... let me get this straight first [17:57:35] if i had mkdir the correct path in /var/lib/git-deploy [17:57:39] everything would be ok ? [17:58:02] I need somebody to run the debugger as user parsoid on one of the parsoid machines [18:01:07] akosiaris: yep [18:01:21] # NOTE : Looks like you are all done! Have a nice day. [18:01:31] you too git-deploy :-). Thanks Ryan_Lane :-) [18:01:35] heh. yw [18:03:16] manybubbles: could i restart (or whatever is needed for elastic search to see its plugins ) elasticsearch on testsearch1001 ? [18:04:03] akosiaris: you can restart it, but you shouldn't restart another one until the cluster status has turned 'green' [18:04:20] akosiaris: curl -i localhost:9200/_cluster/health [18:04:29] ok thanks [18:04:37] akosiaris: but if it is a site plugin it might not need a restart [18:07:23] RECOVERY - Puppet freshness on amssq50 is OK: puppet ran at Wed Oct 9 18:07:12 UTC 2013 [18:12:14] any roots around to run some sudo -u parsoid commands for me? 
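manybubbles' restart guidance above amounts to a one-node-at-a-time loop: restart, then wait for the cluster to report green before touching the next host. A rough shell sketch; the service name and sleep interval are assumptions:

    # on one elasticsearch node at a time
    sudo service elasticsearch restart
    # wait until shard recovery is done before moving to the next node
    until curl -s localhost:9200/_cluster/health | grep -q '"status" *: *"green"'; do
        sleep 10
    done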
[18:13:12] gwicke: tell me [18:14:23] PROBLEM - Puppet freshness on amssq48 is CRITICAL: No successful Puppet run in the last 10 hours [18:14:43] akosiaris: I'd like to attach the debugger to process 7957 on wtp1015: su gwicke; screen -x; sudo -u parsoid node debug -p 7957 [18:15:27] or something like that, so that I can attach to the screen session and poke around in the debugger after it is attached [18:15:33] RECOVERY - NTP on amssq50 is OK: NTP OK: Offset 0.005442976952 secs [18:16:19] I opened a ticket for sudo -u parsoid access too at https://rt.wikimedia.org/Ticket/Display.html?id=5934 [18:17:23] PROBLEM - Puppet freshness on amssq51 is CRITICAL: No successful Puppet run in the last 10 hours [18:19:03] RECOVERY - Puppet freshness on amssq51 is OK: puppet ran at Wed Oct 9 18:18:58 UTC 2013 [18:19:35] gwicke: that ain't gonna work... screen will fail about not being able to open mine (root's) pts device [18:20:35] akosiaris: we did something like this before- I got into the debugger in a shared screen session running as my user [18:21:12] so maybe 'su - gwicke' ? [18:21:45] doesn't have to do with how i change my uid... it about the uid change... [18:21:59] gimme a sec [18:26:21] akosiaris: maybe make that pid 5606 instead [18:28:10] gwicke: try screen -x root/shared [18:28:34] Must run suid root for multiuser support. [18:29:03] RECOVERY - Puppet freshness on amssq53 is OK: puppet ran at Wed Oct 9 18:28:54 UTC 2013 [18:29:33] grrr [18:31:13] Ryan_Lane: should the pmtpa deployment_repo_urls use $deploy_server_pmtpa? [18:31:23] some of them do, some of them use $deploy_server_eqiad [18:31:52] gwicke: again plz it should work now [18:32:22] akosiaris: am in, thanks! [18:32:53] RECOVERY - Puppet freshness on amssq54 is OK: puppet ran at Wed Oct 9 18:32:46 UTC 2013 [18:33:21] gwicke: just don't quit the node debugger... you will back to square zero then [18:33:57] (03PS2) 10Ottomata: Adding deployment target for analytics/kraken repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 [18:34:08] akosiaris: yes, I know [18:34:38] would be good to get sudo -u parsoid access on those machines [18:34:50] ottomata, Coren, is stat1 still a going concern? (I have no reason to doubt it, I'm just following a dependency thread and the stat1 entry in site.pp is the last stop. Want to make sure it's used before I dive in.) [18:35:26] stat1 conecrn about what? [18:35:29] andrewbogott: I know nothing of stat1. [18:35:50] Coren, ottomata, git-blame flags you as having touched its site.pp entry :) [18:36:17] ottomata: by 'going concern' I mean, is it still used, running, etc? Or is it a tampa box marked for death? [18:36:24] Or both? :/ [18:36:29] still running :) [18:36:33] yes, still used, [18:36:42] ok, thanks. [18:36:47] andrewbogott: It may have been when I did the global seek and destroy of foo == "true" [18:36:52] but mainly as an entry point for people to access research db slaves [18:37:14] and as the host of the old user metrics api [18:37:20] which will be deprecated someday? drdee? [18:37:23] ottomata: include role::statistics::cruncher <- the line of interest [18:37:27] are there any other active uses we know of? [18:37:50] yes umapi will be deprecated [18:37:51] yes, it was previously used to crunch private webrequest udp2log data [18:37:55] but that has been moved to stat1002 [18:38:03] RECOVERY - Puppet freshness on amssq55 is OK: puppet ran at Wed Oct 9 18:37:52 UTC 2013 [18:38:05] when does stat1 need to be moved to eqiad? [18:38:25] Oh! 
So maybe I can purge role::statistics::cruncher and related? [18:39:12] eeeeeeeeeeee [18:39:19] hm [18:39:53] drdee_, are we still doing gerrit stats? [18:39:55] (Context: There's a bunch of code in mysql.pp that is only used by a) stuff I wrote, and b) cruncher ) [18:39:58] no [18:40:32] hmm, i think mysql is not used on stat1 [18:40:33] RECOVERY - NTP on amssq51 is OK: NTP OK: Offset -0.0151873827 secs [18:40:38] i'm looking at the databases there [18:40:48] declerambaul and erosen [18:40:53] neither of them work for WMF anymore [18:41:33] andrewbogott: is it just the misc::statistics::db::mysql class? [18:41:36] we can probalby just remove that [18:42:12] ottomata: Ideally I want to purge generic::mysql::server entirely [18:42:23] PROBLEM - Puppet freshness on amssq52 is CRITICAL: No successful Puppet run in the last 10 hours [18:42:26] manybubbles: seems like testsearch1001 has the plugins you asked deployed just fine. [18:42:58] I will be submitting a gerrit changeset soon to have them enabled everywhere [18:43:13] RECOVERY - Puppet freshness on amssq56 is OK: puppet ran at Wed Oct 9 18:43:05 UTC 2013 [18:44:04] akosiaris: sweet! thanks! [18:44:56] andrewbogott: the only place I see that used by stat1 is the misc::statistics::db::mysql class [18:45:01] so we can just remove that class [18:45:20] and its include from cruncher [18:45:33] ottomata: cool. I'll ping you when I have a patch. [18:45:56] k [18:46:03] (03PS3) 10Ottomata: Adding deployment target for analytics/kraken repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 [18:46:15] (03CR) 10Ottomata: [C: 032 V: 032] Adding deployment target for analytics/kraken repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 (owner: 10Ottomata) [18:49:13] RECOVERY - Puppet freshness on amssq57 is OK: puppet ran at Wed Oct 9 18:49:08 UTC 2013 [18:49:33] RECOVERY - NTP on amssq53 is OK: NTP OK: Offset -0.01760184765 secs [18:50:33] RECOVERY - NTP on amssq56 is OK: NTP OK: Offset 0.005195617676 secs [18:53:13] RECOVERY - NTP on amssq54 is OK: NTP OK: Offset -0.01426458359 secs [18:53:37] (03PS1) 10Akosiaris: Enable elastic search plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 [18:54:53] RECOVERY - Puppet freshness on amssq58 is OK: puppet ran at Wed Oct 9 18:54:46 UTC 2013 [18:56:14] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:56:53] RECOVERY - Puppet freshness on amssq59 is OK: puppet ran at Wed Oct 9 18:56:44 UTC 2013 [18:57:12] (03CR) 10Akosiaris: "Should be in module or role ? I 'd normally choose the role class as this means the module can be kept rather non WMF-specific, but I saw " [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 (owner: 10Akosiaris) [18:58:01] hey akosiaris or Ryan_Lane [18:58:06] I need to have an account on tin, right? [18:58:13] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [18:58:13] you don't ? 
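For reference, wiring up analytics/kraken here follows the same moving parts as the elasticsearch plugins: a deployment::target on every host that should receive the checkout (its name becomes a salt grain, hence the earlier no-slash/hyphen discussion), plus the repo entry on the deployment server, with the checkout landing under /srv/deployment on both sides. A rough sketch; apart from deployment::target itself, the names and paths are assumptions:

    # on the role class of the target hosts
    deployment::target { 'analytics-kraken': }

    # on tin the repo is checked out at /srv/deployment/analytics/kraken,
    # and, going by Ryan_Lane's correction above for elasticsearch/plugins,
    # the (temporary) sync hook directory would presumably be
    #   /var/lib/git-deploy/hooks/sync/analytics/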
[18:58:18] should I add myself to the admins::mortals class, or should I already be in admins::roots [18:58:19] no [18:58:51] well i did everything as root but i get the feeling ryan would prefer it if you did it with your account [18:59:09] yeah, i got an error [18:59:16] as root [18:59:16] # git deploy start [18:59:16] # FATAL: Your umask is not set properly; got 0022 instead of 0002 [18:59:38] hmm, also, i think i like your symlink changesset better than what I did, not sure though [18:59:45] i just set the deployment location to /srv/analytics [18:59:53] RECOVERY - Puppet freshness on amssq60 is OK: puppet ran at Wed Oct 9 18:59:48 UTC 2013 [18:59:56] (03PS1) 10Andrew Bogott: Removed mysql server from the stats cruncher role. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88790 [18:59:56] and then symlinked that on tin to /srv/deployment/analytics, where I actually cloned the repo [19:00:06] ottomata, drdee: ^ [19:00:16] akosiaris: would it be better if i left /srv/deployment/analytics intact and then puppetized a symlink on the analytics nodes? [19:00:44] (03CR) 10Ottomata: [C: 032] Removed mysql server from the stats cruncher role. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88790 (owner: 10Andrew Bogott) [19:00:50] andrewbogott: why? [19:01:04] we need mysql on that box [19:01:09] its ok drdee, mysql is not being used there and this does not actually remove mysql server [19:01:10] ottomata: well it might be better from a deployment point of view. Everything under a directory [19:01:13] RECOVERY - NTP on amssq55 is OK: NTP OK: Offset -0.009818077087 secs [19:01:15] drdee_: OK… we just had a lengthy discussion about this [19:01:20] which I thought you were a part of? [19:01:31] we are about to use it for the page view api [19:01:41] well, on tin I actually cloned in side of /srv/deployment [19:01:44] but it's really about fs hygiene... I ain't gonna force it on you :-) [19:01:46] ottomata: "this does not actually remove mysql server" it doesn't? [19:01:52] i just made the configs point at /srv/analytics, so that it would be deployed there [19:01:52] but hm [19:01:56] oky missed that [19:01:57] sorry [19:02:01] no, deletting a class doesn't acutally do anything [19:02:05] you'd have to ensure => absent [19:02:22] (03CR) 10Manybubbles: "I'd go with the role. I think the only WMF specific stuff we have in the module is the java thing." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 (owner: 10Akosiaris) [19:02:28] Well, no, but… if you're using it it should stay in puppet [19:02:44] Otherwise I'd be inclined to remove those packages by hand after the patch merges. [19:03:22] drdee_, i'm pretty sure we aren't using it [19:03:37] there aren't any databases there other than declerambaul and erosen [19:03:40] but we are about to start using it ;) [19:03:53] for what? [19:03:57] pageview api [19:04:06] naw, let's put it somewhere else, not in pmtpa, right? [19:04:15] I would hope that anything you're about to start doing would happen in eqiad [19:04:20] ottomata: everything should always be under /srv/deployment [19:04:23] PROBLEM - Puppet freshness on amssq61 is CRITICAL: No successful Puppet run in the last 10 hours [19:04:35] definitely not /srv/analytics [19:04:36] :) [19:04:43] i'd like it deployed there though [19:04:48] ottomata: sure let's talk about it later [19:04:49] too bad? 
:) [19:04:52] seems weird to deploy software to /srv/deployment [19:04:58] i mean, i can symlink it on the analytics nodes [19:05:08] so that's fine with me [19:05:10] just seems weird [19:05:16] i understand keeping it on tin like that [19:05:18] don't do symlinks [19:05:23] be consistent [19:05:25] but seems weird to force the deployment to a particular directory [19:05:38] it's actually done on purpose [19:05:46] that way you know where the software is on both sides [19:05:46] always [19:05:51] its ok if I make a symlink on the analytics nodes to /srv/deployment analytics, no? [19:05:54] without needing to search for anything, or documentation [19:06:00] that way tin and deployment all is exactly the same [19:06:00] no. please don't [19:06:03] ? [19:06:14] be consistent with how everything else is [19:06:16] not sure why, i'm saying tin and deployment all works the same [19:06:27] (03PS2) 10Akosiaris: Enable elastic search plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 [19:06:53] I'm not sure how the symlink helps [19:06:53] it just makes things confusing [19:07:09] hm, it makes things not tied to a specific deployment system [19:07:35] akosiaris: just did a symlink for elasticsearch, so that the plugins could be used [19:07:53] don't bring me into this :P [19:07:57] heh [19:08:05] haha, that's where my idea came from! [19:08:10] ottomata: if you don't need a symlink you shouldn't use one :) [19:08:26] if you can configure your software to point at specific locations, that's what you should do [19:08:42] if you can't, which I'd imagine is the case with the plugins, then you can use a symlink [19:08:43] i can, and will if you insist ,but I don't like it. why shoudl I have to configure software FOR git-deploy? [19:09:03] RECOVERY - Puppet freshness on amssq61 is OK: puppet ran at Wed Oct 9 19:08:56 UTC 2013 [19:09:07] git-deploy should be configured for my software :p [19:09:21] it's meant for the sanity of everyone else [19:09:23] Ryan_Lane: actually it does ... [19:09:27] silly me... 
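The symlink akosiaris mentions for the plugins is the case Ryan_Lane allows for: software that cannot simply be configured to look in /srv/deployment. Something along these lines would do it; the elasticsearch plugin directory path is an assumption based on the Debian package layout:

    # point elasticsearch's plugin directory at the git-deploy checkout
    file { '/usr/share/elasticsearch/plugins':
        ensure => link,
        target => '/srv/deployment/elasticsearch/plugins',
        force  => true,    # replace the packaged directory if one exists
    }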
[19:09:45] ottomata: gonna drop the change that caused all that :-/ [19:09:59] this would be like asking to code a website FOR a particular webserver [19:10:06] ottomata: I can make git-deploy deploy anywhere on the minions, but then people trying to debug stuff need to go fishing [19:10:13] RECOVERY - NTP on amssq57 is OK: NTP OK: Offset -0.01427590847 secs [19:10:28] I made this decision due to the current deployment system and trying to track down shit on the mediawiki systems [19:11:04] yeah i guess so, i just don't like coupling systems together if I can help it [19:11:11] to really figure out what's happening on the mw system you need to go fishing [19:11:16] it takes a while and it's silly [19:11:32] you aren't coupling them [19:11:42] your applications support being able to be deployed anywhere [19:12:01] yeah, in this case its not so bad, just an extra directory on a system for not apparent reason [19:12:04] i'll change it [19:12:07] just had to object :p [19:12:14] RECOVERY - Puppet freshness on amssq62 is OK: puppet ran at Wed Oct 9 19:12:03 UTC 2013 [19:12:23] but there is a good reason ;) [19:12:45] it's for every person that needs to debug your software when you aren't around [19:12:46] also, real quick, shoudl the pmtpa urls use $deploy_server_pmtpa [19:12:49] I know that points at tin [19:12:50] anyway [19:12:52] yeah [19:13:02] but the pmtpa config has both [19:13:15] both deploy_server_pmtpa and deploy_server_eqiad [19:13:19] notice the comments that go with those [19:13:24] ok [19:13:28] you can use eqiad, if you comment it [19:13:30] ah ok, so analytics is eqiad only as well [19:13:35] ok [19:13:53] some of this config will go away anyway [19:14:11] I've been pushing in changes to make this easier to configure the past few weeks [19:14:15] oh cool [19:14:23] yeah there is a lot of manual repetition here that seems like it could be abstracted [19:14:24] cool [19:14:39] maybe I can be convinced not to have the same target location on the minions, but I doubt it [19:14:46] haha [19:14:48] yeah! [19:14:49] ok! [19:14:54] Ok, ottomata, drdee_ -- what's the story? Can I merge that patch and purge mysql from stat1, or should I drop it? [19:15:23] we would need a misq machine a la stat1 in eqiad [19:15:23] let's see: its ugly and it forces all users into a convention that might have nothing to do with their software, i.e somethign that is not mediawiki related? [19:15:37] drdee_, we can use one of the analytics machiens, an27 or an26 [19:15:43] RECOVERY - NTP on amssq58 is OK: NTP OK: Offset -0.02078127861 secs [19:15:53] how much diskspace do those have? [19:16:29] drdee_, ottomata, Also, that new eqiad system would install mysql using the mysql module rather than the current way. [19:16:30] drdee, 2 x 1TB disks [19:16:35] yes [19:17:19] mmmm we could do it i worry about it though but let's take it offline -- also talk with dan and qchris about this [19:17:33] RECOVERY - NTP on amssq59 is OK: NTP OK: Offset -0.01965594292 secs [19:17:42] ottomata: yeah, that's a negative, but it's also immediately obvious where the software is going to be on the minion and that it's a free location to use, which makes the system less error prone as well [19:18:01] what happens when you specify an existing directory? [19:18:12] drdee, what's your gerrit username? [19:18:18] that's your own fault? :p [19:18:19] 'Diederik' [19:18:32] andrewbogott: I say purge away [19:18:39] Oh, it's you! 
I had no idea :) [19:18:46] (03PS1) 10Ottomata: Pointing analytics/kraken deployment to /srv/deployment/analytics/kraken [operations/puppet] - 10https://gerrit.wikimedia.org/r/88794 [19:19:11] ottomata: so far you're the only one with a really strong objection :) [19:20:18] it's my aversion to inelegance :p [19:20:28] i'm not going to fight it so hard…you should let me make a symlink though :D [19:20:45] just on the actual minions (? correct term here) [19:20:53] RECOVERY - NTP on amssq60 is OK: NTP OK: Offset -0.01466119289 secs [19:20:53] it'd be transparent to the deployment system [19:20:59] buuut whaaateevs [19:21:14] I'm fine with symlinks, assuming that it's necessary [19:21:34] if it isn't necessary, it's just an inconsistency to deal with [19:21:55] I'm going to remove the symlink you added on tin [19:22:30] (03CR) 10Manybubbles: [C: 031] Enable elastic search plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 (owner: 10Akosiaris) [19:22:53] (03PS3) 10Akosiaris: Enable elastic search plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 [19:22:59] akosiaris, manybubbles: when the plugins change, does the service need to be restarted? [19:23:00] manybubbles: a bit too fast... [19:23:03] that's fine Ryan_Lane, i was about to remove that too [19:23:29] manybubbles: look at PS3 please... it is changing a couple of things to make it more configurable and robust [19:23:44] Ryan_Lane: that depends. The site plugins shouldn't need it but others will [19:23:55] Ryan_Lane: but we want to manually perform all ES restarts anyway [19:24:05] ok [19:24:34] (03CR) 10Ottomata: [C: 032 V: 032] Pointing analytics/kraken deployment to /srv/deployment/analytics/kraken [operations/puppet] - 10https://gerrit.wikimedia.org/r/88794 (owner: 10Ottomata) [19:24:57] (03CR) 10Manybubbles: [C: 031] "Even better." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 (owner: 10Akosiaris) [19:25:04] (03PS1) 10Ottomata: Moving my account to admins::roots [operations/puppet] - 10https://gerrit.wikimedia.org/r/88797 [19:25:10] RobH: ^ [19:25:12] would you review? [19:25:47] yep [19:25:56] you want a +1 and you merge and handle or a +2? [19:26:23] Ryan_Lane: it juggles shards when a node goes down and has to restore them when the node comes back which all takes some time. [19:27:02] manybubbles: sounds fine to me [19:27:13] I was just wondering because for some services, we have automatic restarts on deploy [19:27:31] I was going to add a more robust way of handling that at some point [19:28:11] manually works, though [19:29:13] RECOVERY - NTP on amssq61 is OK: NTP OK: Offset -0.01825797558 secs [19:31:58] (03CR) 10Ryan Lane: [C: 032] Remove debhelpers for apparmor from labs image [operations/puppet] - 10https://gerrit.wikimedia.org/r/88648 (owner: 10Ryan Lane) [19:33:07] (03CR) 10RobH: [C: 032] Moving my account to admins::roots [operations/puppet] - 10https://gerrit.wikimedia.org/r/88797 (owner: 10Ottomata) [19:33:33] RECOVERY - NTP on amssq62 is OK: NTP OK: Offset -0.01911830902 secs [19:35:13] (03CR) 10Akosiaris: [C: 032] Enable elastic search plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 (owner: 10Akosiaris) [19:35:28] Is RT duty Monday-Sunday or Saturday-Friday, or… ? [19:35:39] Ryan_Lane: at some point we can turn on a better mechanism than manually, but I'm not sure we know enough of the pain points (of elasticsearch) yet to design one.
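For context on the "juggles shards" point above: a manual elasticsearch restart is usually wrapped in a pause of shard allocation so the cluster doesn't start rebalancing the moment the node drops out. A rough sketch of that dance (the setting name here is the 0.90-era one; it varies by ES version, so treat this as illustrative):

    # pause shard reallocation before a planned restart
    curl -XPUT localhost:9200/_cluster/settings -d '
      {"transient": {"cluster.routing.allocation.disable_allocation": true}}'
    sudo service elasticsearch restart
    # wait for the node to rejoin, then let allocation resume and watch recovery
    curl -XPUT localhost:9200/_cluster/settings -d '
      {"transient": {"cluster.routing.allocation.disable_allocation": false}}'
    curl 'localhost:9200/_cluster/health?pretty'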
[19:36:17] andrewbogott: monday to sunday [19:36:21] well, monday to friday [19:36:28] no one expects you to work extra days when on duty. [19:36:41] RobH: Ah, and this coming Monday which is a US holiday? [19:36:49] (That being the essence of my question :) ) [19:37:11] well, if the person on duty isn't taking holiday they take over, otherwise I'd say that there is simply no one triaging that day [19:37:21] (03CR) 10Ryan Lane: "It's difficult, but I'm also looking at ways to do this right now as well. The biggest issue is that the system is distributed, so parts o" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86762 (owner: 10Ryan Lane) [19:37:25] and that person got lucky that week with a day less triaging! [19:37:30] ksnider: ^ [19:37:51] Just because mutante and I just come up with RT triage ideas doesn't make them fact, even if that's been the pattern though ;] [19:37:58] RobH: OK, sounds like either way if I want to spend Monday out of doors that's more-or-less acceptable. [19:38:09] yea if you take over next week [19:38:15] i'd say monday simply has no triage done [19:38:17] and that's ok. [19:38:20] Today is the most beautiful day ever, although no doubt it will be sleeting by the weekend :( [19:38:38] i have my car legally tagged for use now [19:38:43] it better be damned nice this weekend. [19:38:51] I want to leave the city damn it [19:38:59] manybubbles: yeah, I'm not worried about it for now [19:39:03] * RobH is gonna go be a hermit in the woods [19:39:16] Well, in SF the chances of a snowstorm are much less [19:39:18] manybubbles: we'll make it automated later, when we figure out how [19:40:46] robh you're in luck http://www.weather.com/weather/tenday/USCA0987 [19:40:55] RobH: I'm going to a 70 acre plot this weekend in Mendocino county, no cell service even. it'll be great. (helping the owner do apple/grape picking/preserving). it'll be great :) [19:43:59] sounds nice, feel free to bring back resulting work for sharing with immediate desk neighbors [19:45:50] ^demon: https://gerrit-review.googlesource.com/#/c/50250/ [19:46:04] ^demon: openstack people are trying to get the work in progress feature merged [19:46:12] and wondered if we'd throw our support in [19:46:18] I'm pro! [19:46:37] same. I think we decided against using this feature because it wasn't upstreamed [19:46:46] we -1 changes right now, which sucks [19:47:08] <^demon> Ryan_Lane: I saw the discussion on repo-discuss for it. [19:47:22] <^demon> It's basically drafts, but you can toggle to/from draft status. [19:47:41] and it doesn't fuck with the history, like drafts do [19:47:43] Do we still have some production systems running Lucid? There sure is a lot of versioncmp($::lsbdistrelease, "12.04") >= 0 in the code I'm looking at... [19:47:48] drafts are a worthless feature [19:48:04] andrewbogott: we have machines running hardy... [19:48:04] I think this should really just replace drafts [19:48:12] akosiaris: dang [19:48:18] :-( [19:49:30] RobH: :) [19:55:46] ^demon: I owe you a package, right? [19:55:48] <_david_> ^demon, would you mean voting to that change? [19:56:19] ^demon: can you add me as a reviewer to your change? [19:56:39] <^demon> Ryan_Lane: I already did ;-) [19:56:40] oh, you did [19:57:38] <^demon> _david_: Come again? [19:57:59] <_david_> ^demon, can't parse that statement [19:58:19] <^demon> I couldn't parse yours either. Could you please rephrase?
:) [19:58:33] RECOVERY - Puppet freshness on amssq48 is OK: puppet ran at Wed Oct 9 19:58:28 UTC 2013 [19:59:15] <_david_> ^demon, ;-) [19:59:31] <_david_> ^demon, OpenStack project needs that change upstream in gerrit: https://gerrit-review.googlesource.com/#/c/50250 [19:59:50] <^demon> Yeah, I saw the discussion on repo-discuss. [20:00:32] <_david_> ^demon, i wonder if you guys can help, because i happen to know how important your opinion to gerrit's maintainer is [20:01:27] <_david_> you mean that thread: https://groups.google.com/forum/#!topic/repo-discuss/jXCL-rc9Dro [20:02:14] <^demon> Ah no, I meant "[Announce] Work In Progress plugin for Gerrit" from about 2 hours ago [20:02:25] COool Ryan_Lane, it worked! [20:02:27] i'm all deployed! [20:02:28] thank you! [20:02:34] <_david_> well, that one has very big history ;-) [20:03:03] <_david_> https://gerrit-review.googlesource.com/#/c/36091/ [20:03:05] ottomata: great :) [20:03:14] it's nice to not need to do anything for new repos :) [20:03:18] <_david_> and that one above is year old ;-) [20:03:27] work was well worth the effort in that regard [20:05:13] RECOVERY - NTP on amssq48 is OK: NTP OK: Offset -0.09636306763 secs [20:07:14] so, Ryan_Lane, maybe you can advise me here on this too [20:07:25] so that was the kraken repo, which has a lot of useful scripts and such that we will need [20:07:31] but, also, we need to build and deploy jars to hadoop [20:07:43] drdee and I are thinking of using jenkins to build the jars (as that is mostly already set up) [20:07:54] and then somehow using git deploy and/or hooks to get the jars over to analytics nodes and into hdfs [20:08:05] does that make sense or is there a better idea? [20:08:24] hm [20:08:30] hashar ^ you might have thoughts too [20:08:49] I'd prefer to have a non-git deployment method for that [20:08:59] we're going to be adding non-git deployment methods to the system soon [20:09:04] hm, well we need to to versioning of the jars too [20:09:08] yeah [20:09:17] the jars don't have to be in git [20:09:26] storing binaries in git isn't very efficient [20:09:30] we will figure out how to configure the builds to do the versioning [20:09:44] but it would be nice if it somehow synced up with git deploy [20:09:50] so that we could make sure versions of things match [20:09:51] not sure [20:10:10] Well… all this logic is to ensure that mysql5.5 is installed on >= Precise, and 5.1 on < Precise. But wouldn't apt automatically do that anyway? [20:11:36] Ryan_Lane: we might be able to configure jenkins to do the jar deployment/syncning [20:11:52] hm. [20:13:01] we should start a wiki page with ideas on how to do this [20:13:12] this is a different problem than we have with other repos [20:13:30] ottomata: hey :-] [20:13:51] ottomata: don't you have an effort to debianize your jar ? [20:13:55] no [20:14:03] that would be really annoying, this is for analysis code [20:14:04] ottomata: as for building jar in Jenkins, Chad did it for Gerrit and Buck iirc [20:14:07] more like an app that changes often [20:14:15] cool, yeah hashar we have it building jars [20:14:16] ottomata: one thought that I'd have is to have jenkins build on tags [20:14:20] we just need them synced to actual nodes [20:14:28] I see there is an ArtifactDeployer plugin [20:14:31] yeah totally [20:14:32] then have the deployment system deploy that tag [20:14:41] git deploy you mean? [20:14:49] or just jenkins? 
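Assuming the kraken pom's distributionManagement already points at the artifact repository, the "have jenkins build on tags" idea above boils down to a job that checks out a tag and runs a standard deploy goal; roughly (tag name hypothetical):

    # build from a tag, never from a moving branch
    git checkout kraken-0.2.1
    # compile, test, and upload the jars to the repository named in <distributionManagement>
    mvn clean deploy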
[20:15:01] ottomata: in the usual enterprise, Jenkins would build the jar and deploy it to a maven repo then you could fetch from that repo to update your software [20:15:03] this is the exact reason i don't like calling the system git-deploy :) [20:15:08] ottomata: at least that is how i understand java deployment. [20:15:25] yes we could publish the jar to nexus [20:15:49] Ryan_Lane: feel free to rename https://wikitech.wikimedia.org/wiki/Git-deploy to Sartoris [20:15:59] does nexus allow multiple versions of something to exist in it? [20:16:06] Ryan_Lane: I should have asked you before doing the rename. I thought earlier that Sartoris was abandoned. [20:16:22] hashar: sartoris isn't abandoned, I just haven't switched to it yet [20:16:38] I think faulkner still pushes in changes fairly often [20:16:57] Ryan_Lane: yes he does. I even have 3 patches pending there :] [20:17:05] Ryan_Lane: yes -- you can use maven to manage your versioning of the jars [20:17:15] yeah, nexus is like apt for jars [20:17:27] there are other things though too, I think manybubbles likes artifactory or something [20:17:40] don't fucking care [20:17:43] hah [20:17:45] so, we should figure out which one we want to use [20:17:49] Ryan_Lane: renamed back to Sartoris. Sorry :] [20:18:00] have jenkins automatically push into it [20:18:04] hm [20:18:21] I am not comfortable having Jenkins pushing jars to a repo from which we deploy in production [20:18:30] * Ryan_Lane nods [20:18:36] the deployment system can do that, then [20:18:36] at least not with the CI jenkins [20:18:36] artifactory had a nice feature in that you could convince it to pull only a whitelist of upstream jars and IIRC you could whitelist by name and hash [20:18:52] ok, here's a possible stumbling block.... [20:18:53] but we can probably set up another secured / internal Jenkins to do that. [20:19:00] does this allow two-phase deployment? [20:19:11] and still build the jar on the CI jenkins for testing out in labs. [20:19:13] RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Wed Oct 9 20:19:10 UTC 2013 [20:19:21] hashar: most folks do exactly that. they manually trigger release builds on jenkins though [20:19:23] it's not a hard requirement, but it sucks to not have it [20:19:42] manybubbles: Chad has a concept of 'promoting' builds [20:19:53] it's not necessarily an issue to inject things into the repo [20:19:57] even untrusted things [20:20:05] manybubbles: not sure what it does, but as I understand it that lets you manually tag a build as stable/ok for release. [20:20:05] assuming we explicitly tell the system what to use [20:20:28] hashar, in jenkins you mean? [20:20:36] ^demon ^^^ [20:20:50] hashar: in maven parlance you'd be turning a build from a SNAPSHOT to a release. it isn't normal. normally the release is its own thing. but that is just maven, I suppose. [20:21:54] what most folks do is have nexus host two repositories - one for SNAPSHOTs and one for releases. [20:22:14] and jenkins pushes into the SNAPSHOTs constantly. [20:22:23] and developers use the snapshot dependencies. [20:22:55] then, when you want to perform a release you freeze all the snapshots in your project, commit that, and have jenkins build that commit hash. [20:23:19] it tags it and all that happy junk and pushes the released artifact to the nexus release repository.
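The freeze-then-release flow just described is what the standard maven-release-plugin automates; a sketch of the usual two goals (generic maven practice, not necessarily what the kraken build does today):

    # fails if any dependency is still a SNAPSHOT, bumps x.y.z-SNAPSHOT to x.y.z, tags the commit
    mvn release:prepare
    # checks out that tag, rebuilds it, and deploys the released artifacts to the release repository
    mvn release:perform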
[23:23:30] normally only jenkins has the permission to do that [23:23:46] * hashar happily delegates to manybubbles who has clues about Java/Jenkins :-] [20:24:00] promoting a build has the issue that if it was built with SNAPSHOT dependencies then you'll never get that snapshot back. [20:24:20] because SNAPSHOTs aren't for that. they are for keeping up with the latest stuff. [20:24:31] that is why you have to freeze all your dependencies before you release. [20:24:41] it has its own internally consistent logic. [20:24:47] <^demon> Huh? [20:24:52] <^demon> ottomata: wha? [20:24:58] ok... so... [20:25:15] which may be crazy and doesn't come close to matching the debian packaging logic but they remind me of one another in their militancy [20:25:26] now I'll stop flooding the channel [20:25:43] manybubbles: it wasn't flooding, it gives a good idea of how things normally work :) [20:25:55] what we need to do is to map this into the deployment system [20:25:58] yeah very helpful [20:26:19] ^demon, we were talking about how to deploy/release jars and you have some experience? [20:26:21] normally that deployment build would build a deployable artifact [20:26:25] and we can modify the deployment system via config options to handle it better, if necessary [20:26:29] or something you'd be able to turn into a deployable artifact [20:26:45] but the latter is preferable. [20:27:06] right now the deployment system assumes it has a git repository that triggers a deployment based on a specific tag [20:27:11] qchris: are you around? do you have experience doing stuff like this too? [20:27:14] elasticsearch builds a deb package as part of its build process. it isn't compliant (to debian standards) but it follows the java model really well. [20:27:26] I am around, yes. [20:27:38] Let me read what you've been discussing. [20:27:40] it writes out a configuration file that the minions use to know which tag they are targeting [20:27:52] done much with CI / jar release deployment / maven/nexus/artifactory/whatever repos before? [20:28:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [20:28:13] Ryan_Lane: could the minions wget jars from the nexus release repository? [20:28:24] yes [20:28:27] not sure it's a great idea, but it is simple [20:28:48] it isn't totally necessary to have a two-phase deploy [20:29:08] it's preferable, though [20:29:13] RECOVERY - NTP on amssq52 is OK: NTP OK: Offset 0.007792949677 secs [20:29:23] manybubbles: how are nexus repos normally used? [20:30:04] Ryan_Lane: can you ask the question with other words? I think I've spewed a lot about how they are normally used but I've not covered your question. [20:30:22] ottomata: and it would be good to drop something on the wiki for other people to comment/read about :-] [20:30:33] they could also wget jars from jenkins [20:30:42] manybubbles: how does a client use a nexus repo? [20:30:50] for installation of something from the repo [20:30:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 24.012 second response time [20:30:58] https://integration.wikimedia.org/ci/view/Analytics/job/Kraken/21/org.wikimedia.analytics.kraken$kraken-generic/ [20:31:13] ideally we'd make the deployment system try to just be a wrapper around a normal process [20:31:19] ottomata: normally the release process actually pushes the jar to nexus from jenkins. it is normally built into maven.
[20:31:34] for git it just wraps a normal git process [20:31:50] yeah that would be better, pushing to nexus or whatever [20:32:04] Ryan_Lane, from what I've seen of nexus/maven [20:32:13] maven can be configured to point at nexus/artifactory or whatever [20:32:23] and pom.xml files inside of java projects specify dependencies [20:32:39] those deps are then downloaded from the nexus repo and manually managed inside of a ~/.m2 directory [20:32:51] but i'm not sure how one would manually download a jar from nexus normally [20:32:56] it's not like you do [20:32:58] maven install kraken [20:33:19] manybubbles: yeah, that is a good question [20:33:21] Ryan_Lane: normally you'd have the project build a single "deployable" artifact by squashing its dependencies into one jar or war or zip or tar [20:33:23] even if we get these jars into nexus [20:33:24] then what? [20:33:24] well, like I said, I'd prefer the deployment system just be a wrapper for a normal process [20:33:40] manybubbles: how do you actually deploy it, though? [20:33:44] maven install ? [20:33:49] or deb or rpm like elasticsearch does [20:34:02] Ryan_Lane: most people don't put maven on the production systems. [20:34:02] I'm fine using maven, assuming we can sanely wrap it [20:34:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [20:34:14] so, my use is a little extra complicated [20:34:17] Ryan_Lane: it doesn't do that well [20:34:20] ok [20:34:21] the jars need to be deployed to the analytics nodes, as well as into hdfs [20:34:28] which means extra fancy scripting to put them there [20:34:46] ottomata: the deployment system will let you run any custom code you need after the checkout phase [20:34:54] cool yeah i saw that [20:34:56] * hashar looks for ./fancy.sh [20:34:57] honestly it becomes a free for all once you get the "deployable" artifact into nexus [20:35:03] ok [20:35:06] we could use wget [20:35:07] that's why I was thinking of making it just dl the jars from jenkins (or nexus) [20:35:11] i wish it were nicer and more standard [20:35:14] or curl or whatever [20:35:25] that works well with the two-phase as well [20:35:37] ok, so two parts down [20:35:40] Ryan_Lane: i'm not sure i know what you mean by two-phase? [20:35:46] fetch, checkout [20:35:49] aha ha [20:35:49] ah [20:35:50] right ok [20:35:54] so... [20:35:57] jenkins -> repo [20:36:10] minions fetch from repo [20:36:19] now for the part of actually deploying.... [20:36:29] what do we do on tin? :) [20:36:44] (repo here is e.g. nexus?) [20:36:47] yeah [20:36:52] the repo should likely live on tin [20:37:01] Ryan_Lane: not normally [20:37:14] manybubbles: ? [20:37:21] keep in mind that it will also hold the snapshots [20:37:43] which could be updated frequently [20:37:46] * Ryan_Lane nods [20:37:56] maybe we'd have some form of repo cache there, then? [20:37:57] tin should fetch from the repo, but the repo is generally its own beast [20:38:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.661 second response time [20:38:25] the repo is, after all, a pretty big java web app [20:38:26] so, would tin fetch the snapshots and assemble them?
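The "extra fancy scripting" for the analytics case could be as small as a post-checkout hook on each node that pulls the released jar and copies it into HDFS. A sketch using the kraken-generic artifact from the Jenkins link above (the repository host, URL layout and HDFS path are illustrative, not anything that exists yet):

    VERSION=0.2.1
    JAR="kraken-generic-${VERSION}.jar"
    curl -fo "/srv/deployment/analytics/kraken/artifacts/${JAR}" \
        "http://nexus.wmflabs.org/releases/org/wikimedia/analytics/kraken/kraken-generic/${VERSION}/${JAR}"
    # released jars are immutable, so a plain put is enough
    hadoop fs -put "/srv/deployment/analytics/kraken/artifacts/${JAR}" "/libs/kraken/${JAR}"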
[20:38:34] it'd fetch the relases [20:38:35] releases [20:38:41] snapshots are a developer only thing [20:38:46] I see [20:38:53] ops stays away from anything with -SNAPSHOT in it [20:38:57] indeed [20:39:01] because it is tainted by randomness [20:39:09] as tomorrow you couldn't rebuild it exactly [20:39:36] so, maybe we could have a config file managed in a git repo on tin? [20:39:37] manybubbles: we could host a repo on tin that mirrors snapshots from nexus.wmflabs.org? [20:39:39] sorry [20:39:39] it'd make sense to have the minions download the released artifacts (jars, whatever) right from the repo [20:39:41] not snapshots [20:39:42] releases [20:39:43] that has all the hashes necessary [20:40:04] that would work too [20:40:12] Ryan_Lane: if it is from a source you trust all you need is name and version [20:40:27] as part of the deploy phase? [20:40:42] deploy kraken-0.2.x [20:40:44] whatever? [20:40:48] config file on tin inside a git repository that is synced to the minions [20:40:50] then that knows how to get the version from nexus? [20:41:08] maybe the repo on tin would just need to write a snapshot version? [20:41:17] curl nexus.wmfnet/releases/org/wmf/WHATEVER/9.0 [20:41:28] sorry, curl nexus.wmfnet/releases/org/wmf/WHATEVER/9.0/WHATEVER-9.0.jar [20:41:31] or maybe it could just host the configuration file? and the config file would be versioned? [20:42:06] Ryan_Lane: that. keep a config file with the name and version of what you want the minions to fetch from nexus [20:42:09] what's the config file? just the version to deploy? [20:42:16] yes [20:42:23] hmmmm [20:42:35] because nexus keeps released artifacts forever [20:42:48] well, until someone manually smashes them [20:42:58] then to rollback you'd just rollback to the older version [20:43:02] of the config gile [20:43:03] *file [20:43:06] yeah. [20:43:12] hmmmm [20:43:15] I think this sounds like a sane plan [20:43:16] so [20:43:27] we might want multiple versions of jars deployed at once... [20:43:39] ottomata: please describe [20:43:49] since analysis scripts depend on them, and it is risky to change the underlying code without verification [20:43:53] as reported numbers will change [20:43:55] for example [20:44:14] one of the kraken jars has logic to classify a webrequest line as a real pageview [20:44:34] the hadoop jobs that run load up the (versioned?) jar and then output numbers [20:44:47] there will be lots of jobs [20:45:00] wouldn't the versioned jars be in the release bundle? [20:45:01] and we probably won't want to automatically change the underlying logic for all jobs at once [20:45:03] Yes. We definitely have that. We'll need to have kraken-pig-0.1.0.jar and kraken-pig-0.1.1.jar both deployed on a machine. [20:45:48] I'm not sure how java releases are usually done, but I think we were thinking [20:45:55] the version is in the source [20:45:57] inside the maven poms [20:45:58] I'm not familiar enough with hadoop to know: how do you declare which of those jars you want to depend on in your script? [20:46:03] In that example kraken-pig is like the release. [20:46:03] the jar names get built based on that [20:46:15] manybubbles: Importing the jar in the classpath. 
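Putting the "config file with the name and version" idea into concrete terms: the fetch step could read a small list committed on tin and pull exactly those releases, so a rollback is nothing more than reverting the commit that bumped a version. A sketch (file name, format and URL layout are hypothetical):

    # artifacts.list holds one "<artifact> <version>" pair per line, e.g.
    #   kraken-generic 0.2.1
    #   kraken-pig 0.1.1
    while read artifact version; do
        curl -fo "${artifact}-${version}.jar" \
            "http://nexus.wmflabs.org/releases/org/wikimedia/analytics/kraken/${artifact}/${version}/${artifact}-${version}.jar"
    done < artifacts.list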
[20:46:21] if we released something we'd want to keep all previous released jars and 'symlink' the latest to a versionless name [20:46:35] manybubbles: just a file path [20:46:41] so, with version if you need to be specific [20:46:48] we can't actually do symlinks in hadoop [20:46:54] so we were thinking of just copying the latest version to a versionless name [20:47:03] and if you wanted your script to always use whatever latest version [20:47:12] you'd just load the versionless jar, rather than the versioned one [20:47:21] ottomata: fine by me [20:47:30] I prefer a symlink to a copy, but whatever [20:47:34] but, for lots of production analysis, changing the underlying logic has to be done very intentionally [20:48:08] your fancy deployment script could ensure that all listed jars are there and a particular one is the "default" unversioned one [20:48:10] make the config file a yaml file? have it define a hash of things to deploy? [20:48:21] that'd work great [20:48:41] then deploying a new version would just be a matter of adding the release to the config file [20:48:42] and deploying [20:48:47] and the deploy script would keep everything in sync [20:48:48] we're adding a new deployment method. we can do anything we want, for the most part ;) [20:49:43] ottomata: ls /usr/share/java [20:51:55] where? [20:52:08] any ubuntu machine should do [20:52:15] ok, symlinks? [20:53:12] would someone mind writing up a synopsis of our discussion here: https://wikitech.wikimedia.org/wiki/Sartoris/Design#Nexus ? [20:53:15] it's just that the debian maintainers have a surprisingly similar system [20:53:24] to what I'm describing, yeah [20:53:29] Ryan_Lane: can do [20:53:31] i think i got it [20:53:39] they are a bit more cavalier about dependencies than you want to be [20:53:41] but still [20:53:42] I'll follow up as well [20:53:52] but we need a relatively good functional spec to go off of [20:54:20] quick summary to make sure I understand: [20:55:54] - jenkins builds [20:55:54] - jenkins pushes to nexus/artifactory repo [20:55:55] - sartoris supports project config file for jar deployment. config file specifies versions to deploy [20:55:55] - sartoris command to run jar deployment, ... [20:55:58] actually i'm unclear on that last part [20:55:59] then what? [20:56:17] git deploy start [20:56:20] modify config file [20:56:23] git commit [20:56:25] git deploy sync [20:56:33] -> fetch [20:56:48] pulls the jars to the minions [20:56:53] -> checkout [20:56:56] makes them active [20:58:04] the pulling of the jars is done via a hook then? [20:58:15] fetch hook to curl the jars from nexus? [20:58:16] Ryan_Lane: "makes them active" means switching that symlink. nothing else is really required because jars should never change once they are in production. [20:58:21] not a hook, no [20:58:29] but they aren't actually in git, [20:58:36] so, we have a salt module [20:58:41] fetch currently does git calls [20:59:13] but we'll turn that method into something generic that says "what kind of repo is this? It's nexus, so call the nexus fetch function." [20:59:23] same with checkout [20:59:40] hm so we use something other than shared.py? that's how all that happens? [20:59:48] or modify shared.py to be smart [20:59:57] and that just tells salt to do the nexus thing rather than git [20:59:58] ? [21:00:03] oh, sorry.
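The workflow summarised just above, plus the "versionless default" idea, might look like this from the operator's side (the artifacts.list name is carried over from the hypothetical sketch earlier; HDFS has no symlinks, hence the copy):

    # on tin
    git deploy start
    $EDITOR artifacts.list        # add kraken-pig 0.1.1, keep 0.1.0 listed for older jobs
    git commit -am 'deploy kraken-pig 0.1.1'
    git deploy sync               # fetch pulls the listed jars, checkout makes them active

    # on an analytics node, promote the new release to the "default" unversioned name
    hadoop fs -rm /libs/kraken/kraken-pig.jar
    hadoop fs -cp /libs/kraken/kraken-pig-0.1.1.jar /libs/kraken/kraken-pig.jar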
that part is on the frontend [21:00:43] we'll also modify that to understand different repo types [21:01:00] shared currently does some git commands, really those should be in a hook [21:01:24] mostly what shared.py does is call the salt runners (fetch and checkout) [21:01:33] and show reporting info and such [21:01:49] almost all the logic is in the modules on the minions [21:02:37] the runner does basically nothing too. it's mostly just a protection layer [21:03:15] hm, ok i still only half understand (because I don't have a full grasp on sartoris yet), can I write the bit I know and have you fix and fill in the blanks? [21:03:27] yep [21:03:31] that was the plan :) [21:03:34] something high level would be goo [21:03:36] *good [21:03:59] when it comes time to implement, I can walk you through things as well, if you'd like to help [21:04:27] yeah totally [21:04:41] (03CR) 10Dr0ptp4kt: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [21:05:06] ^^awjr you got a minute to google hangout to discuss [21:05:09] ? [21:05:41] dr0ptp4kt: give me a minute to wrap something up and read backscroll [21:06:01] or did you just mean discuss that aprticular patchset/ [21:06:01] awjr, word. i'll be back in about 5 mins [21:06:05] cool :) [21:06:12] awjr, i meant just that patchset. be back in 5 [21:07:38] cool dr0ptp4kt, im done now so just ping me when you're back [21:07:39] ok. lunch :) [21:11:41] Ryan_Lane: https://wikitech.wikimedia.org/wiki/Sartoris/Design#JVM_Application_Deployment [21:12:18] looks good. I'll fill in the rest [21:14:15] <^d> Whee jvm deploys [21:16:00] awjr, back, ready when you are [21:16:11] thanks [21:17:31] ok dr0ptp4kt, ready [21:24:15] (03PS1) 10Ottomata: Adding ironholds to admins::restricted. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88883 [21:24:46] (03CR) 10Ottomata: [C: 032 V: 032] Adding ironholds to admins::restricted. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88883 (owner: 10Ottomata) [21:26:02] (03PS6) 10Dr0ptp4kt: Add an extra header for cache variance of W0 banners for proxies. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 [21:32:02] awjr, thx again [21:32:14] any time dr0ptp4kt [21:46:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [21:47:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.760 second response time [21:58:23] PROBLEM - Puppet freshness on cp4001 is CRITICAL: No successful Puppet run in the last 10 hours [21:59:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [21:59:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 24.175 second response time [22:04:52] (03PS1) 10Ori.livneh: Log Puppet run times to StatsD via custom Puppet reporter [operations/puppet] - 10https://gerrit.wikimedia.org/r/88888 [22:05:11] change 88888! [22:06:05] I wonder if I can get 100,000, so that I can have it in gerrit and svn [22:06:46] write a script to submit 11,112 patches [22:08:12] <^d> I should've had svn 100k. [22:08:17] <^d> Ryan's mean and stole it. 
[22:08:53] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:08:56] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:09:23] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 44%, RTA = 86.48 ms [22:09:33] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.47 ms [22:09:35] ok, why'd you kill the tubes ? [22:10:05] there was nothing good on [22:11:22] :) [22:18:23] PROBLEM - Puppet freshness on cp4014 is CRITICAL: No successful Puppet run in the last 10 hours [22:19:23] PROBLEM - Puppet freshness on cp4019 is CRITICAL: No successful Puppet run in the last 10 hours [22:20:23] PROBLEM - Puppet freshness on cp4015 is CRITICAL: No successful Puppet run in the last 10 hours [22:20:23] PROBLEM - Puppet freshness on cp4005 is CRITICAL: No successful Puppet run in the last 10 hours [22:22:23] PROBLEM - Puppet freshness on cp4017 is CRITICAL: No successful Puppet run in the last 10 hours [22:22:23] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: No successful Puppet run in the last 10 hours [22:30:57] (03PS1) 10Lcarr: adding ulsfo into snmp allow list [operations/puppet] - 10https://gerrit.wikimedia.org/r/88891 [22:31:15] (03PS2) 10Lcarr: adding ulsfo into snmp allow list [operations/puppet] - 10https://gerrit.wikimedia.org/r/88891 [22:32:29] Hm, can anyone explain to me again the difference between the 'shell' and 'ops' keywords on our Bugzilla? [22:33:05] https://bugzilla.wikimedia.org/show_bug.cgi?id=54828 needs someone to run a script on a server, would it make it an 'ops' bug? :-) [22:33:36] (03CR) 10Lcarr: [C: 032] adding ulsfo into snmp allow list [operations/puppet] - 10https://gerrit.wikimedia.org/r/88891 (owner: 10Lcarr) [22:33:50] (03CR) 10Lcarr: [V: 032] adding ulsfo into snmp allow list [operations/puppet] - 10https://gerrit.wikimedia.org/r/88891 (owner: 10Lcarr) [22:33:51] twkozlowski: shell means, usually, running something on the WMF cluster (like a maintenance script, or creating a new wiki, ie: things Reedy can do). "ops" is for things like DNS names or SSL certs or... other things that require 'root' privs [22:35:30] Thanks greg-g [22:37:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [22:37:07] Though, our DNS repo is now public... [22:37:33] Reedy: yeah, but actually making it happen needs opsen though, right? 
[22:38:27] Yup [22:38:27] <^d> Yeah [22:38:44] All but the same idea with mediawiki-config [22:38:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 18.228 second response time [22:38:56] * greg-g nods [22:39:17] Hm, I've always taken 'shell' to mean 'this has to be added manually to mediawiki-config', but your answer makes sense greg-g :) [22:39:32] whew [22:42:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [22:43:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.440 second response time [22:46:40] Importing Pressbilder_till_Ars_11,_reprofotografi_av_bild_från_resealbum_-_Hallwylska_museet_-_87785.tif.../usr/local/bin/mwscript: line 18: 11173 Segmentation fault php "$MW_COMMON_DIR_USE/multiversion/MWScript.php" "$@" [22:46:45] Consistently segfaulting :( [22:52:23] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: No successful Puppet run in the last 10 hours [22:57:51] Anyone around with root fancy helping me find out what's causing php to segfault on terbium when importing this specific image? [22:58:42] that's an 'ops' keyword request ;) [22:58:55] Or a stroopwaffel one [22:59:00] Depending on how you look at it [22:59:01] mmmmmm [22:59:47] * Reedy downloads the 127M file [23:03:29] I've a slight hunch it's metadata related [23:33:01] Stack overflow is seemingly more and more likely [23:34:40] wtf [23:34:45] Run it under valgrind and it completes [23:35:29] <^d> Maybe we should run it under valgrind on the cluster so it'll complete then ;-) [23:36:06] much more leisurely speed of life [23:36:18] ^d, reminds me how I once downloaded a torrent with Wireshark capturing:P [23:36:34] luckily, I have 16 gigs of RAM [23:37:21] Reedy: hey, I'm sort of around [23:38:01] paravoid: Poking around it seems getting coredumps from PHP running not under apache is more difficult, requiring php to have been compiled using --enable-debug or similar [23:38:30] AaronSchulz: all -public + all -deleted are synced now [23:38:32] Can replicate it locally, so got more access rights there obviously [23:38:43] oh, ok [23:38:54] Though, running it under valgrind it completes fine [23:39:09] how about gdb? [23:39:18] + zbacktrace? [23:39:30] ok [23:39:40] (03PS1) 10Ori.livneh: Tee sampled Special:BannerImpression stream to Hafnium for StatsD-ification [operations/puppet] - 10https://gerrit.wikimedia.org/r/88902 [23:39:42] http://p.defau.lt/?K_Q5SRntXxOZxq6XQq_ebw [23:39:44] I saw GETs were down, figured it finished [23:39:58] AaronSchulz: do you want me to attempt mediawiki-config? 
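On the core-dump point above: a cheaper route than an --enable-debug rebuild, when all you want is a core from CLI PHP, is to raise the core size limit before reproducing (whether and where the dump lands also depends on the kernel's core_pattern; paths here are examples):

    ulimit -c unlimited
    php /path/to/script.php       # reproduce the segfault
    gdb /usr/bin/php core         # dump file name depends on core_pattern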
[23:40:05] (03CR) 10jenkins-bot: [V: 04-1] Tee sampled Special:BannerImpression stream to Hafnium for StatsD-ification [operations/puppet] - 10https://gerrit.wikimedia.org/r/88902 (owner: 10Ori.livneh) [23:41:07] Duh [23:41:18] (03PS2) 10Ori.livneh: Tee sampled Special:BannerImpression stream to Hafnium for StatsD-ification [operations/puppet] - 10https://gerrit.wikimedia.org/r/88902 [23:41:19] paravoid: you can make a patch I guess [23:41:26] (if you want) [23:41:36] I'll fix privatesettings first [23:41:47] https://bugs.php.net/bugs-generating-backtrace.php [23:41:54] or, running from the commandline [23:41:54] gdb /home/user/dev/php-snaps/sapi/cli/php [23:41:54] (gdb) run /path/to/script.php [23:41:54] (gdb) bt [23:41:57] It's badly organised [23:42:29] Ahaaa [23:42:30] Reedy: zbacktrace :) [23:42:47] lots of complaints about debug info mismatch [23:42:48] BUT [23:42:48] Program received signal SIGSEGV, Segmentation fault. [23:42:49] php_ifd_get16u (value=0xfffffffff9f44320, motorola_intel=0) at /build/buildd/php5-5.4.9/ext/exif/exif.c:1095 [23:42:49] 1095 /build/buildd/php5-5.4.9/ext/exif/exif.c: No such file or directory. [23:42:57] [00:03:28] I've a slight hunch it's metadata related [23:43:06] exif, meet metadata [23:43:36] Reedy: "help directory" [23:43:42] and point to the php tree [23:43:47] so you can see the actual code [23:44:45] paravoid: in PS.php yeah [23:46:03] hm, you did the tempurl right [23:46:07] I remember nothing about it [23:48:07] E: Unable to locate package zbacktrace [23:48:09] apt says no [23:48:38] oh.. in gdb [23:50:05] paravoid: http://torgomatic.us/blog/2013/05/08/an-introduction-to-tempurl-in-openstack-swift/ [23:50:26] manually authenticating and posting...or using swift tools if it's a little faster [23:50:47] I saw that [23:50:56] we don't have the middleware in the pipeline though [23:51:27] never did [23:54:43] return (((uchar *)value)[1] << 8) | ((uchar *)value)[0]; [23:55:10] found zbacktrace? [23:55:14] nope... [23:55:18] https://github.com/php/php-src/blob/PHP-5.4.9/ext/exif/exif.c#L1095 [23:55:24] you have to source .gdbinit from the php source tree [23:55:25] iirc [23:55:33] then you have an extra gdb command [23:55:35] z stands for zend [23:55:39] so it's a bit more friendly [23:56:05] I saw a github link for .gdbinit before.. [23:56:07] so basically apt-get source php5, make sure it's the same version as the one you run [23:56:15] tell gdb to look there for the source [23:56:29] has to be the exact same version, since it's line numbers it references [23:56:50] so "directory /tmp/php5-5.4.9/" or whatever [23:57:00] then "source /tmp/php5-5.4.9/.gdbinit" [23:57:05] could be under some subdir, I don't remember [23:57:15] https://github.com/php/php-src/blob/master/.gdbinit [23:57:17] then instead of bt/bt full you just type zbacktrace
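Pulling the gdb recipe from the last few exchanges into one place (paths are examples; the unpacked source tree must match the exact running PHP version, 5.4.9 here, and .gdbinit sits at the top of the php source tree):

    apt-get source php5                        # in a scratch dir, e.g. yielding /tmp/php5-5.4.9
    gdb /usr/bin/php
    (gdb) directory /tmp/php5-5.4.9            # so frames show the real exif.c lines
    (gdb) source /tmp/php5-5.4.9/.gdbinit      # adds the zbacktrace command
    (gdb) run /path/to/script.php              # reproduce the crash
    (gdb) bt                                   # C-level backtrace (where php_ifd_get16u showed up)
    (gdb) zbacktrace                           # PHP-level backtrace of the userland call stack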