[00:03:37] (03Abandoned) 10Cmjohnson: Revert "removing mw1125 from dsh files- new hard drive has been installed" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88674 (owner: 10Cmjohnson) [00:05:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 00:05:33 UTC 2013 [00:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [00:35:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 00:35:23 UTC 2013 [00:35:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [00:49:22] !log krinkle synchronized php-1.22wmf10/extensions/VisualEditor 'I85bce4d40e430318' [00:49:37] Logged the message, Master [00:51:41] AaronSchulz: wmf20, WikimediaMessages is dirty (commit that isn't in the origin repo, though a similar commit did get merged but has a different hash) [00:51:49] 2 commits ahead, ~ 20 commits behind [00:52:01] a commit from you, just fyi [00:52:10] wmf19, not wmf20 sorry [00:52:51] !log krinkle synchronized php-1.22wmf19/extensions/VisualEditor 'I03d68280ddd9506' [00:53:05] Logged the message, Master [00:57:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [00:58:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.996 second response time [01:00:22] Krinkle: I think that's for csteipp to sort out; it was part of the response to bug 54847 [01:01:02] it was a security patch initially [01:04:00] ori-l: they're both in gerrit (under a different commit hash) [01:04:09] not sure why Ib3e32cac1426f0dbeb55f872961fc8c87380c180 was uncomitted there, seems a trivial fix [01:04:18] mostly by association [01:05:22] I'm heading home but I'll tidy it up in a few hours if Chris doesn't beat me to the punch. It shouldn't fall on Aaron's head just because he was nice to submit the patch in the first place. [01:06:15] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 01:06:06 UTC 2013 [01:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [01:07:33] ori-l: sure, no problem. 
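(A note on the "2 commits ahead, ~ 20 commits behind" state Krinkle describes: the quickest way to see how a deployed checkout has diverged from Gerrit is to compare it against its upstream branch. A minimal sketch in shell; the path and branch name below are illustrative, not the actual deployment layout.)

    cd php-1.22wmf19/extensions/WikimediaMessages   # hypothetical working-copy path
    git fetch origin
    # count commits only in the local checkout vs. only on the remote branch
    git rev-list --left-right --count HEAD...origin/master
    # list the local-only commits so they can be matched against what was merged
    # in Gerrit (by subject rather than hash, since the hashes differ)
    git log --oneline origin/master..HEAD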
Upon further investigation it should all make sense, just annoying to see [01:35:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 01:35:25 UTC 2013 [01:35:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [02:05:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 02:05:45 UTC 2013 [02:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [02:16:09] !log LocalisationUpdate completed (1.22wmf20) at Wed Oct 9 02:16:08 UTC 2013 [02:16:26] Logged the message, Master [02:30:07] !log LocalisationUpdate completed (1.22wmf19) at Wed Oct 9 02:30:06 UTC 2013 [02:30:21] Logged the message, Master [02:35:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 02:35:31 UTC 2013 [02:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [02:45:00] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Oct 9 02:44:59 UTC 2013 [02:45:15] Logged the message, Master [03:05:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 03:05:21 UTC 2013 [03:05:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [03:35:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 03:35:46 UTC 2013 [03:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [04:05:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 04:05:25 UTC 2013 [04:05:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [04:08:15] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on bast4001 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4001 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4002 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4003 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4004 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4005 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:15] PROBLEM - Puppet freshness on cp4006 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:16] PROBLEM - Puppet freshness on cp4007 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:16] PROBLEM - Puppet freshness on cp4008 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:17] PROBLEM - Puppet freshness on cp4009 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:17] PROBLEM - Puppet freshness on cp4010 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:18] PROBLEM - Puppet freshness on cp4011 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:18] PROBLEM - Puppet freshness on cp4012 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:19] PROBLEM - Puppet freshness on cp4013 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:19] PROBLEM - Puppet freshness on cp4014 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:20] PROBLEM - Puppet freshness on cp4015 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:20] 
PROBLEM - Puppet freshness on cp4016 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:21] PROBLEM - Puppet freshness on cp4017 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:21] PROBLEM - Puppet freshness on cp4018 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:22] PROBLEM - Puppet freshness on cp4019 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:22] PROBLEM - Puppet freshness on cp4020 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:23] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:23] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:24] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:24] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: No successful Puppet run in the last 10 hours [04:26:15] PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 10 hours [04:38:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 04:38:52 UTC 2013 [04:39:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [05:05:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 05:05:51 UTC 2013 [05:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [05:16:15] PROBLEM - Puppet freshness on db1051 is CRITICAL: No successful Puppet run in the last 10 hours [05:16:15] PROBLEM - Puppet freshness on db54 is CRITICAL: No successful Puppet run in the last 10 hours [05:17:15] PROBLEM - Puppet freshness on mw1054 is CRITICAL: No successful Puppet run in the last 10 hours [05:17:15] PROBLEM - Puppet freshness on mw1154 is CRITICAL: No successful Puppet run in the last 10 hours [05:17:15] PROBLEM - Puppet freshness on mw1155 is CRITICAL: No successful Puppet run in the last 10 hours [05:17:15] PROBLEM - Puppet freshness on mw55 is CRITICAL: No successful Puppet run in the last 10 hours [05:17:15] PROBLEM - Puppet freshness on sq52 is CRITICAL: No successful Puppet run in the last 10 hours [05:18:55] RECOVERY - Puppet freshness on osm-cp1001 is OK: puppet ran at Wed Oct 9 05:18:53 UTC 2013 [05:20:15] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [05:20:15] PROBLEM - Puppet freshness on db57 is CRITICAL: No successful Puppet run in the last 10 hours [05:20:15] PROBLEM - Puppet freshness on mw1053 is CRITICAL: No successful Puppet run in the last 10 hours [05:20:15] PROBLEM - Puppet freshness on sq50 is CRITICAL: No successful Puppet run in the last 10 hours [05:21:15] PROBLEM - Puppet freshness on db55 is CRITICAL: No successful Puppet run in the last 10 hours [05:21:15] PROBLEM - Puppet freshness on db51 is CRITICAL: No successful Puppet run in the last 10 hours [05:21:15] PROBLEM - Puppet freshness on mw53 is CRITICAL: No successful Puppet run in the last 10 hours [05:24:15] PROBLEM - Puppet freshness on db56 is CRITICAL: No successful Puppet run in the last 10 hours [05:24:15] PROBLEM - Puppet freshness on mw1059 is CRITICAL: No successful Puppet run in the last 10 hours [05:24:16] PROBLEM - Puppet freshness on sq59 is CRITICAL: No successful Puppet run in the last 10 hours [05:26:15] PROBLEM - Puppet freshness on db1056 is CRITICAL: No successful Puppet run in the last 10 hours [05:27:15] PROBLEM - Puppet freshness on sq56 is CRITICAL: No 
successful Puppet run in the last 10 hours [05:27:15] PROBLEM - Puppet freshness on srv257 is CRITICAL: No successful Puppet run in the last 10 hours [05:28:15] PROBLEM - Puppet freshness on mw56 is CRITICAL: No successful Puppet run in the last 10 hours [05:28:15] PROBLEM - Puppet freshness on srv252 is CRITICAL: No successful Puppet run in the last 10 hours [05:29:15] PROBLEM - Puppet freshness on sq57 is CRITICAL: No successful Puppet run in the last 10 hours [05:30:15] PROBLEM - Puppet freshness on sq53 is CRITICAL: No successful Puppet run in the last 10 hours [05:31:15] PROBLEM - Puppet freshness on db50 is CRITICAL: No successful Puppet run in the last 10 hours [05:31:15] PROBLEM - Puppet freshness on mw1151 is CRITICAL: No successful Puppet run in the last 10 hours [05:32:15] PROBLEM - Puppet freshness on db1059 is CRITICAL: No successful Puppet run in the last 10 hours [05:34:15] PROBLEM - Puppet freshness on mw1152 is CRITICAL: No successful Puppet run in the last 10 hours [05:35:15] PROBLEM - Puppet freshness on srv256 is CRITICAL: No successful Puppet run in the last 10 hours [05:35:35] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 05:35:30 UTC 2013 [05:36:15] PROBLEM - Puppet freshness on mw54 is CRITICAL: No successful Puppet run in the last 10 hours [05:36:15] PROBLEM - Puppet freshness on db52 is CRITICAL: No successful Puppet run in the last 10 hours [05:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [05:38:15] PROBLEM - Puppet freshness on mw1158 is CRITICAL: No successful Puppet run in the last 10 hours [05:39:15] PROBLEM - Puppet freshness on srv258 is CRITICAL: No successful Puppet run in the last 10 hours [05:40:15] PROBLEM - Puppet freshness on db1052 is CRITICAL: No successful Puppet run in the last 10 hours [05:41:15] PROBLEM - Puppet freshness on cp1051 is CRITICAL: No successful Puppet run in the last 10 hours [05:42:15] PROBLEM - Puppet freshness on cp1050 is CRITICAL: No successful Puppet run in the last 10 hours [05:42:15] PROBLEM - Puppet freshness on sq51 is CRITICAL: No successful Puppet run in the last 10 hours [05:43:15] PROBLEM - Puppet freshness on mw1057 is CRITICAL: No successful Puppet run in the last 10 hours [05:43:15] PROBLEM - Puppet freshness on srv253 is CRITICAL: No successful Puppet run in the last 10 hours [05:45:15] PROBLEM - Puppet freshness on sq54 is CRITICAL: No successful Puppet run in the last 10 hours [05:45:15] PROBLEM - Puppet freshness on sq58 is CRITICAL: No successful Puppet run in the last 10 hours [05:48:15] PROBLEM - Puppet freshness on mw51 is CRITICAL: No successful Puppet run in the last 10 hours [05:51:15] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [05:54:15] PROBLEM - Puppet freshness on sq55 is CRITICAL: No successful Puppet run in the last 10 hours [05:55:15] PROBLEM - Puppet freshness on mw52 is CRITICAL: No successful Puppet run in the last 10 hours [05:55:15] PROBLEM - Puppet freshness on srv259 is CRITICAL: No successful Puppet run in the last 10 hours [05:57:15] PROBLEM - Puppet freshness on db59 is CRITICAL: No successful Puppet run in the last 10 hours [05:57:15] PROBLEM - Puppet freshness on mw59 is CRITICAL: No successful Puppet run in the last 10 hours [05:57:15] PROBLEM - Puppet freshness on srv251 is CRITICAL: No successful Puppet run in the last 10 hours [05:58:15] PROBLEM - Puppet freshness on mw1157 is CRITICAL: No successful Puppet run in the last 10 hours [05:59:15] 
PROBLEM - Puppet freshness on mw1058 is CRITICAL: No successful Puppet run in the last 10 hours [06:01:15] PROBLEM - Puppet freshness on mw1055 is CRITICAL: No successful Puppet run in the last 10 hours [06:01:15] PROBLEM - Puppet freshness on srv254 is CRITICAL: No successful Puppet run in the last 10 hours [06:02:15] PROBLEM - Puppet freshness on mw1051 is CRITICAL: No successful Puppet run in the last 10 hours [06:02:15] PROBLEM - Puppet freshness on mw1156 is CRITICAL: No successful Puppet run in the last 10 hours [06:03:15] PROBLEM - Puppet freshness on mw1050 is CRITICAL: No successful Puppet run in the last 10 hours [06:05:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 06:05:18 UTC 2013 [06:05:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [06:09:15] PROBLEM - Puppet freshness on mw1052 is CRITICAL: No successful Puppet run in the last 10 hours [06:09:15] PROBLEM - Puppet freshness on srv250 is CRITICAL: No successful Puppet run in the last 10 hours [06:11:15] PROBLEM - Puppet freshness on db53 is CRITICAL: No successful Puppet run in the last 10 hours [06:12:15] PROBLEM - Puppet freshness on mw1056 is CRITICAL: No successful Puppet run in the last 10 hours [06:12:15] PROBLEM - Puppet freshness on mw1159 is CRITICAL: No successful Puppet run in the last 10 hours [06:12:43] !log aaron synchronized php-1.22wmf19/includes 'c6d64f5488c124636ab46712e1d104e5c7076325' [06:13:00] Logged the message, Master [06:13:15] PROBLEM - Puppet freshness on db58 is CRITICAL: No successful Puppet run in the last 10 hours [06:13:15] PROBLEM - Puppet freshness on mw57 is CRITICAL: No successful Puppet run in the last 10 hours [06:13:15] PROBLEM - Puppet freshness on mw58 is CRITICAL: No successful Puppet run in the last 10 hours [06:13:15] PROBLEM - Puppet freshness on srv255 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:15] PROBLEM - Puppet freshness on db1050 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:15] PROBLEM - Puppet freshness on mw1150 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:15] PROBLEM - Puppet freshness on mw1153 is CRITICAL: No successful Puppet run in the last 10 hours [06:35:45] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 06:35:44 UTC 2013 [06:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [07:05:25] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 07:05:20 UTC 2013 [07:05:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [07:07:56] (03PS1) 10Springle: repool db1022 in S6. move db1039 to assist with upgrades in S7. depool db1007 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88691 [07:08:46] (03CR) 10Springle: [C: 032] repool db1022 in S6. move db1039 to assist with upgrades in S7. depool db1007 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88691 (owner: 10Springle) [07:18:47] !log springle synchronized wmf-config/db-eqiad.php 'repool db1022 in S6. move db1039 to assist with upgrades in S7. 
depool db1007 for upgrade' [07:18:58] Logged the message, Master [07:35:55] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 07:35:53 UTC 2013 [07:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [07:51:33] (03PS1) 10Ori.livneh: Navigation Timing: differentiate by auth status rather than wiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/88692 [07:53:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [07:54:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 12.396 second response time [07:57:23] mutante: I am awake :-] [07:57:35] hashar: pong [07:58:11] mutante: rebooting -:] [07:58:18] enjoy your breakfast [07:58:21] I am getting my coffee [07:58:22] connecting to mgmt [07:58:35] ohh [07:58:39] there is another kernel :-) [07:59:00] ah, let's get that too [08:00:18] okk [08:01:06] bah it is merely "Bump ABI" [08:02:16] ah yea, so no new features but can still be fixes [08:02:32] !log rebooting gallium [08:02:43] Logged the message, Master [08:03:15] PROBLEM - Puppet freshness on amssq49 is CRITICAL: No successful Puppet run in the last 10 hours [08:04:29] now we "just" have to wait for fsck :D [08:05:45] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 08:05:42 UTC 2013 [08:06:25] PROBLEM - zuul_service_running on gallium is CRITICAL: Connection refused by host [08:06:25] PROBLEM - SSH on gallium is CRITICAL: Connection refused [08:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [08:06:35] PROBLEM - HTTP on gallium is CRITICAL: Connection refused [08:06:55] PROBLEM - jenkins_service_running on gallium is CRITICAL: Connection refused by host [08:07:19] mutante: ^^^^^ it does :-] [08:07:33] though I have no idea whether that emits pages [08:08:14] the zuul service just notify by irc https://icinga.wikimedia.org/cgi-bin/icinga/notifications.cgi?host=gallium&service=zuul_service_running [08:08:19] hashar: the jekins_service_running does not [08:08:32] monitor_service would have to have critical => true for that [08:08:55] paging is probably unneeded [08:08:59] it is not that critical [08:09:05] nod [08:09:12] at worth devs will complain for a couple hours until some root / eng folk get to restart the services [08:09:23] hmm, i dont really see output on console [08:09:39] maybe reconnect? [08:09:52] apparently NTP is back up [08:10:05] did and gets the garbled output [08:12:40] wth.. mgmt refuses connection after reset [08:12:44] bahh [08:12:48] narf [08:13:05] ah wait, slooow [08:13:43] 583.913308] SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) [08:14:15] PROBLEM - Puppet freshness on amssq48 is CRITICAL: No successful Puppet run in the last 10 hours [08:14:21] it looks stuck,i'd had to show you a screenshot .. [08:14:32] -1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. . . .1;-1fUbuntu 12.041;-1f. [08:15:17] if you stare closely at it, it becomes 3D [08:15:41] hashar: let me powercycle it .. 
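(An aside on hashar's "now we 'just' have to wait for fsck": ext filesystems record when they were last checked and how often they have been mounted, and when either threshold is exceeded the boot-time fsck is forced — which is where the "470 days without being checked" message a few lines below comes from. A sketch for inspecting, and if wanted controlling, that behaviour; the device name is only an example.)

    sudo tune2fs -l /dev/md0 | grep -Ei 'last checked|mount count|check interval'
    # disable the age/mount-count based checks so fsck only runs when you schedule it:
    #   sudo tune2fs -c 0 -i 0 /dev/md0
    # or, conversely, request a full check on the next reboot:
    #   sudo touch /forcefsck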
sigh
[08:16:02] sorry :(
[08:16:06] morning hashar / mutante btw
[08:16:14] no reason to be sorry
[08:16:47] actually sees BIOS and booting normally now
[08:16:54] ori-l: hello :-)
[08:16:57] hi ori
[08:17:15] PROBLEM - Puppet freshness on amssq51 is CRITICAL: No successful Puppet run in the last 10 hours
[08:17:27] /dev/md0 has gone 470 days without being checked, check forced.
[08:17:27] Checking disk drives for errors. This may take several minutes.
[08:17:30] there we go
[08:17:41] yeah several hours would be a more accurate message
[08:17:45] danke!
[08:18:46] 470 days ..
[08:18:51] ori-l: I sent a bunch of changes to carbon aggregation and added you as a reviewer
[08:18:58] but we rebooted earlier than that
[08:19:03] ori-l: not really sure what the impacts are though :(
[08:19:16] mutante: yeah a few days ago it was 143~ days uptime
[08:19:31] hashar: i only saw the comment change
[08:19:42] sounds like we gave up/canceled fsck another time
[08:19:50] ori-l: and it is not really a priority :-]
[08:20:18] ori-l: also yesterday we had a release/QA weekly check-in, we talked a bit about your exception/Fatal to json system. Greg is supposed to reach out to you eventually
[08:20:29] ori-l: a good plan would be to enable it on beta cluster for people to play with :]
[08:20:45] PROBLEM - DPKG on vanadium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:20:59] mutante: I guess you can context switch to some other small task :]
[08:21:12] mutante: I am tailing syslog for gallium ip so will get aware whenever it is back
[08:21:15] heh, ok, task, make coffee
[08:21:24] cool
[08:21:45] RECOVERY - DPKG on vanadium is OK: All packages OK
[08:22:09] discovery: the icinga dpkg check fails if it happens to run during apt-get upgrade
[08:23:17] ahh good catch
[08:24:06] !log Stopping EventLogging, then rebooting vanadium for kernel upgrade
[08:24:15] PROBLEM - Puppet freshness on amssq50 is CRITICAL: No successful Puppet run in the last 10 hours
[08:24:17] Logged the message, Master
[08:24:31] it basically just does dpkg -l | grep -v ^ii
[08:25:18] eh, anything that is not ii or rc it doesnt like afair
[08:25:46] must be kind of lucky to catch it during an upgrade though
[08:27:55] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/vanadium consumer/server-side-events-log consumer/mysql-db1047 consumer/client-side-events-log consumer/all-events-log multiplexer/all-events processor/server-side-events processor/client-side-events forwarder/8422 forwarder/8421
[08:30:18] ori-l: event logging jobs dead ^^^
[08:30:31] hashar: see !log above
[08:30:55] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning.
[08:31:12] ori-l: sorry for INT
[08:31:26] no problem at all
[08:36:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 08:35:56 UTC 2013
[08:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[08:38:35] graphite metric represented as a "stock ticker" : http://square.github.io/cubism/ :D
[08:39:02] i actually had that set up on labs at one point for edits
[08:39:14] * hashar feels useless
[08:39:19] it's not bad but horizon charts a bit hard to read
[08:39:41] anyone want to help looking at puppet freshness issue?
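(An aside on the DPKG check race ori-l and mutante noted above: the check is described as essentially `dpkg -l | grep -v ^ii`, disliking anything not in state ii or rc. A rough approximation in shell — not the actual Icinga plugin — showing why a host mid-upgrade trips it.)

    # list packages whose state is neither "ii" (installed) nor "rc" (removed, conffiles kept)
    dpkg -l | tail -n +6 | grep -Ev '^(ii|rc) ' || echo "all packages in a clean state"
    # while apt-get upgrade is running, packages pass through transient states such as
    # "iU" (unpacked) or "iF" (half-configured), so a check that happens to fire at that
    # moment reports broken packages even though nothing is actually wrong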
db1051 (and various other hosts) as of about 19:45 last night,
[08:39:43] i find it nice to represent a ton of data at the same time
[08:39:47] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to parse template squid/squid-disk-permissions.erb: Could not find value for 'squid_coss_disks' at 5:/etc/puppet/templates/squid/squid-disk-permissions.erb at /etc/puppet/manifests/squid.pp:67 on node db1051.eqiad.wmnet
[08:39:58] apergos: i can look
[08:40:05] apergos: why would we have squid on a db server?
[08:40:12] saw nothing suspicious in git log or on puppetmaster, another pair of eyes would help
[08:40:34] well what's better is that some of the nodes in the same node defn run fine (unless I am reading it wrong)
[08:40:51] there's a number of other servers all with the same error, see icinga, some squids and some are dbs
[08:41:28] there's no reason on earth we would have hashar and afaict we didn't have (nor do we on the dbs that run successfully)
[08:41:56] i.e. I did root@db1017:~# grep squid /var/lib/puppet/state/classes.txt and got nothing back
[08:42:05] but now I need to shut up and let someone fresh look at it
[08:42:16] PROBLEM - Puppet freshness on amssq52 is CRITICAL: No successful Puppet run in the last 10 hours
[08:42:16] PROBLEM - Puppet freshness on amssq53 is CRITICAL: No successful Puppet run in the last 10 hours
[08:42:32] apergos: I found that sometimes using "puppetd -tv --evaltrace" helps
[08:42:44] that shows the class being processed by puppet
[08:42:52] first thing that stands out (but is probably unrelated) is $squid_coss_disks = split(get_var('squid_coss_disks'), ',')
[08:42:57] i was like, 'what the hell is get_var'
[08:43:09] so i googled it and what do i find if not a bug report from hashar :P
[08:43:11] https://bugzilla.wikimedia.org/show_bug.cgi?id=38524
[08:43:25] hehe
[08:43:32] yeah I came across that on beta
[08:43:44] but that should be labs related
[08:43:46] it appears to be something from a media temple puppet module that was carelessly imported
[08:43:58] "Unknown function get_var "
[08:44:00] yeah, not causing the issue apergos is seeing, but still worth flagging since it's broken code
[08:44:02] so if you look at a squid node, you see how it goes, and (most of) those run fine
[08:44:45] there is one sure thing which is that db1051 should not have the squid manifest applied
[08:44:59] I have no clue how to simulate a run of puppet given a node named 'db1051'
[08:46:24] * apergos tries evaltrace
[08:46:32] didn't know about that option
[08:47:03] nope, no class listed before the error, too bad
[08:48:52] apergos: I came across evaltrace a while back and listed it on the [[Puppet]] wikitech article https://wikitech.wikimedia.org/wiki/Puppet#Debugging
[08:49:18] well we aren't into that I guess, still waiting for the server to complete parsing
[08:49:22] so problematic :-(
[08:50:44] apergos: it happens repeatedly and consistently on db1051?
[08:50:54] on a pile of hosts
[08:51:14] oh yeah, i see the list in the scrollback
[08:51:17] https://icinga.wikimedia.org/icinga/ check the passive checks
[08:51:31] under critical
[08:52:13] notice something interesting?
[08:52:38] all the fails are hosts with 5*, 105*, 115*, 25*
[08:52:41] apergos: icinga..
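(The root cause the channel converges on just below is the node regex in site.pp: without grouping, the alternation makes `5[0-9]` a branch of its own, and since puppet node regexes are unanchored, any certname containing "5" plus a digit — db1051, mw1054, sq52, cp1050, ... — falls into the squid node definition. A quick way to see it, using example hostnames from the log.)

    printf '%s\n' db1051.eqiad.wmnet mw1054.eqiad.wmnet amssq48.esams.wikimedia.org |
      grep -E 'amssq4[7-9]|5[0-9]|6[0-2]\.esams\.wikimedia\.org$'
    # -> all three hostnames match; the first two only via the stray '5[0-9]' branch
    printf '%s\n' db1051.eqiad.wmnet mw1054.eqiad.wmnet amssq48.esams.wikimedia.org |
      grep -E '^amssq(4[7-9]|5[0-9]|6[0-2])\.esams\.wikimedia\.org$'
    # -> only amssq48.esams.wikimedia.org: the grouped (and, per ori-l's suggestion,
    #    ^-anchored) form catches only the esams squids it was meant for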
of all those Puppet freshness checks, many are disabled
[08:52:47] and 20 are new ULSFO hosts
[08:53:12] talking about mw, srv, db, sq
[08:53:24] could be wrong regex in site.pp
[08:53:55] (03PS1) 10Addshore: Various user rights config changes on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88694
[08:53:56] node /amssq4[7-9]|5[0-9]|6[0-2]\.esams\.wikimedia\.org$/ {
[08:54:26] 724df4d8d55750b6018bf49424762e577efef4ec
[08:54:35] ouch
[08:54:36] needs a paren
[08:55:12] yep
[08:55:16] nice catch
[08:55:24] no jenkins right now..cough
[08:55:28] let me get this in
[08:56:29] \O/
[08:56:37] (03PS1) 10ArielGlenn: fix up amssq node expr (caught non squid hosts) [operations/puppet] - 10https://gerrit.wikimedia.org/r/88695
[08:56:44] you will get to force merge it
[08:56:47] jenkins busy rebooting
[08:56:48] good catch!
[08:56:58] someone else wanna double check me please?
[08:57:25] I've been staring at this for a long time now so my eyes are tired
[08:57:37] but you said the magic words regexp and then it was obvious
[08:58:15] PROBLEM - Puppet freshness on amssq58 is CRITICAL: No successful Puppet run in the last 10 hours
[08:58:30] I have no clue how to simulate a puppet run for a given node though :(
[08:58:35] it looks right. i'd add a ^ at the beginning but it's not critical, mostly a style thing.
[08:58:57] looks good, like the example above it
[08:59:17] just that it ends in $ and the other doesn't, but same as what ori said
[09:00:14] (03CR) 10ArielGlenn: [C: 032 V: 032] fix up amssq node expr (caught non squid hosts) [operations/puppet] - 10https://gerrit.wikimedia.org/r/88695 (owner: 10ArielGlenn)
[09:00:17] meh, merging anyways, worst that happens we break some more hosts :-P
[09:03:15] PROBLEM - Puppet freshness on amssq54 is CRITICAL: No successful Puppet run in the last 10 hours
[09:03:26] yeah yeah hush
[09:03:35] RECOVERY - Puppet freshness on db1051 is OK: puppet ran at Wed Oct 9 09:03:32 UTC 2013
[09:04:15] PROBLEM - Puppet freshness on amssq61 is CRITICAL: No successful Puppet run in the last 10 hours
[09:04:35] RECOVERY - Puppet freshness on db52 is OK: puppet ran at Wed Oct 9 09:04:32 UTC 2013
[09:04:52] guess I'll wait for the dust to settle and see what's left
[09:05:05] RECOVERY - Puppet freshness on mw54 is OK: puppet ran at Wed Oct 9 09:04:57 UTC 2013
[09:05:05] RECOVERY - Puppet freshness on mw51 is OK: puppet ran at Wed Oct 9 09:05:02 UTC 2013
[09:05:17] time for breakfast
[09:06:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 09:05:57 UTC 2013
[09:06:15] PROBLEM - Puppet freshness on amssq55 is CRITICAL: No successful Puppet run in the last 10 hours
[09:06:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours
[09:06:55] RECOVERY - Puppet freshness on srv258 is OK: puppet ran at Wed Oct 9 09:06:48 UTC 2013
[09:06:55] RECOVERY - Puppet freshness on mw1052 is OK: puppet ran at Wed Oct 9 09:06:53 UTC 2013
[09:07:05] RECOVERY - Puppet freshness on srv250 is OK: puppet ran at Wed Oct 9 09:06:58 UTC 2013
[09:07:35] RECOVERY - Puppet freshness on db53 is OK: puppet ran at Wed Oct 9 09:07:28 UTC 2013
[09:07:43] :)
[09:08:16] PROBLEM - Puppet freshness on amssq57 is CRITICAL: No successful Puppet run in the last 10 hours
[09:08:16] RECOVERY - Puppet freshness on cp1051 is OK: puppet ran at Wed Oct 9 09:08:14 UTC 2013
[09:08:20] apergos: nice fix, enjoy
[09:08:25] hashar: it's coming back:)
[09:08:26] RECOVERY - SSH on gallium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1
(protocol 2.0) [09:08:30] ^ [09:08:34] mutante: hurrah [09:08:36] RECOVERY - Puppet freshness on cp1050 is OK: puppet ran at Wed Oct 9 09:08:34 UTC 2013 [09:08:36] RECOVERY - Puppet freshness on db1052 is OK: puppet ran at Wed Oct 9 09:08:34 UTC 2013 [09:08:47] gallium login: [09:08:54] I am on [09:08:55] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [09:09:25] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [09:09:35] RECOVERY - HTTP on gallium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 563 bytes in 0.001 second response time [09:09:47] HTTPError: HTTP Error 503: Service Temporarily Unavailable [09:09:48] hehe [09:09:52] jenkins busy restarting [09:09:52] ha! [09:10:25] RECOVERY - Puppet freshness on mw1057 is OK: puppet ran at Wed Oct 9 09:10:15 UTC 2013 [09:10:34] !log gallium : zuul backup, waiting for jenkins to complete start up [09:10:45] RECOVERY - Puppet freshness on mw1159 is OK: puppet ran at Wed Oct 9 09:10:35 UTC 2013 [09:10:45] RECOVERY - Puppet freshness on db58 is OK: puppet ran at Wed Oct 9 09:10:35 UTC 2013 [09:10:45] RECOVERY - Puppet freshness on mw1056 is OK: puppet ran at Wed Oct 9 09:10:40 UTC 2013 [09:10:45] Logged the message, Master [09:10:55] RECOVERY - Puppet freshness on srv253 is OK: puppet ran at Wed Oct 9 09:10:50 UTC 2013 [09:12:05] RECOVERY - Puppet freshness on srv255 is OK: puppet ran at Wed Oct 9 09:11:55 UTC 2013 [09:12:15] RECOVERY - Puppet freshness on sq58 is OK: puppet ran at Wed Oct 9 09:12:05 UTC 2013 [09:12:15] RECOVERY - Puppet freshness on mw57 is OK: puppet ran at Wed Oct 9 09:12:05 UTC 2013 [09:12:25] RECOVERY - Puppet freshness on mw58 is OK: puppet ran at Wed Oct 9 09:12:21 UTC 2013 [09:13:05] RECOVERY - Puppet freshness on db1050 is OK: puppet ran at Wed Oct 9 09:13:01 UTC 2013 [09:13:55] RECOVERY - Puppet freshness on sq54 is OK: puppet ran at Wed Oct 9 09:13:46 UTC 2013 [09:13:55] RECOVERY - Puppet freshness on srv256 is OK: puppet ran at Wed Oct 9 09:13:51 UTC 2013 [09:14:55] RECOVERY - Puppet freshness on db54 is OK: puppet ran at Wed Oct 9 09:14:51 UTC 2013 [09:15:05] RECOVERY - Puppet freshness on mw1150 is OK: puppet ran at Wed Oct 9 09:14:56 UTC 2013 [09:15:05] RECOVERY - Puppet freshness on sq52 is OK: puppet ran at Wed Oct 9 09:15:01 UTC 2013 [09:15:15] RECOVERY - Puppet freshness on mw1153 is OK: puppet ran at Wed Oct 9 09:15:11 UTC 2013 [09:15:55] RECOVERY - Puppet freshness on mw1054 is OK: puppet ran at Wed Oct 9 09:15:52 UTC 2013 [09:15:55] RECOVERY - Puppet freshness on mw55 is OK: puppet ran at Wed Oct 9 09:15:52 UTC 2013 [09:16:25] RECOVERY - Puppet freshness on mw1155 is OK: puppet ran at Wed Oct 9 09:16:22 UTC 2013 [09:16:26] RECOVERY - Puppet freshness on mw1154 is OK: puppet ran at Wed Oct 9 09:16:22 UTC 2013 [09:17:55] RECOVERY - Puppet freshness on sq51 is OK: puppet ran at Wed Oct 9 09:17:52 UTC 2013 [09:19:05] RECOVERY - Puppet freshness on sq50 is OK: puppet ran at Wed Oct 9 09:18:58 UTC 2013 [09:19:15] RECOVERY - Puppet freshness on db51 is OK: puppet ran at Wed Oct 9 09:19:08 UTC 2013 [09:19:25] RECOVERY - Puppet freshness on db1058 is OK: puppet ran at Wed Oct 9 09:19:23 UTC 2013 [09:19:55] RECOVERY - Puppet freshness on mw1053 is OK: puppet ran at Wed Oct 9 09:19:48 UTC 2013 [09:20:05] RECOVERY - Puppet freshness on db55 is OK: puppet ran at Wed Oct 9 09:19:59 UTC 2013 [09:20:05] RECOVERY - Puppet freshness on db57 is OK: 
puppet ran at Wed Oct 9 09:20:04 UTC 2013 [09:20:15] PROBLEM - Puppet freshness on amssq56 is CRITICAL: No successful Puppet run in the last 10 hours [09:21:05] RECOVERY - Puppet freshness on mw53 is OK: puppet ran at Wed Oct 9 09:21:04 UTC 2013 [09:21:05] RECOVERY - Puppet freshness on sq55 is OK: puppet ran at Wed Oct 9 09:21:04 UTC 2013 [09:21:15] PROBLEM - Puppet freshness on amssq59 is CRITICAL: No successful Puppet run in the last 10 hours [09:22:05] RECOVERY - Puppet freshness on db59 is OK: puppet ran at Wed Oct 9 09:21:59 UTC 2013 [09:22:55] RECOVERY - Puppet freshness on srv259 is OK: puppet ran at Wed Oct 9 09:22:45 UTC 2013 [09:22:55] RECOVERY - Puppet freshness on mw52 is OK: puppet ran at Wed Oct 9 09:22:45 UTC 2013 [09:23:15] RECOVERY - Puppet freshness on sq59 is OK: puppet ran at Wed Oct 9 09:23:10 UTC 2013 [09:23:45] RECOVERY - Puppet freshness on mw1059 is OK: puppet ran at Wed Oct 9 09:23:40 UTC 2013 [09:23:45] RECOVERY - Puppet freshness on db56 is OK: puppet ran at Wed Oct 9 09:23:40 UTC 2013 [09:24:05] RECOVERY - Puppet freshness on db1056 is OK: puppet ran at Wed Oct 9 09:23:55 UTC 2013 [09:24:45] RECOVERY - Puppet freshness on mw59 is OK: puppet ran at Wed Oct 9 09:24:40 UTC 2013 [09:25:05] RECOVERY - Puppet freshness on srv251 is OK: puppet ran at Wed Oct 9 09:24:55 UTC 2013 [09:25:45] RECOVERY - Puppet freshness on mw56 is OK: puppet ran at Wed Oct 9 09:25:36 UTC 2013 [09:25:45] RECOVERY - Puppet freshness on srv257 is OK: puppet ran at Wed Oct 9 09:25:41 UTC 2013 [09:25:45] RECOVERY - Puppet freshness on mw1158 is OK: puppet ran at Wed Oct 9 09:25:41 UTC 2013 [09:25:46] hehe [09:25:55] and because apergos now fixed that regex I guess we need to reinstall those servers [09:25:55] RECOVERY - Puppet freshness on mw1058 is OK: puppet ran at Wed Oct 9 09:25:46 UTC 2013 [09:26:02] because they're not supposed to have squid on them [09:26:05] RECOVERY - Puppet freshness on mw1157 is OK: puppet ran at Wed Oct 9 09:25:56 UTC 2013 [09:26:16] PROBLEM - Puppet freshness on amssq60 is CRITICAL: No successful Puppet run in the last 10 hours [09:26:16] RECOVERY - Puppet freshness on sq56 is OK: puppet ran at Wed Oct 9 09:26:06 UTC 2013 [09:26:23] mark: they don't have squid on them because puppet was unable to apply it [09:26:32] but they probably do now [09:26:41] not apergos fault [09:26:45] RECOVERY - Puppet freshness on srv252 is OK: puppet ran at Wed Oct 9 09:26:41 UTC 2013 [09:27:10] oh I mean, the amssq* ones, that the node regex was meant for [09:27:15] RECOVERY - Puppet freshness on sq57 is OK: puppet ran at Wed Oct 9 09:27:06 UTC 2013 [09:27:47] right, not all those other hosts; that would have been awful [09:27:55] RECOVERY - Puppet freshness on sq53 is OK: puppet ran at Wed Oct 9 09:27:51 UTC 2013 [09:27:56] indeed [09:28:05] RECOVERY - Puppet freshness on srv254 is OK: puppet ran at Wed Oct 9 09:28:02 UTC 2013 [09:28:35] RECOVERY - Puppet freshness on mw1055 is OK: puppet ran at Wed Oct 9 09:28:27 UTC 2013 [09:29:05] RECOVERY - Puppet freshness on db50 is OK: puppet ran at Wed Oct 9 09:28:57 UTC 2013 [09:29:05] RECOVERY - Puppet freshness on mw1151 is OK: puppet ran at Wed Oct 9 09:29:02 UTC 2013 [09:29:55] RECOVERY - Puppet freshness on mw1156 is OK: puppet ran at Wed Oct 9 09:29:52 UTC 2013 [09:29:55] RECOVERY - Puppet freshness on mw1050 is OK: puppet ran at Wed Oct 9 09:29:52 UTC 2013 [09:30:05] RECOVERY - Puppet freshness on mw1051 is OK: puppet ran at Wed Oct 9 09:29:57 UTC 2013 [09:30:08] (03PS1) 10Ori.livneh: Remove unused (and broken) labs config from 
squid manifest [operations/puppet] - 10https://gerrit.wikimedia.org/r/88698 [09:30:15] PROBLEM - Puppet freshness on amssq62 is CRITICAL: No successful Puppet run in the last 10 hours [09:31:05] RECOVERY - Puppet freshness on db1059 is OK: puppet ran at Wed Oct 9 09:30:58 UTC 2013 [09:31:19] (03CR) 10jenkins-bot: [V: 04-1] Remove unused (and broken) labs config from squid manifest [operations/puppet] - 10https://gerrit.wikimedia.org/r/88698 (owner: 10Ori.livneh) [09:31:45] RECOVERY - Puppet freshness on mw1152 is OK: puppet ran at Wed Oct 9 09:31:43 UTC 2013 [09:32:48] (03CR) 10Ori.livneh: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88698 (owner: 10Ori.livneh) [09:33:16] ori-l: jenkins still busy realoading [09:33:37] the '-1' feature appears to have loaded before some of the others :P [09:34:17] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=gallium.wikimedia.org&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Miscellaneous+eqiad [09:34:22] mutante: could you merge https://gerrit.wikimedia.org/r/#/c/88692/ possibly? [09:34:42] !log jenkins back up [09:34:53] Logged the message, Master [09:35:40] ori-l: no. [09:35:46] :> [09:35:50] ori-l: bed time [09:35:51] (03PS2) 10Hashar: Navigation Timing: differentiate by auth status rather than wiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/88692 (owner: 10Ori.livneh) [09:36:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 09:36:00 UTC 2013 [09:36:06] hijacked the change to validate jenkins [09:36:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [09:36:40] you can't make me [09:41:55] RECOVERY - Puppet freshness on mw1126 is OK: puppet ran at Wed Oct 9 09:41:47 UTC 2013 [09:43:25] (03CR) 10Dzahn: [C: 032] Navigation Timing: differentiate by auth status rather than wiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/88692 (owner: 10Ori.livneh) [09:43:46] thanks [09:46:00] mutante: Stop enabling him! [09:46:25] (03CR) 10Dzahn: [C: 032] delete dsh group 'nagios' [operations/puppet] - 10https://gerrit.wikimedia.org/r/88069 (owner: 10Dzahn) [09:47:30] I wonder if it's worth the effort of puppetising the machines I use [09:48:51] Might be useful even if just for an exercise to get more familiar with it [09:48:52] ori-l: np, made sense to me to compare that [09:48:58] Reedy: :P heh [09:49:04] how many nodes would it be? [09:50:07] They're all sort of 1 offs... 6 or 7 [09:50:25] Would enable me to rebuild them quicker if needed [09:50:29] Rather than playing guess the package [09:50:37] yea [09:50:51] my vps setup is mostly puppetized, https://github.com/atdt/2501 [09:51:19] (03PS2) 10Addshore: Various user rights config changes on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88694 [09:51:30] mostly i got tired of reading the znc man page for the 1000th time [09:51:44] Reedy: start with just the package definitions, sounds good [09:51:51] That's what I was thinking [09:52:06] Then there's a few config files, scripts I wrote and other crap that'd be nice to have a copy of [09:52:53] test-box-for-transitional-dummy-font-package-foo :P [09:53:25] RECOVERY - Puppet freshness on terbium is OK: puppet ran at Wed Oct 9 09:53:23 UTC 2013 [09:57:58] ACKNOWLEDGEMENT - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours daniel_zahn shut it. removed from puppet.RT #5908 [10:01:30] mutante: Jenkins is fine now. 
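(For the "start with just the package definitions" idea Reedy and ori-l discuss above, a minimal sketch of what that looks like for a one-off box; the package names are placeholders. The point is that one small manifest plus `puppet apply` already beats playing guess-the-package after a rebuild.)

    cat > onebox.pp <<'EOF'
    # everything this box needs, written down once
    $pkgs = ['znc', 'git', 'screen', 'nginx']
    package { $pkgs:
      ensure => installed,
    }
    EOF
    sudo puppet apply onebox.pp
    # hand-edited config files and home-grown scripts can follow later as file {} resources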
Thank you very much :-] [10:02:04] ori-l: if you are still around, bottom of https://integration.wikimedia.org/zuul/ has a surprise for you :-] [10:03:26] hashar: great:) yw [10:03:38] ah, neat [10:05:01] whenever I get zuul upgraded we will get an idea of how long changes have been waiting [10:05:10] and even provide an estimation of the completion [10:06:32] (03PS2) 10Dzahn: redirect pk.wikimedia.org to meta community page [operations/apache-config] - 10https://gerrit.wikimedia.org/r/86652 [10:08:05] RECOVERY - Puppet freshness on williams is OK: puppet ran at Wed Oct 9 10:07:56 UTC 2013 [10:08:35] PROBLEM - Puppet freshness on williams is CRITICAL: No successful Puppet run in the last 10 hours [10:12:01] (03CR) 10Ladsgroup: [C: 031] "as another native speaker I confirm the translation, thank you Ebrahim for doing this :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88108 (owner: 10Ebrahim) [10:17:40] (03CR) 10Dzahn: [C: 04-2] "oh, wait, all of a sudden we own this. Record last updated on..: 2013-10-08. just need to make MarkMonitor switch NS over to us and we'll " [operations/dns] - 10https://gerrit.wikimedia.org/r/86658 (owner: 10Dzahn) [10:18:18] (03Abandoned) 10Dzahn: remove wikipaedia.net , not in list per RT #5681 [operations/dns] - 10https://gerrit.wikimedia.org/r/86658 (owner: 10Dzahn) [10:34:15] (03PS1) 10Dzahn: redirect vikipedi[a].com.tr to tr.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/88705 [10:36:05] (03CR) 10Mark Bergsma: [C: 032] Move the remaining non-wikipedia projects to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/88509 (owner: 10Mark Bergsma) [10:38:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [10:40:23] (03PS1) 10Dzahn: add vikipedia.com.tr and vikipedia.com.tr RT #5928 [operations/dns] - 10https://gerrit.wikimedia.org/r/88706 [10:44:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.181 second response time [10:54:59] (03CR) 10Dzahn: [C: 031] "using this for some maintenance and didn't want to keep unpuppetized files. makes more sense though if we'd keep it in sync on changes" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88126 (owner: 10Dzahn) [10:57:37] !log Non-Wikipedia projects are now on Varnish for eqiad non-ipv6, non-https traffic [10:57:52] Logged the message, Master [11:02:23] (03CR) 10Dzahn: "this is _really_ not used for OTRS mail anymore, right?" [operations/dns] - 10https://gerrit.wikimedia.org/r/88147 (owner: 10Dzahn) [11:03:38] (03CR) 10Dzahn: "Jeff, fwiw, there's also jgreens_otrs_cgi_experiments on there (before it gets wiped)" [operations/dns] - 10https://gerrit.wikimedia.org/r/88147 (owner: 10Dzahn) [11:06:52] ERROR: tab character found on line 2 [11:06:56] wtf puppet-lint [11:07:41] Reedy: welcome to retabbing ,hehe [11:08:19] wait, here's the .vimrc [11:08:28] I don't use vim :p [11:08:32] Don't we prefer tabs in the puppet repo? 
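(On the retabbing itself — the repo's manifests are moving to four-space indentation, as the next messages note — a blunt shell approach for a single manifest. The file name is hypothetical, and the last pipeline just summarizes which puppet-lint checks fire most often, assuming its usual "file - WARNING: ... on line N" output format.)

    sed -i 's/\t/    /g' manifests/mymodule.pp     # naive: also retabs tabs inside quoted strings
    puppet-lint --with-filename manifests/mymodule.pp
    # rough survey of the most common complaints across the whole tree:
    puppet-lint --with-filename manifests/ 2>/dev/null \
      | sed -e 's/^.* - //' -e 's/ on line [0-9]*$//' | sort | uniq -c | sort -rn | head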
[11:09:02] it's 4-space now [11:09:03] not anymore [11:09:08] yes, that [11:09:15] for puppet manifests [11:09:20] we're not gonna retab all configs etc [11:10:29] yeah, wasn't suggesting we did [11:10:41] Slightly scary that tabs == Error [11:10:51] Reedy: https://wikitech.wikimedia.org/wiki/Puppet_coding#common_errors [11:10:57] yes [11:11:12] replaces the 2-space comment there with 4-space [11:11:43] ah, but the .vimrc is already good [11:11:47] what's that 80 space nonsense on that page [11:11:50] 80 char [11:12:15] well, I guess puppet lint would complain about it ;) [11:12:17] they are all just errors from puppet-lint [11:12:47] yea, just a list of the most common ones you get [11:13:55] $ puppet-lint --with-filename manifests/|wc -l [11:13:56] 41808 [11:13:57] :( [11:15:10] hashar: but i'm pretty sure that's well below what it was , i once had that number too quite a while ago [11:15:59] eh, and you'd have to compare ratio, errors/line [11:17:46] hashar: on March 22, "we get 35344 warnings or errors on 27982 lines of code. that's 1.26 per line" [11:18:28] 20362 of them were "tab character" [11:19:48] yeah slowly improving :-] [11:21:55] hashar: pasting your better example to apply 4-space to puppet files only [11:22:40] (03CR) 10Akosiaris: [C: 04-2] "It was right before." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [11:24:02] templatedir=$confdir/templates [11:24:08] $confdir isn't defined in the config :/ [11:27:49] mutante: https://gerrit.wikimedia.org/r/#/c/88706 this looks ok to me. Want to to submit and update DNS ? [11:30:22] akosiaris: sure, thanks, i think in this case the exact order doesnt matter (NS change, add to us, Apache redir) [11:31:05] mutante: yes I think the same. Doing it now [11:31:14] (03CR) 10Akosiaris: [C: 032] add vikipedia.com.tr and vikipedia.com.tr RT #5928 [operations/dns] - 10https://gerrit.wikimedia.org/r/88706 (owner: 10Dzahn) [11:33:10] i just had to mention the address they used;), San Francisco, Out of Turkey, Aruba [11:34:09] looooooooool [11:34:29] Aruba Jamaica.. oooh i wanna take ya [11:34:34] hehee [11:34:50] btw... the guys mounting that attack on spamhaus last spring ? [11:35:08] RIPE had an entry for them being based in Antartica [11:35:25] :o heh [11:35:37] and a phone in netherlads :P [11:36:13] once read about passport stamp collectors who want to "collect them all" and they talked about that one station in Antarctica that stamps passports of visitors [11:40:10] andre__: you must have loved this bug report message , so specific :) Bug 55498 - Webserver is down [11:40:30] (and then it's a tool labs project) [11:47:46] do need components for each project ?:P /me hides [11:58:15] RECOVERY - Puppet freshness on cp4001 is OK: manual debugging test [11:58:21] apergos: ^ [11:58:45] uh huh [11:58:49] so it arrives at the host and the snmptt command triggers icinga, must be in between [11:59:10] yep [11:59:24] so does the file get written? can't tell, if not why not? [11:59:35] if so, why doesn't it get processed right? 
where "the file" is the file in /var/spool/snmptt or whatever
[12:02:17] if it wasn't just ulsfo i would think snmptt needs killing and if you would have already checked tcpdump i would have thought iptables or networking
[12:02:38] but you saw the packet..so why just ulsfo
[12:03:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds
[12:04:02] don't know
[12:07:06] snmptt doesn't get it, snmptt.log has nothing from ulsfo
[12:07:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.462 second response time
[12:10:45] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:11:28] wha
[12:11:38] zillion pages
[12:11:40] damn
[12:11:43] ?
[12:11:56] esams lbs down
[12:12:14] arg, wait sorry
[12:12:16] probably just monitoring
[12:12:17] i think it was me
[12:12:42] i wanted to log dropped iptables rules
[12:12:44] on neon
[12:12:49] maybe the other way around... i can not connect to icinga
[12:12:57] mutante: ok ok... makes sense then
[12:13:18] sorry, i flushed the run and reapplying it
[12:13:43] the iptables rules, i just intended to log if anything gets dropped from ulsfo
[12:14:03] 70% packet loss now
[12:14:07] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100%
[12:14:07] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100%
[12:14:07] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100%
[12:14:07] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100%
[12:14:07] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[12:15:10] can't help with that pkt loss, sorry
[12:15:30] paravoid: there is no problem, don't worry
[12:15:45] monitoring issues
[12:15:47] i just made a mistake on neon itself
[12:15:55] sorry for pages
[12:16:39] we were debugging why puppet freshness doesnt work for ulsfo hosts
[12:16:57]
[12:17:24] apergos: yes [12:17:40] RECOVERY - Puppet freshness on cp4014 is OK: puppet ran at Wed Oct 9 12:17:38 UTC 2013 [12:17:49] there they are [12:17:50] RECOVERY - Host upload.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 90.27 ms [12:18:00] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 90.50 ms [12:18:03] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 91.85 ms [12:18:04] apergos: it is iptables then [12:18:10] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 93.81 ms [12:18:50] RECOVERY - Puppet freshness on cp4019 is OK: puppet ran at Wed Oct 9 12:18:44 UTC 2013 [12:19:40] RECOVERY - Puppet freshness on cp4005 is OK: puppet ran at Wed Oct 9 12:19:30 UTC 2013 [12:20:00] RECOVERY - Puppet freshness on cp4015 is OK: puppet ran at Wed Oct 9 12:19:55 UTC 2013 [12:21:20] RECOVERY - Puppet freshness on cp4017 is OK: puppet ran at Wed Oct 9 12:21:15 UTC 2013 [12:21:20] RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Wed Oct 9 12:21:15 UTC 2013 [12:21:26] yes, that at least is cleared up [12:25:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [12:26:31] rules restored [12:27:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 29.059 second response time [12:30:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [12:40:34] (03PS1) 10Springle: db1039 to s7 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88723 [12:43:21] (03CR) 10Springle: [C: 032] db1039 to s7 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88723 (owner: 10Springle) [12:43:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 24.456 second response time [12:46:20] (03PS1) 10Dzahn: add iptables accept from ulsfo for monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/88724 [12:46:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection timed out [12:49:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.604 second response time [12:55:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [12:55:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [12:56:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [12:57:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [12:58:13] RECOVERY - Puppet freshness on db1049 is OK: puppet ran at Wed Oct 9 12:58:07 UTC 2013 [12:58:31] time to replace that with Ferm [12:58:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [12:59:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:00:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:01:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:01:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.700 second response time [13:02:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 
hours [13:03:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:04:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:04:56] apergos: ^ now we know why it had notifications disabled ?:P [13:05:21] well next is fixing that instead of disabling it [13:05:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:06:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:07:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:08:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [13:08:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 16.632 second response time [13:08:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:09:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:10:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:11:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:12:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:13:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:14:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:15:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [13:15:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 11.920 second response time [13:15:53] PROBLEM - Puppet freshness on db1049 is CRITICAL: No successful Puppet run in the last 10 hours [13:19:25] !log start dump db1007 to db1039 for S7 file per table [13:19:37] Logged the message, Master [13:42:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [13:42:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 19.811 second response time [13:45:25] (03CR) 10Dereckson: [C: 031] "Groups exist. Rights has been checked. Ok." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88694 (owner: 10Addshore) [13:46:27] (03PS1) 10Cmjohnson: adding mw1125 back to dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/88731 [13:47:00] (03PS3) 10Dereckson: Various user rights config changes on Wikidata. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88694 (owner: 10Addshore) [13:47:11] (03CR) 10Dereckson: [C: 031] User group rights configuration on Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/88694 (owner: 10Addshore) [13:48:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [13:48:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 13.036 second response time [13:48:57] we don't use flap_detection in icinga it seems. we set it in generic-service and generic-host "flap_detection_enabled 1" but in global icinga.cfg, we enable_flap_detection=0. 
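(On the flap-detection settings just quoted: in Nagios/Icinga the per-object `flap_detection_enabled 1` only takes effect when the global `enable_flap_detection` switch is on, so with `enable_flap_detection=0` in icinga.cfg the service-level flags are inert. A quick way to confirm both layers; the paths are the Debian package defaults.)

    grep -n '^enable_flap_detection' /etc/icinga/icinga.cfg
    #   enable_flap_detection=0       <- global switch off
    grep -rn 'flap_detection_enabled' /etc/icinga/ | head
    #   flap_detection_enabled  1     <- per-host/service flags, ignored while the global is 0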
wondered if flap_detect going wrong could have anything to do with it [13:49:01] apergos: ^ [13:50:54] (03CR) 10Cmjohnson: [C: 032] adding mw1125 back to dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/88731 (owner: 10Cmjohnson) [13:51:33] hmmm http://docs.icinga.org/latest/en/flapping.html [13:52:11] interesting [14:06:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [14:06:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.568 second response time [14:08:23] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [14:10:23] PROBLEM - Puppet freshness on bast4001 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:23] PROBLEM - Puppet freshness on cp4002 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:23] PROBLEM - Puppet freshness on cp4003 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:23] PROBLEM - Puppet freshness on cp4004 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:23] PROBLEM - Puppet freshness on cp4006 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:24] PROBLEM - Puppet freshness on cp4008 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:24] PROBLEM - Puppet freshness on cp4007 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:25] PROBLEM - Puppet freshness on cp4009 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:25] PROBLEM - Puppet freshness on cp4010 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:26] PROBLEM - Puppet freshness on cp4011 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:26] PROBLEM - Puppet freshness on cp4012 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:27] PROBLEM - Puppet freshness on cp4013 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:27] PROBLEM - Puppet freshness on cp4016 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:28] PROBLEM - Puppet freshness on cp4018 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:28] PROBLEM - Puppet freshness on cp4020 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:29] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:29] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:30] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: No successful Puppet run in the last 10 hours [14:12:07] mutante: would you like to have a go at replacing those iptables rules with ferm? [14:12:24] because this just fixes it until the next dc/ip change ;) [14:13:17] (03PS1) 10Cmjohnson: isolating amssq47, changing role for amssq47-62 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88742 [14:13:47] (03CR) 10jenkins-bot: [V: 04-1] isolating amssq47, changing role for amssq47-62 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88742 (owner: 10Cmjohnson) [14:14:06] that... 
not what I mean [14:14:24] cmjohnson1: I'd like the amssq47 node definition to be the same as how it was before you changed it yesterday [14:14:29] separate completely, not in the regex [14:14:33] and with the ssl role in it [14:15:02] ok [14:15:25] (03Abandoned) 10Cmjohnson: isolating amssq47, changing role for amssq47-62 [operations/puppet] - 10https://gerrit.wikimedia.org/r/88742 (owner: 10Cmjohnson) [14:16:17] mark: yea, actually i have already made a note [14:17:02] saw example used in gitblit role [14:17:39] yes [14:17:56] we shouldn't have these ip address ranges again and again everywhere [14:18:06] ferm will help with that [14:18:51] nod, need to check out what base::firewall does as a default [14:26:30] (03PS2) 10Dereckson: Logo configuration for *.wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86660 [14:27:03] PROBLEM - MySQL Processlist on db1020 is CRITICAL: CRIT 119 unauthenticated, 0 locked, 0 copy to table, 0 statistics [14:27:03] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:03] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:03] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - 3495 bytes in 0.184 second response time [14:27:04] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:04] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:04] PROBLEM - Apache HTTP on mw1212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:05] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:05] PROBLEM - Apache HTTP on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:11] (03CR) 10Dereckson: "PS2: Rebased" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86660 (owner: 10Dereckson) [14:27:13] PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:13] PROBLEM - Apache HTTP on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:13] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:13] PROBLEM - Apache HTTP on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:13] PROBLEM - Apache HTTP on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:21] Request: POST http://www.wikidata.org/w/index.php?title=Wikidata_talk:Bots&action=submit, from 10.64.0.137 via cp1006.eqiad.wmnet (squid/2.7.STABLE9) to () [14:27:21] Error: ERR_CANNOT_FORWARD :< [14:27:39] (03CR) 10Dereckson: "(logos)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86660 (owner: 10Dereckson) [14:28:07] uh oh [14:28:21] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:21] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:21] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.628 second response time [14:28:21] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.376 second response time [14:28:23] RECOVERY - LVS HTTP IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 78098 bytes in 2.083 second response time [14:28:23] RECOVERY - Apache HTTP on mw1176 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.958 second response time [14:28:23] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.205 second response time [14:28:23] RECOVERY - Apache HTTP on mw1175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.849 second response time [14:28:23] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.020 second response time [14:28:24] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.497 second response time [14:28:24] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.768 second response time [14:28:25] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.793 second response time [14:28:25] RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.202 second response time [14:28:26] RECOVERY - Apache HTTP on mw1215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.257 second response time [14:28:26] RECOVERY - Apache HTTP on mw1186 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.519 second response time [14:28:27] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.622 second response time [14:28:27] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.296 second response time [14:28:28] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.424 second response time [14:28:28] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.459 second response time [14:28:29] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.755 second response time [14:28:29] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [14:28:30] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.044 second response time [14:28:30] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [14:28:31] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [14:28:31] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [14:28:33] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.530 second response time [14:28:33] RECOVERY - LVS HTTP IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 593 bytes in 3.476 second response time [14:28:35] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.045 second response time [14:28:36] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [14:28:36] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [14:28:36] RECOVERY - Backend Squid HTTP on amssq31 is OK: HTTP OK: HTTP/1.0 200 OK - 1423 bytes in 0.187 second response time [14:28:36] RECOVERY - Frontend Squid HTTP on amssq45 is OK: HTTP OK: HTTP/1.0 200 OK - 1419 bytes in 0.464 second 
response time [14:28:36] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.969 second response time [14:28:36] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.111 second response time [14:28:37] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.595 second response time [14:28:37] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.814 second response time [14:28:38] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.014 second response time [14:28:38] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.749 second response time [14:28:39] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.863 second response time [14:28:39] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.563 second response time [14:28:40] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.374 second response time [14:28:40] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.686 second response time [14:28:41] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.394 second response time [14:28:43] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.189 second response time [14:28:43] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.479 second response time [14:28:43] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.657 second response time [14:28:43] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.501 second response time [14:28:43] PROBLEM - Frontend Squid HTTP on amssq33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:49] :P [14:28:53] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [14:28:53] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.092 second response time [14:28:53] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [14:28:53] RECOVERY - Apache HTTP on mw1212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [14:28:53] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [14:28:54] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.062 second response time [14:28:54] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.073 second response time [14:29:03] RECOVERY - LVS HTTPS IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 594 bytes in 0.130 second response time [14:29:06] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.046 second response time [14:29:06] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [14:29:06] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: 
HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [14:29:06] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [14:29:06] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [14:29:25] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.204 second response time [14:29:25] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.500 second response time [14:29:25] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.580 second response time [14:29:25] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.349 second response time [14:29:25] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.400 second response time [14:29:26] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.711 second response time [14:29:33] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.445 second response time [14:29:33] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.073 second response time [14:29:33] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.072 second response time [14:29:33] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.079 second response time [14:29:33] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.082 second response time [14:29:33] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.079 second response time [14:29:34] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.089 second response time [14:29:34] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.105 second response time [14:29:35] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.765 second response time [14:29:35] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.403 second response time [14:29:43] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.008 second response time [14:29:43] RECOVERY - Frontend Squid HTTP on amssq33 is OK: HTTP OK: HTTP/1.0 200 OK - 1417 bytes in 7.920 second response time [14:29:43] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.087 second response time [14:29:53] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [14:29:54] RECOVERY - Backend Squid HTTP on amssq33 is OK: HTTP OK: HTTP/1.0 200 OK - 1423 bytes in 0.180 second response time [14:30:03] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [14:30:03] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.106 second response time [14:30:14] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.090 second response time [14:30:14] RECOVERY - Apache HTTP 
on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.089 second response time [14:30:15] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.096 second response time [14:30:18] (03PS1) 10Cmjohnson: Fixing role for amssq48-62, reverting amssq47 change [operations/puppet] - 10https://gerrit.wikimedia.org/r/88744 [14:30:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.602 second response time [14:31:40] (03CR) 10Mark Bergsma: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88744 (owner: 10Cmjohnson) [14:34:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [14:34:52] mark: wondering where $INTERNAL used in the ferm rule becomes 10.0.0.0/8 when it's applied (as on antimony) [14:34:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 16.956 second response time [14:35:26] mutante: some config file, perhaps in the module? [14:35:28] iirc [14:35:50] (03PS2) 10Cmjohnson: Fixing role for amssq48-62, reverting amssq47 change [operations/puppet] - 10https://gerrit.wikimedia.org/r/88744 [14:36:45] (03CR) 10Mark Bergsma: [C: 031] Fixing role for amssq48-62, reverting amssq47 change [operations/puppet] - 10https://gerrit.wikimedia.org/r/88744 (owner: 10Cmjohnson) [14:38:05] (03CR) 10Cmjohnson: [C: 032] Fixing role for amssq48-62, reverting amssq47 change [operations/puppet] - 10https://gerrit.wikimedia.org/r/88744 (owner: 10Cmjohnson) [14:38:33] hmmm seems like some pages from 10 mins ago just arrived... [14:38:48] weird [14:38:56] :) [14:38:57] mark: ah, modules/base/files/firewall as opposed to modules/ferm/ [14:41:28] (03PS3) 10Akosiaris: elasticsearch plugins in git-deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/88455 [14:42:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [14:42:18] (03CR) 10Akosiaris: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88455 (owner: 10Akosiaris) [14:42:53] akosiaris: ah git-deploy !!! [14:43:10] hasharCall: yeeees ? [14:43:14] akosiaris: do we have any basic doc to add a project to git-deploy ? I am too lazy to reverse engineer the puppet manifests [14:43:28] akosiaris: andI hate how you have to change several hash all over the class :( [14:43:44] that makes two of us [14:43:54] ok so that is bug 1 [14:44:10] (our doc sucks) [14:44:16] https://wikitech.wikimedia.org/wiki/Git-deploy [14:44:36] this needs some updates... i am more of less testing it right now [14:44:50] yeah I have seen that one, not that helpful :( [14:45:07] really ? it is not that bad [14:45:21] I should bribe Ryan Lane to write us a noob step-by-step tutorial [14:45:45] he is going to just send you the same link [14:46:11] please .. 
send me an email with the obscure parts so i can fix them [14:46:15] (03PS1) 10Mark Bergsma: Remove upstream definition as we only connect to localhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/88747 [14:46:21] mutante: mw1125...before I enable pybal plz review make sure I didn't overlook something bringing it back to life with the new disk [14:46:43] (03CR) 10Akosiaris: [C: 032] elasticsearch plugins in git-deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/88455 (owner: 10Akosiaris) [14:47:28] akosiaris: greg-g is taking of it already :-] [14:47:42] akosiaris: we evoked the issue yesterday during a release/QA weekly checkin meeting. [14:48:03] cool. looking forward to it then [14:48:14] I might eventually have to dig in that myself [14:48:21] gotta need it to deploy some shell scripts to all jenkins slaves [14:48:35] I can't find a way to do it in Jenkins itself (aka run the same job on all slaves) [14:51:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 16.040 second response time [14:52:53] RECOVERY - HTTPS on amssq47 is OK: OK - Certificate will expire on 01/20/2016 12:00. [14:55:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [14:59:25] akosiaris: mailed you / ryan / greg regarding salt & git-deploy. Thanks for the suggestion! [14:59:34] :-) [15:01:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 27.977 second response time [15:02:30] akosiaris: also while you are context switched there: I got a bunch of debian backports pending. Want me to send you a summary ? [15:02:41] apparently faidon piped all requests to you :-] [15:02:55] i am gonna get him for that... [15:03:00] but yes ... please do [15:06:44] btw cmjohnson1 or mark, not sure who: [15:06:48] Oct 9 15:04:44 neon puppet-agent[19054]: Could not retrieve catalog from remote server: Error 400 on SERVER: Must pass ip_address to Monitor_service_lvs_http[wiktionary-lb.eqiad.wikimedia.org] at /etc/puppet/manifests/lvs.pp:984 on node neon.wikimedia.org [15:07:11] ah right [15:07:13] that would be me [15:10:22] (03PS1) 10Mark Bergsma: Correct text/text-varnish IP hash references [operations/puppet] - 10https://gerrit.wikimedia.org/r/88751 [15:10:35] akosiaris: I've marked you down as working on monitoring refactors here: https://wikitech.wikimedia.org/wiki/Puppet_Todo#Manifests_status [15:10:51] andrewbogott: :-) [15:11:00] akosiaris: But, regarding https://gerrit.wikimedia.org/r/#/c/88507/, there is an actual bug there, which is that it depends on an undefined package. [15:11:09] We can just remove that line, but things may happen out of order in that case. [15:11:32] well, that line may be intentionally there [15:12:15] because this class is supposed to be included only on hosts that also include a class that provide Package['icinga'] [15:12:18] namely icinga hosts... [15:12:34] that being said... I don't like it. [15:12:46] Um… well, a) depending on a sometimes-undefined-package as flow control is bad, and b) I'm pretty sure that package is /never/ defined. [15:12:46] Maybe b) is wrong... [15:13:20] (03CR) 10Mark Bergsma: [C: 032] Remove upstream definition as we only connect to localhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/88747 (owner: 10Mark Bergsma) [15:13:23] b) must be wrong because it works on the icinga host... [15:13:32] for a) i am with you [15:13:33] Yeah, you're right. [15:13:35] So it is flow control. 
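A sketch of the two patterns at issue here and in the exchange that follows: today the plugin file reaches only icinga hosts implicitly, by requiring a Package['icinga'] that is only declared there, while the alternative being suggested is to declare the plugin virtually, tag it, and let the icinga module realize the collection. Resource titles and source paths below are illustrative, not the actual manifests:

    # current pattern: flow control via a package that is only declared
    # on icinga hosts; breaks the run anywhere Package['icinga'] is absent
    file { '/usr/lib/nagios/plugins/check_elasticsearch':
        source  => 'puppet:///modules/elasticsearch/check_elasticsearch',
        require => Package['icinga'],
    }

    # suggested pattern: declare virtually (note the @) in the owning module...
    @file { '/usr/lib/nagios/plugins/check_elasticsearch':
        source => 'puppet:///modules/elasticsearch/check_elasticsearch',
        tag    => 'icingaplugin',
    }
    # ...and realize the whole tagged collection from the icinga module
    File <| tag == 'icingaplugin' |>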
[15:13:39] (03CR) 10Mark Bergsma: [C: 032] Correct text/text-varnish IP hash references [operations/puppet] - 10https://gerrit.wikimedia.org/r/88751 (owner: 10Mark Bergsma) [15:13:47] I'll see if I can fix that. [15:14:26] i have the feeling it won't be easy to fix in-module [15:14:43] akosiaris: regarding monitoring refactors… do you think that e.g. icinga config files should live in the icinga module, or should they live in the respective modules of the components that are getting monitored? [15:15:00] In this patch I move a bunch of mysql monitoring stuff into the mysql module, but I'm not 100% certain that's the right approach. https://gerrit.wikimedia.org/r/#/c/88666/ [15:15:07] ahhhh monitoring [15:15:22] b) as virtual resources realized by the icinga module [15:15:26] some folks gave me a crazy tip: have icinga to send plugins performances data to statsd / graphite. So we can graph them :-] [15:16:19] akosiaris: sorry, I don't quite understand what you mean by virtual-resources-realized. [15:16:39] hashar, that seems reasonable :) [15:17:14] andrewbogott: I think it is the right approach but the manifests should define the file resources populating the plugins virtual (with a single @ before them) [15:17:26] Jeff_Green: the ulsfo fundraising backup server will arrive tomorrow. I'll rack it up and get the mgmt accessible. Did you need anything special for setup or just typical OS? [15:17:38] and tagging them so the icinga module can realize them with a collection [15:18:01] like <| File tag == 'icingaplugin |> [15:18:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:18:16] RobH: thinking... [15:18:24] Oh! OK, I've never done something like that, let me read some docs. [15:18:55] so ... in case the icinga classes are not included in a host, the plugins never get realized, stuff never gets populated... otherwise... we get what we want :-) [15:19:25] RobH: let's do a hardware RAID1 with /boot, swap, and the rest in / [15:19:40] robh and just the usual minimal precise install [15:20:02] I'll hook it up to frack puppet, and set up the add-on drives as software RAID1 later [15:20:03] cool, i tend to put the / into an lvm [15:20:05] that cool? [15:20:16] eh [15:20:20] no lvm? [15:20:25] i don't bother personally [15:20:28] thats fine too, just default is now LVM [15:20:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.123 second response time [15:20:55] i got my car stuff all fixed yesterday, so now i can drive to ulsfo, no more taxis! [15:20:59] woooo [15:21:01] nice [15:21:07] car insurance in CA is insane expensive. [15:21:13] oh yes [15:21:17] I spend more per month now than I did per 3 months in DC [15:21:28] you know to shop around right? it's really varied [15:21:38] akosiaris: OK, reading a bit (but not enough yet, probably…) defining these files as virtual in the icinga code is basically just documentation, right? [15:21:42] in MA it's all regulated [15:21:50] I've been with progressive now for a decade, my discounts are good [15:21:57] but i tried state farm and geico as well [15:22:02] Given that the icinga manifests don't /need/ to know about the files... [15:22:13] RobH: I'll send you a recommendation by pm [15:22:19] cool [15:22:26] cuz can always swap insurance, get refund =] [15:23:17] andrewbogott: no.... 
if you don't declare in the icinga module (or wherever for that matter) something realizing them there will never be any files in the systems in question [15:24:07] the idea is that you transfer the decision whether a resource becomes real [15:24:13] from the actual class to another class [15:24:15] Oh, maybe I had this backwards… you think they should be declared (virtually) by mysql, and realized by the icinga module? [15:24:21] exactly [15:25:08] Ok, clearly I need to read more... [15:25:12] Hey guys. I'm having trouble connecting to db1047 from stat1.wikimedia.org. [15:25:18] "Can't connect to MySQL server on 'db1047.eqiad.wmnet'" [15:25:28] Was working 10 minutes ago. [15:26:04] hashar: akosiaris I was pinged because of git-deploy? Do I need to read scrollback more closely? :) [15:26:23] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-db1047 [15:26:29] greg-g: i can summarize [15:27:48] I have started deploying elastic search plugins with git-deploy. Hashar noticed and asked for some more clarification on how/what git-deploy should be used for and more documentation [15:28:31] plus some possible changes... mostly DRY stuff [15:29:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:29:17] greg-g: na it is your email [15:29:34] greg-g: basically was ranting at git-deploy doc, alexanders hinted to send an email to ryan , him and you ) [15:29:41] cool, perfect :) [15:35:25] greg-g: just so you're aware more projects are on varnish in eqiad as of today [15:35:36] everything non wikipedia that is not accessed over ipv6 or https [15:36:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.077 second response time [15:36:07] Oooh, more text on Varnish? Nice [15:36:12] next steps are doing the same in esams [15:36:14] mark: cool! [15:36:20] and then we're gonna look at moving the big one ;) [15:36:52] mark again: neon, Oct 9 15:34:43 neon puppet-agent[15026]: Could not retrieve catalog from remote server: Error 400 on SERVER: left operand of - is not a number at /etc/puppet/manifests/lvs.pp:1164 on node neon.wikimedia.org [15:36:56] i'll plan that for a few days before my 3 week vacation to thailand I think [15:37:35] apergos: fixing [15:37:37] mark: wow, 3 weeks? 
niiice [15:37:46] thanks [15:38:03] (03PS1) 10Mark Bergsma: Correct typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/88754 [15:39:27] honestly, text varnish traffic jumped by just 10% or so today [15:39:34] with all the remaining non-wikipedia projects [15:39:45] doesn't stack up to commons/meta ;) [15:39:48] (03PS2) 10Hashar: Remove unused (and broken) labs config from squid manifest [operations/puppet] - 10https://gerrit.wikimedia.org/r/88698 (owner: 10Ori.livneh) [15:39:57] (03CR) 10Hashar: [C: 031] Remove unused (and broken) labs config from squid manifest [operations/puppet] - 10https://gerrit.wikimedia.org/r/88698 (owner: 10Ori.livneh) [15:40:20] (03CR) 10Mark Bergsma: [C: 032] Correct typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/88754 (owner: 10Mark Bergsma) [15:42:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [15:43:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 26.151 second response time [15:46:12] (03PS1) 10Dzahn: add networks to ferm,convert icinga iptables to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/88755 [15:46:30] (03CR) 10jenkins-bot: [V: 04-1] add networks to ferm,convert icinga iptables to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/88755 (owner: 10Dzahn) [15:47:06] mutante: you should split that in different commits [15:48:00] and translating the existing iptables stuff 1:1 to ferm is probably not the right way to go at it... [15:48:41] mark: ok, splitting it [15:49:51] mutante: coool! [15:49:57] I'd be happy to help with that [15:50:06] I'm currently on a trip with very crappy conference wifi [15:50:22] but I can provide asynchronous help for now [15:50:35] :) [15:51:39] (03PS2) 10Dzahn: add our networks as variables to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/88755 [15:52:31] actually, I have another take on that [15:52:48] we should add bastions (and whatever else is missing) to network.pp [15:52:58] and then make this a template generated from network.pp [15:53:05] yes [15:53:55] so that we have one single source for network info [15:54:08] mutante: ^ [15:54:32] also, there's no reason to put single IPs between parentheses [15:54:56] single ips/networks that is [15:55:10] but it doesn't hurt either [15:58:57] ok, I've got to go [15:59:15] everyone went off for socializing/drinks [15:59:40] paravoid: thanks, enjoy the drinks [16:04:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:05:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.045 second response time [16:06:38] * apergos wonders why mw1017 has packages 5.3.10-1ubuntu3.8+wmf1 for many php packages and now wants to downgrade to 5.3.10-1ubuntu3.6+wmf1 (and fails)  [16:08:03] for comparison, mw1020 is happily on 5.3.10-1ubuntu3.6+wmf1 in the first place [16:09:27] apergos: that would be me [16:10:01] see https://rt.wikimedia.org/Ticket/Display.html?id=5912 [16:14:15] ah all to be completed... tomorrow? or 10-08? 
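Picking up paravoid's suggestion above about the ferm change: rather than copying the iptables address literals 1:1, the ranges would live once in network.pp (bastions and anything else missing added there) and the ferm side would be generated from that single source, whether as a template or a shared variable. A very rough sketch; the class name, variable name and ferm::rule interface are assumptions, and the ranges are only examples:

    # one place that knows our networks (network.pp or similar)
    class network::constants {
        $all_networks = '(208.80.152.0/22 91.198.174.0/24 2620:0:860::/46 10.0.0.0/8)'
    }

    # a converted icinga rule then points at the shared list
    ferm::rule { 'icinga-nsca':
        rule => "proto tcp dport 5667 saddr ${network::constants::all_networks} ACCEPT;",
    }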
anyways, shortly [16:15:28] ok, puppet is unhappy over there but that wil go away at the end of this, thanks for the info [16:20:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:21:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 22.258 second response time [16:23:56] (03PS2) 10Andrew Bogott: Tune up installation behavior for /usr/lib/nagios/plugins/check_elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [16:25:38] akosiaris: First try, does this look right? [16:25:49] looking at it now [16:26:18] (03CR) 10Akosiaris: [C: 032] Tune up installation behavior for /usr/lib/nagios/plugins/check_elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/88507 (owner: 10Chad) [16:26:22] :-) [16:28:18] cool, thanks. [16:31:17] anybody here know much about git deploy? [16:33:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [16:34:23] ottomata: that would be me [16:34:36] awesome! [16:34:43] i am at least trying to perform my first deploy today [16:34:46] nice! [16:34:49] what are you deploying? [16:34:56] elasticsearch plugins [16:34:59] aye cool [16:35:06] great, then you are def the man I want to talk to [16:35:18] i'm looking into using it for kraken [16:35:22] which is also mostly java stuff [16:36:31] right now, all I really want to do is deploy versions of the kraken repository to a directory on all of the analytics (hadoop) nodes [16:36:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 15.574 second response time [16:37:07] (03CR) 10Faidon Liambotis: [C: 04-1] "18:52 < paravoid> actually, I have another take on that" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88755 (owner: 10Dzahn) [16:37:42] ok... so at which stage are you ? [16:37:50] the very beginning! [16:37:55] we have a repository :) [16:38:09] https://wikitech.wikimedia.org/wiki/Git-deploy [16:38:20] so this is your first step... reading ... [16:38:35] i am also reading it ... whatever seems unclear ... tell me ... maybe i know [16:38:46] maybe not in which case we fallback to ryan [16:39:16] k reading [16:39:22] also... my first commit for configuring git-deploy for elastic search is https://gerrit.wikimedia.org/r/#/c/88455/ [16:39:33] keep in mind a very simple rule [16:39:44] whatever directory you are deploying from on tin [16:39:53] that is the destination directory on the end servers [16:40:00] !log puppetmaster was overloaded and returning 500s, tried killing open processes and restarting puppetmaster on stafford [16:40:04] Coren: ^ [16:40:16] Logged the message, RobH [16:40:18] puppetmaster is constantly at 100% ... [16:40:24] yep [16:40:28] im not the first person to do this. [16:40:41] yeah i know.. i 've done it too [16:40:44] My own problem is with the puppetmaster on virt0 which just fails to start. [16:40:51] Coren: doesnt it use the same puppetmaster? [16:40:57] oh, maybe not [16:41:03] but i got puppetmaster fails for stafford [16:41:04] heh [16:41:19] akosiaris: So in the past i brought up buying a much heftier machine for puppetmaster [16:41:30] but i think the general consensus was figure out a way to scale horizontally [16:41:40] yeah... 
i already have something in mind [16:41:45] i just liked the stop-gap solution of throwing some more hardware at it in meantime =] [16:42:11] talked about it with Faidon, i think i am gonna have some time to look at it next week [16:42:16] really, so /srv/deployment everywhere akosiaris? [16:42:44] RobH: so I might show up and say gimme two hefty machines in eqiad :-) [16:43:03] ottomata: well... it is like that now.. it doesn't have to be [16:43:05] wouldn't puppetdb help? [16:43:20] but i chose the same directory and will use symlinks to keep it clean [16:43:26] akosiaris: for puppet, you can have whatever you want. [16:43:27] more than one master would be awesomely awesome [16:43:41] i don't feel like deploying from /usr/share/java/blah blah [16:43:53] akosiaris: so the best i have now is Dell PowerEdge R420, Intel Xeon E5-2440, 32GB Memory, Dual 300GB SSD, Dual 500GB Nearline SAS [16:44:02] i have three of those [16:44:16] also have non ssd and 16GB version [16:44:25] you have many choices =] [16:44:32] https://wikitech.wikimedia.org/wiki/Server_Spares [16:44:44] (but you still have to file procurement ticket, you cannot just take items off that page ;) [16:44:53] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time [16:44:54] Memory we probably won't need 32G... it is barely 8G at stafford now... [16:45:12] and i never see the machine in IOwait so ... no SSDs ? [16:45:18] then i imagine the Dell PowerEdge R610, dual Intel Xeon X5647, 16 GB Memory works [16:45:25] RobH: virt0's problem, at least, was fixed with a kick in the apache. [16:45:25] which is good, i have many of those. [16:45:47] Coren: yea no worries, i misunderstood your issue and it happened to occur at same time as stafford error report [16:46:42] so 8 core, 16 (yeah right) with HT (but we do disable HT right?) [16:47:03] akosiaris: is there a reason you dropped the / from your deployment target? [16:47:04] "elasticsearch/plugins" => "elasticsearchplugins", [16:47:10] stafford is gonna be unhappy til we get something to share the load with it [16:47:26] apergos: yes :-/ [16:47:48] akosiaris: actually its case by case bassis [16:47:59] we disable HT for apaches, as we have not seen a reason to keep it on [16:48:07] but recently for other items we have found performance increases [16:48:08] ottomata: yes... did not want to pass a / at deployment::target [16:48:20] why not though? just curious [16:48:25] akosiaris: so HT is off or on depending on what you think puppet will respond well to, there is no policy stating it has to be off. [16:49:20] ottomata: well it is the name of the repository [16:49:47] it *isn't* the name of a directory [16:49:55] a) no false impressions that this is a directory [16:50:08] its just an arbitrary name, right? [16:50:16] b) avoid potential problems that might show up [16:50:21] yes ... it is arbitrary [16:50:32] hm [16:50:38] it is a salt grain [16:50:38] hm [16:50:43] ok i will use a hyphen :[ [16:50:45] :p [16:51:29] RobH: ok.. i 'll keep that in mind thanx [16:57:25] (03PS1) 10QChris: Turn off fetching geowiki-data until it comes back in gerrit [operations/puppet] - 10https://gerrit.wikimedia.org/r/88759 [16:58:53] have you guys seen this? 
https://github.com/jamtur01/puppet-ganglia [16:59:26] it's a puppet reporting plugin that sends stats to ganglia, written by james turnbull [17:00:08] (03PS1) 10Ottomata: Adding deployment target for analytics/kraken repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 [17:00:46] hmm, i'll adapt it for statsd [17:00:49] this is awesome [17:01:51] akosiaris: https://gerrit.wikimedia.org/r/#/c/88760 [17:03:19] (03CR) 10Ottomata: [C: 032 V: 032] Turn off fetching geowiki-data until it comes back in gerrit [operations/puppet] - 10https://gerrit.wikimedia.org/r/88759 (owner: 10QChris) [17:06:27] (03CR) 10Akosiaris: [C: 031] "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 (owner: 10Ottomata) [17:09:05] ori-l: the report api is not very well designed [17:09:10] you can only have one [17:09:30] one what ? [17:09:37] report processor [17:09:42] ah... :-( [17:10:10] damn. I was planning to use that for puppet freshness checks [17:11:10] it's a very simply ruby api, could just implement a super-thin reporter that delegates to multiple other reporters [17:17:35] (03PS1) 10Akosiaris: Add plugins head, paramedic and segmentspy [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/88762 [17:18:51] ori-l: someone has done it already :) [17:19:01] but yeah, it's suboptimal [17:19:58] (03CR) 10Akosiaris: [C: 032] Add plugins head, paramedic and segmentspy [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/88762 (owner: 10Akosiaris) [17:21:03] RECOVERY - Puppet freshness on amssq49 is OK: puppet ran at Wed Oct 9 17:20:56 UTC 2013 [17:26:13] akosiaris: howdy. did your git-deploy deployment go ok? [17:27:31] Ryan_Lane: finishing it now... we well soon now [17:27:52] (03CR) 10Akosiaris: [V: 032] Add plugins head, paramedic and segmentspy [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/88762 (owner: 10Akosiaris) [17:28:13] thanx for the help btw... and the docs ... pretty cool :-) [17:28:19] cool, let me know if you run into any issues [17:28:24] yw [17:32:01] (03CR) 10Awjrichards: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [17:32:37] Ryan_Lane: # USER : You must now hand execute the synchronization process and then execute git-deploy finish... I am lost here .... [17:33:13] where's that? [17:33:34] tin:/srv/deployment/elasticsearch/plugins# [17:33:41] it is after a git-deploy --force sync [17:33:44] oh [17:33:53] akosiaris: did you do the temporary hack? [17:33:58] https://wikitech.wikimedia.org/wiki/Git-deploy#.28Temporary_hack.29_As_root.2C_create_a_top_level_directory_for_the_sync_hook [17:34:04] the /var/lib stuff ? yes [17:34:49] hm. how are the directory permissions messed up for your repo? [17:35:06] :D owner by root/wikidev? [17:35:18] you should be using your own user for creating the repo and such [17:35:21] hmmmmm [17:35:29] that i did not know... [17:35:30] that wouldn't cause this problem, though [17:35:42] one sec [17:35:57] ah [17:36:07] /var/lib/git-deploy/hooks/sync/elasticsearchplugins [17:36:43] that should be /var/lib/git-deploy/hooks/sync/elasticsearch [17:36:53] damn [17:36:57] yeah i just noticed [17:37:01] sorry :-( [17:37:11] no worries. this is a stupid error-prone step [17:37:30] I'm working on a change to automate even the bootstrapping [17:37:57] so that no manual steps will be done [17:39:14] do: git deploy abort [17:39:35] just did ... 
and then start and then --force sync [17:39:37] same error [17:40:42] fatal: Unknown commit none/master [17:40:47] maybe that is the reason ? [17:41:32] (03CR) 10Ryan Lane: [C: 04-1] "Agreed on the inline comments." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 (owner: 10Ottomata) [17:42:19] has puppet run on the targets? [17:42:31] it should have... a long time now [17:42:37] * Ryan_Lane nods [17:43:02] one sec [17:43:13] RECOVERY - NTP on amssq49 is OK: NTP OK: Offset -0.01337885857 secs [17:45:59] hm [17:46:28] maybe a permissions issue [17:48:10] File does not exist: /srv/deployment/elasticsearch/plugins/.git/info/refs [17:48:13] on tin [17:48:38] huh ? [17:50:19] that's what apache says [17:51:34] akosiaris: can you do: git deploy abort [17:51:56] done [17:52:36] hm [17:52:41] it says you're still in a deploy [17:52:56] I'm going to wipe out the state files [17:53:14] Running: sudo salt-call -l quiet --out json pillar.data [17:53:14] Running: sudo salt-call -l quiet publish.runner deploy.fetch 'elasticsearch/plugins' [17:53:19] it is still running these... [17:53:21] ah [17:53:25] no wonder [17:53:37] heh. well you're going to get an error in a second :) [17:53:44] just did [17:53:46] because I wiped out the deployer's state files [17:53:56] ok, so I know what happened [17:53:57] Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): [17:54:02] ctrl-c [17:54:12] *** DON'T PANIC *** [17:54:15] heh [17:54:15] looooooooooool [17:54:26] i like the touch :-) [17:55:09] the sync script (which didn't get run because the directory didn't exist) runs something that makes the repo info available to apache [17:55:51] which means the minions can't find the repo [17:56:19] and it put the deployment into a weird state, because it assumes the script is there ;) [17:56:23] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [17:56:30] akosiaris: so, try the deployment again [17:56:34] it'll work this time [17:56:49] I'll push in some changes to make this impossible :) [17:56:54] icinga-wm is very strangely silent [17:57:00] oh, not it isn't [17:57:17] so... let me get this straight first [17:57:35] if i had mkdir the correct path in /var/lib/git-deploy [17:57:39] everything would be ok ? [17:58:02] I need somebody to run the debugger as user parsoid on one of the parsoid machines [18:01:07] akosiaris: yep [18:01:21] # NOTE : Looks like you are all done! Have a nice day. [18:01:31] you too git-deploy :-). Thanks Ryan_Lane :-) [18:01:35] heh. yw [18:03:16] manybubbles: could i restart (or whatever is needed for elastic search to see its plugins ) elasticsearch on testsearch1001 ? [18:04:03] akosiaris: you can restart it, but you shouldn't restart another one until the cluster status has turned 'green' [18:04:20] akosiaris: curl -i localhost:9200/_cluster/health [18:04:29] ok thanks [18:04:37] akosiaris: but if it is a site plugin it might not need a restart [18:07:23] RECOVERY - Puppet freshness on amssq50 is OK: puppet ran at Wed Oct 9 18:07:12 UTC 2013 [18:12:14] any roots around to run some sudo -u parsoid commands for me? 
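manybubbles' restart guidance above amounts to a one-node-at-a-time loop: restart, then wait for the cluster to report green before touching the next host. A rough shell sketch; the service name and sleep interval are assumptions:

    # on one elasticsearch node at a time
    sudo service elasticsearch restart
    # wait until shard recovery is done before moving to the next node
    until curl -s localhost:9200/_cluster/health | grep -q '"status" *: *"green"'; do
        sleep 10
    done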
[18:13:12] gwicke: tell me [18:14:23] PROBLEM - Puppet freshness on amssq48 is CRITICAL: No successful Puppet run in the last 10 hours [18:14:43] akosiaris: I'd like to attach the debugger to process 7957 on wtp1015: su gwicke; screen -x; sudo -u parsoid node debug -p 7957 [18:15:27] or something like that, so that I can attach to the screen session and poke around in the debugger after it is attached [18:15:33] RECOVERY - NTP on amssq50 is OK: NTP OK: Offset 0.005442976952 secs [18:16:19] I opened a ticket for sudo -u parsoid access too at https://rt.wikimedia.org/Ticket/Display.html?id=5934 [18:17:23] PROBLEM - Puppet freshness on amssq51 is CRITICAL: No successful Puppet run in the last 10 hours [18:19:03] RECOVERY - Puppet freshness on amssq51 is OK: puppet ran at Wed Oct 9 18:18:58 UTC 2013 [18:19:35] gwicke: that ain't gonna work... screen will fail about not being able to open mine (root's) pts device [18:20:35] akosiaris: we did something like this before- I got into the debugger in a shared screen session running as my user [18:21:12] so maybe 'su - gwicke' ? [18:21:45] doesn't have to do with how i change my uid... it about the uid change... [18:21:59] gimme a sec [18:26:21] akosiaris: maybe make that pid 5606 instead [18:28:10] gwicke: try screen -x root/shared [18:28:34] Must run suid root for multiuser support. [18:29:03] RECOVERY - Puppet freshness on amssq53 is OK: puppet ran at Wed Oct 9 18:28:54 UTC 2013 [18:29:33] grrr [18:31:13] Ryan_Lane: should the pmtpa deployment_repo_urls use $deploy_server_pmtpa? [18:31:23] some of them do, some of them use $deploy_server_eqiad [18:31:52] gwicke: again plz it should work now [18:32:22] akosiaris: am in, thanks! [18:32:53] RECOVERY - Puppet freshness on amssq54 is OK: puppet ran at Wed Oct 9 18:32:46 UTC 2013 [18:33:21] gwicke: just don't quit the node debugger... you will back to square zero then [18:33:57] (03PS2) 10Ottomata: Adding deployment target for analytics/kraken repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 [18:34:08] akosiaris: yes, I know [18:34:38] would be good to get sudo -u parsoid access on those machines [18:34:50] ottomata, Coren, is stat1 still a going concern? (I have no reason to doubt it, I'm just following a dependency thread and the stat1 entry in site.pp is the last stop. Want to make sure it's used before I dive in.) [18:35:26] stat1 conecrn about what? [18:35:29] andrewbogott: I know nothing of stat1. [18:35:50] Coren, ottomata, git-blame flags you as having touched its site.pp entry :) [18:36:17] ottomata: by 'going concern' I mean, is it still used, running, etc? Or is it a tampa box marked for death? [18:36:24] Or both? :/ [18:36:29] still running :) [18:36:33] yes, still used, [18:36:42] ok, thanks. [18:36:47] andrewbogott: It may have been when I did the global seek and destroy of foo == "true" [18:36:52] but mainly as an entry point for people to access research db slaves [18:37:14] and as the host of the old user metrics api [18:37:20] which will be deprecated someday? drdee? [18:37:23] ottomata: include role::statistics::cruncher <- the line of interest [18:37:27] are there any other active uses we know of? [18:37:50] yes umapi will be deprecated [18:37:51] yes, it was previously used to crunch private webrequest udp2log data [18:37:55] but that has been moved to stat1002 [18:38:03] RECOVERY - Puppet freshness on amssq55 is OK: puppet ran at Wed Oct 9 18:37:52 UTC 2013 [18:38:05] when does stat1 need to be moved to eqiad? [18:38:25] Oh! 
So maybe I can purge role::statistics::cruncher and related? [18:39:12] eeeeeeeeeeee [18:39:19] hm [18:39:53] drdee_, are we still doing gerrit stats? [18:39:55] (Context: There's a bunch of code in mysql.pp that is only used by a) stuff I wrote, and b) cruncher ) [18:39:58] no [18:40:32] hmm, i think mysql is not used on stat1 [18:40:33] RECOVERY - NTP on amssq51 is OK: NTP OK: Offset -0.0151873827 secs [18:40:38] i'm looking at the databases there [18:40:48] declerambaul and erosen [18:40:53] neither of them work for WMF anymore [18:41:33] andrewbogott: is it just the misc::statistics::db::mysql class? [18:41:36] we can probalby just remove that [18:42:12] ottomata: Ideally I want to purge generic::mysql::server entirely [18:42:23] PROBLEM - Puppet freshness on amssq52 is CRITICAL: No successful Puppet run in the last 10 hours [18:42:26] manybubbles: seems like testsearch1001 has the plugins you asked deployed just fine. [18:42:58] I will be submitting a gerrit changeset soon to have them enabled everywhere [18:43:13] RECOVERY - Puppet freshness on amssq56 is OK: puppet ran at Wed Oct 9 18:43:05 UTC 2013 [18:44:04] akosiaris: sweet! thanks! [18:44:56] andrewbogott: the only place I see that used by stat1 is the misc::statistics::db::mysql class [18:45:01] so we can just remove that class [18:45:20] and its include from cruncher [18:45:33] ottomata: cool. I'll ping you when I have a patch. [18:45:56] k [18:46:03] (03PS3) 10Ottomata: Adding deployment target for analytics/kraken repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 [18:46:15] (03CR) 10Ottomata: [C: 032 V: 032] Adding deployment target for analytics/kraken repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/88760 (owner: 10Ottomata) [18:49:13] RECOVERY - Puppet freshness on amssq57 is OK: puppet ran at Wed Oct 9 18:49:08 UTC 2013 [18:49:33] RECOVERY - NTP on amssq53 is OK: NTP OK: Offset -0.01760184765 secs [18:50:33] RECOVERY - NTP on amssq56 is OK: NTP OK: Offset 0.005195617676 secs [18:53:13] RECOVERY - NTP on amssq54 is OK: NTP OK: Offset -0.01426458359 secs [18:53:37] (03PS1) 10Akosiaris: Enable elastic search plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 [18:54:53] RECOVERY - Puppet freshness on amssq58 is OK: puppet ran at Wed Oct 9 18:54:46 UTC 2013 [18:56:14] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:56:53] RECOVERY - Puppet freshness on amssq59 is OK: puppet ran at Wed Oct 9 18:56:44 UTC 2013 [18:57:12] (03CR) 10Akosiaris: "Should be in module or role ? I 'd normally choose the role class as this means the module can be kept rather non WMF-specific, but I saw " [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 (owner: 10Akosiaris) [18:58:01] hey akosiaris or Ryan_Lane [18:58:06] I need to have an account on tin, right? [18:58:13] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [18:58:13] you don't ? 
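For reference, wiring up analytics/kraken here follows the same moving parts as the elasticsearch plugins: a deployment::target on every host that should receive the checkout (its name becomes a salt grain, hence the earlier no-slash/hyphen discussion), plus the repo entry on the deployment server, with the checkout landing under /srv/deployment on both sides. A rough sketch; apart from deployment::target itself, the names and paths are assumptions:

    # on the role class of the target hosts
    deployment::target { 'analytics-kraken': }

    # on tin the repo is checked out at /srv/deployment/analytics/kraken,
    # and, going by Ryan_Lane's correction above for elasticsearch/plugins,
    # the (temporary) sync hook directory would presumably be
    #   /var/lib/git-deploy/hooks/sync/analytics/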
[18:58:18] should I add myself to the admins::mortals class, or should I already be in admins::roots [18:58:19] no [18:58:51] well i did everything as root but i get the feeling ryan would prefer it if you did it with your account [18:59:09] yeah, i got an error [18:59:16] as root [18:59:16] # git deploy start [18:59:16] # FATAL: Your umask is not set properly; got 0022 instead of 0002 [18:59:38] hmm, also, i think i like your symlink changesset better than what I did, not sure though [18:59:45] i just set the deployment location to /srv/analytics [18:59:53] RECOVERY - Puppet freshness on amssq60 is OK: puppet ran at Wed Oct 9 18:59:48 UTC 2013 [18:59:56] (03PS1) 10Andrew Bogott: Removed mysql server from the stats cruncher role. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88790 [18:59:56] and then symlinked that on tin to /srv/deployment/analytics, where I actually cloned the repo [19:00:06] ottomata, drdee: ^ [19:00:16] akosiaris: would it be better if i left /srv/deployment/analytics intact and then puppetized a symlink on the analytics nodes? [19:00:44] (03CR) 10Ottomata: [C: 032] Removed mysql server from the stats cruncher role. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88790 (owner: 10Andrew Bogott) [19:00:50] andrewbogott: why? [19:01:04] we need mysql on that box [19:01:09] its ok drdee, mysql is not being used there and this does not actually remove mysql server [19:01:10] ottomata: well it might be better from a deployment point of view. Everything under a directory [19:01:13] RECOVERY - NTP on amssq55 is OK: NTP OK: Offset -0.009818077087 secs [19:01:15] drdee_: OK… we just had a lengthy discussion about this [19:01:20] which I thought you were a part of? [19:01:31] we are about to use it for the page view api [19:01:41] well, on tin I actually cloned in side of /srv/deployment [19:01:44] but it's really about fs hygiene... I ain't gonna force it on you :-) [19:01:46] ottomata: "this does not actually remove mysql server" it doesn't? [19:01:52] i just made the configs point at /srv/analytics, so that it would be deployed there [19:01:52] but hm [19:01:56] oky missed that [19:01:57] sorry [19:02:01] no, deletting a class doesn't acutally do anything [19:02:05] you'd have to ensure => absent [19:02:22] (03CR) 10Manybubbles: "I'd go with the role. I think the only WMF specific stuff we have in the module is the java thing." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 (owner: 10Akosiaris) [19:02:28] Well, no, but… if you're using it it should stay in puppet [19:02:44] Otherwise I'd be inclined to remove those packages by hand after the patch merges. [19:03:22] drdee_, i'm pretty sure we aren't using it [19:03:37] there aren't any databases there other than declerambaul and erosen [19:03:40] but we are about to start using it ;) [19:03:53] for what? [19:03:57] pageview api [19:04:06] naw, let's put it somewhere else, not in pmtpa, right? [19:04:15] I would hope that anything you're about to start doing would happen in eqiad [19:04:20] ottomata: everything should always be under /srv/deployment [19:04:23] PROBLEM - Puppet freshness on amssq61 is CRITICAL: No successful Puppet run in the last 10 hours [19:04:35] definitely not /srv/analytics [19:04:36] :) [19:04:43] i'd like it deployed there though [19:04:48] ottomata: sure let's talk about it later [19:04:49] too bad? 
:) [19:04:52] seems weird to deploy software to /srv/deployment [19:04:58] i mean, i can symlink it on the analytics nodes [19:05:08] so that's fine with me [19:05:10] just seems weird [19:05:16] i understand keeping it on tin like that [19:05:18] don't do symlinks [19:05:23] be consistent [19:05:25] but seems weird to force the deployment to a particular directory [19:05:38] it's actually done on purpose [19:05:46] that way you know where the software is on both sides [19:05:46] always [19:05:51] its ok if I make a symlink on the analytics nodes to /srv/deployment analytics, no? [19:05:54] without needing to search for anything, or documentation [19:06:00] that way tin and deployment all is exactly the same [19:06:00] no. please don't [19:06:03] ? [19:06:14] be consistent with how everything else is [19:06:16] not sure why, i'm saying tin and deployment all works the same [19:06:27] (03PS2) 10Akosiaris: Enable elastic search plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 [19:06:53] I'm not sure how the symlink helps [19:06:53] it just makes things confusing [19:07:09] hm, it makes things not tied to a specific deployment system [19:07:35] akosiaris: just did a symlink for elasticsearch, so that the plugins could be used [19:07:53] don't bring me into this :P [19:07:57] heh [19:08:05] haha, that's where my idea came from! [19:08:10] ottomata: if you don't need a symlink you shouldn't use one :) [19:08:26] if you can configure your software to point at specific locations, that's what you should do [19:08:42] if you can't, which I'd imagine is the case with the plugins, then you can use a symlink [19:08:43] i can, and will if you insist ,but I don't like it. why shoudl I have to configure software FOR git-deploy? [19:09:03] RECOVERY - Puppet freshness on amssq61 is OK: puppet ran at Wed Oct 9 19:08:56 UTC 2013 [19:09:07] git-deploy should be configured for my software :p [19:09:21] it's meant for the sanity of everyone else [19:09:23] Ryan_Lane: actually it does ... [19:09:27] silly me... 
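The symlink akosiaris mentions for the plugins is the case Ryan_Lane allows for: software that cannot simply be configured to look in /srv/deployment. Something along these lines would do it; the elasticsearch plugin directory path is an assumption based on the Debian package layout:

    # point elasticsearch's plugin directory at the git-deploy checkout
    file { '/usr/share/elasticsearch/plugins':
        ensure => link,
        target => '/srv/deployment/elasticsearch/plugins',
        force  => true,    # replace the packaged directory if one exists
    }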
[19:09:45] ottomata: gonna drop the change that caused all that :-/ [19:09:59] this would be like asking to code a website FOR a particular webserver [19:10:06] ottomata: I can make git-deploy deploy anywhere on the minions, but then people trying to debug stuff need to go fishing [19:10:13] RECOVERY - NTP on amssq57 is OK: NTP OK: Offset -0.01427590847 secs [19:10:28] I made this decision due to the current deployment system and trying to track down shit on the mediawiki systems [19:11:04] yeah i guess so, i just don't like coupling systems together if I can help it [19:11:11] to really figure out what's happening on the mw system you need to go fishing [19:11:16] it takes a while and it's silly [19:11:32] you aren't coupling them [19:11:42] your applications support being able to be deployed anywhere [19:12:01] yeah, in this case its not so bad, just an extra directory on a system for not apparent reason [19:12:04] i'll change it [19:12:07] just had to object :p [19:12:14] RECOVERY - Puppet freshness on amssq62 is OK: puppet ran at Wed Oct 9 19:12:03 UTC 2013 [19:12:23] but there is a good reason ;) [19:12:45] it's for every person that needs to debug your software when you aren't around [19:12:46] also, real quick, shoudl the pmtpa urls use $deploy_server_pmtpa [19:12:49] I know that points at tin [19:12:50] anyway [19:12:52] yeah [19:13:02] but the pmtpa config has both [19:13:15] both deploy_server_pmtpa and deploy_server_eqiad [19:13:19] notice the comments that go with those [19:13:24] ok [19:13:28] you can use eqiad, if you comment it [19:13:30] ah ok, so analytics is eqiad only as well [19:13:35] ok [19:13:53] some of this config will go away anyway [19:14:11] I've been pushing in changes to make this easier to configure the past few weeks [19:14:15] oh cool [19:14:23] yeah there is a lot of manual repetition here that seems like it could be abstracted [19:14:24] cool [19:14:39] maybe I can be convinced not to have the same target location on the minions, but I doubt it [19:14:46] haha [19:14:48] yeah! [19:14:49] ok! [19:14:54] Ok, ottomata, drdee_ -- what's the story? Can I merge that patch and purge mysql from stat1, or should I drop it? [19:15:23] we would need a misq machine a la stat1 in eqiad [19:15:23] let's see: its ugly and it forces all users into a convention that might have nothing to do with their software, i.e somethign that is not mediawiki related? [19:15:37] drdee_, we can use one of the analytics machiens, an27 or an26 [19:15:43] RECOVERY - NTP on amssq58 is OK: NTP OK: Offset -0.02078127861 secs [19:15:53] how much diskspace do those have? [19:16:29] drdee_, ottomata, Also, that new eqiad system would install mysql using the mysql module rather than the current way. [19:16:30] drdee, 2 x 1TB disks [19:16:35] yes [19:17:19] mmmm we could do it i worry about it though but let's take it offline -- also talk with dan and qchris about this [19:17:33] RECOVERY - NTP on amssq59 is OK: NTP OK: Offset -0.01965594292 secs [19:17:42] ottomata: yeah, that's a negative, but it's also immediately obvious where the software is going to be on the minion and that it's a free location to use, which makes the system less error prone as well [19:18:01] what happens when you specify an existing directory? [19:18:12] drdee, what's your gerrit username? [19:18:18] that's your own fault? :p [19:18:19] 'Diederik' [19:18:32] andrewbogott: I say purge away [19:18:39] Oh, it's you! 
I had no idea :) [19:18:46] (03PS1) 10Ottomata: Pointing analytics/kraken deployment to /srv/deployment/analytics/kraken [operations/puppet] - 10https://gerrit.wikimedia.org/r/88794 [19:19:11] ottomata: so far you're the only one with a really strong objection :) [19:20:18] it's my aversion to inelegance :p [19:20:28] i'm not going to fight it so hard…you should let me make a symlink though :D [19:20:45] just on the actual minions (? correct term here) [19:20:53] RECOVERY - NTP on amssq60 is OK: NTP OK: Offset -0.01466119289 secs [19:20:53] it'd be transparent to the deployment system [19:20:59] buuut whaaateevs [19:21:14] I'm fine with symlinks, assuming that it's necessary [19:21:34] if it isn't necessary, it's just an inconsistency to deal with [19:21:55] I'm going to remove the symlink you added on tin [19:22:30] (03CR) 10Manybubbles: [C: 031] Enable elastic search plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 (owner: 10Akosiaris) [19:22:53] (03PS3) 10Akosiaris: Enable elastic search plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 [19:22:59] akosiaris, manybubbles: when the plugins change, does the service need to be restarted? [19:23:00] manybubbles: a bit too fast... [19:23:03] that's fine Ryan_Lane, i was about to remove that too [19:23:29] manybubbles: look at PS3 please... it is changing a couple of things to make it more configurable and robust [19:23:44] Ryan_Lane: that depends. The site plugins shouldn't need it but others will [19:23:55] Ryan_Lane: but we want to manually perform all ES restarts anyway [19:24:05] ok [19:24:34] (03CR) 10Ottomata: [C: 032 V: 032] Pointing analytics/kraken deployment to /srv/deployment/analytics/kraken [operations/puppet] - 10https://gerrit.wikimedia.org/r/88794 (owner: 10Ottomata) [19:24:57] (03CR) 10Manybubbles: [C: 031] "Even better." [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 (owner: 10Akosiaris) [19:25:04] (03PS1) 10Ottomata: Moving my account to admins::roots [operations/puppet] - 10https://gerrit.wikimedia.org/r/88797 [19:25:10] RobH: ^ [19:25:12] would you review? [19:25:47] yep [19:25:56] you want a +1 and you merge and handle or a +2? [19:26:23] Ryan_Lane: it juggles shards when a node goes down and has to restore them when the node comes back which all takes some time. [19:27:02] manybubbles: sounds fine to me [19:27:13] I was just wondering because for some services, we have automatic restarts on deploy [19:27:31] I was going to add a more robust way of handling that at some point [19:28:11] manually works, though [19:29:13] RECOVERY - NTP on amssq61 is OK: NTP OK: Offset -0.01825797558 secs [19:31:58] (03CR) 10Ryan Lane: [C: 032] Remove debhelpers for apparmor from labs image [operations/puppet] - 10https://gerrit.wikimedia.org/r/88648 (owner: 10Ryan Lane) [19:33:07] (03CR) 10RobH: [C: 032] Moving my account to admins::roots [operations/puppet] - 10https://gerrit.wikimedia.org/r/88797 (owner: 10Ottomata) [19:33:33] RECOVERY - NTP on amssq62 is OK: NTP OK: Offset -0.01911830902 secs [19:35:13] (03CR) 10Akosiaris: [C: 032] Enable elastic search plugins [operations/puppet] - 10https://gerrit.wikimedia.org/r/88788 (owner: 10Akosiaris) [19:35:28] Is RT duty Monday-Sunday or Saturday-Friday, or… ? [19:35:39] Ryan_Lane: at some point we can turn on a better mechanism than manually, but I'm not sure we know enough of the pain points (of elasticsearch) yet to design one.
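For context on the "juggles shards" point above: a manual elasticsearch restart is usually wrapped in a pause of shard allocation so the cluster doesn't start rebalancing the moment the node drops out. A rough sketch of that dance (the setting name here is the 0.90-era one; it varies by ES version, so treat this as illustrative):

    # pause shard reallocation before a planned restart
    curl -XPUT localhost:9200/_cluster/settings -d '
      {"transient": {"cluster.routing.allocation.disable_allocation": true}}'
    sudo service elasticsearch restart
    # wait for the node to rejoin, then let allocation resume and watch recovery
    curl -XPUT localhost:9200/_cluster/settings -d '
      {"transient": {"cluster.routing.allocation.disable_allocation": false}}'
    curl 'localhost:9200/_cluster/health?pretty'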
[19:36:17] andrewbogott: monday to sunday [19:36:21] well, monday to friday [19:36:28] no one expects you to work extra days when on duty. [19:36:41] RobH: Ah, and this coming Monday which is a US holiday? [19:36:49] (That being the essence of my question :) ) [19:37:11] well, if the person on duty isn't taking holiday they take over, otherwise I'd say that there is simply no one triaging that day [19:37:21] (03CR) 10Ryan Lane: "It's difficult, but I'm also looking at ways to do this right now as well. The biggest issue is that the system is distributed, so parts o" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86762 (owner: 10Ryan Lane) [19:37:25] and that person got lucky that week with a day less triaging! [19:37:30] ksnider: ^ [19:37:51] Just because mutante and I just come up with RT triage ideas doesn't make them fact, even if that's been the pattern though ;] [19:37:58] RobH: OK, sounds like either way if I want to spend Monday out of doors that's more-or-less acceptable. [19:38:09] yea if you take over next week [19:38:15] i'd say monday simply has no triage done [19:38:17] and that's ok. [19:38:20] Today is the most beautiful day ever, although no doubt it will be sleeting by the weekend :( [19:38:38] i have my car legally tagged for use now [19:38:43] it better be damned nice this weekend. [19:38:51] I want to leave the city damn it [19:38:59] manybubbles: yeah, I'm not worried about it for now [19:39:03] * RobH is gonna go be a hermit in the woods [19:39:16] Well, in SF the chances of a snowstorm are much less [19:39:18] manybubbles: we'll make it automated later, when we figure out how [19:40:46] robh you're in luck http://www.weather.com/weather/tenday/USCA0987 [19:40:55] RobH: I'm going to a 70 acre plot this weekend in Mendocino county, no cell service even. it'll be great. (helping the owner do apple/grape picking/preserving). it'll be great :) [19:43:59] sounds nice, feel free to bring back resulting work for sharing with immediate desk neighbors [19:45:50] ^demon: https://gerrit-review.googlesource.com/#/c/50250/ [19:46:04] ^demon: openstack people are trying to get the work in progress feature merged [19:46:12] and wondered if we'd throw our support in [19:46:18] I'm pro! [19:46:37] same. I think we decided against using this feature because it wasn't upstreamed [19:46:46] we -1 changes right now, which sucks [19:47:08] <^demon> Ryan_Lane: I saw the discussion on repo-discuss for it. [19:47:22] <^demon> It's basically drafts, but you can toggle to/from draft status. [19:47:41] and it doesn't fuck with the history, like drafts do [19:47:43] Do we still have some production systems running Lucid? There sure is a lot of versioncmp($::lsbdistrelease, "12.04") >= 0 in the code I'm looking at... [19:47:48] drafts are a worthless feature [19:48:04] andrewbogott: we have machines running hardy... [19:48:04] I think this should really just replace drafts [19:48:12] akosiaris: dang [19:48:18] :-( [19:49:30] RobH: :) [19:55:46] ^demon: I owe you a package, right? [19:55:48] <_david_> ^demon, would you mean voting to that change? [19:56:19] ^demon: can you add me as a reviewer to your change? [19:56:39] <^demon> Ryan_Lane: I already did ;-) [19:56:40] oh, you did [19:57:38] <^demon> _david_: Come again? [19:57:59] <_david_> ^demon, can't parse that statement [19:58:19] <^demon> I couldn't parse yours either. Could you please rephrase?
:) [19:58:33] RECOVERY - Puppet freshness on amssq48 is OK: puppet ran at Wed Oct 9 19:58:28 UTC 2013 [19:59:15] <_david_> ^demon, ;-) [19:59:31] <_david_> ^demon, OpenStack project needs that change upstream in gerrit: https://gerrit-review.googlesource.com/#/c/50250 [19:59:50] <^demon> Yeah, I saw the discussion on repo-discuss. [20:00:32] <_david_> ^demon, i wonder if you guys can help, because i happen to know how important your opinion to gerrit's maintainer is [20:01:27] <_david_> you mean that thread: https://groups.google.com/forum/#!topic/repo-discuss/jXCL-rc9Dro [20:02:14] <^demon> Ah no, I meant "[Announce] Work In Progress plugin for Gerrit" from about 2 hours ago [20:02:25] COool Ryan_Lane, it worked! [20:02:27] i'm all deployed! [20:02:28] thank you! [20:02:34] <_david_> well, that one has very big history ;-) [20:03:03] <_david_> https://gerrit-review.googlesource.com/#/c/36091/ [20:03:05] ottomata: great :) [20:03:14] it's nice to not need to do anything for new repos :) [20:03:18] <_david_> and that one above is year old ;-) [20:03:27] work was well worth the effort in that regard [20:05:13] RECOVERY - NTP on amssq48 is OK: NTP OK: Offset -0.09636306763 secs [20:07:14] so, Ryan_Lane, maybe you can advise me here on this too [20:07:25] so that was the kraken repo, which has a lot of useful scripts and such that we will need [20:07:31] but, also, we need to build and deploy jars to hadoop [20:07:43] drdee and I are thinking of using jenkins to build the jars (as that is mostly already set up) [20:07:54] and then somehow using git deploy and/or hooks to get the jars over to analytics nodes and into hdfs [20:08:05] does that make sense or is there a better idea? [20:08:24] hm [20:08:30] hashar ^ you might have thoughts too [20:08:49] I'd prefer to have a non-git deployment method for that [20:08:59] we're going to be adding non-git deployment methods to the system soon [20:09:04] hm, well we need to to versioning of the jars too [20:09:08] yeah [20:09:17] the jars don't have to be in git [20:09:26] storing binaries in git isn't very efficient [20:09:30] we will figure out how to configure the builds to do the versioning [20:09:44] but it would be nice if it somehow synced up with git deploy [20:09:50] so that we could make sure versions of things match [20:09:51] not sure [20:10:10] Well… all this logic is to ensure that mysql5.5 is installed on >= Precise, and 5.1 on < Precise. But wouldn't apt automatically do that anyway? [20:11:36] Ryan_Lane: we might be able to configure jenkins to do the jar deployment/syncning [20:11:52] hm. [20:13:01] we should start a wiki page with ideas on how to do this [20:13:12] this is a different problem than we have with other repos [20:13:30] ottomata: hey :-] [20:13:51] ottomata: don't you have an effort to debianize your jar ? [20:13:55] no [20:14:03] that would be really annoying, this is for analysis code [20:14:04] ottomata: as for building jar in Jenkins, Chad did it for Gerrit and Buck iirc [20:14:07] more like an app that changes often [20:14:15] cool, yeah hashar we have it building jars [20:14:16] ottomata: one thought that I'd have is to have jenkins build on tags [20:14:20] we just need them synced to actual nodes [20:14:28] I see there is an ArtifactDeployer plugin [20:14:31] yeah totally [20:14:32] then have the deployment system deploy that tag [20:14:41] git deploy you mean? [20:14:49] or just jenkins? 
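Assuming the kraken pom's distributionManagement already points at the artifact repository, the "have jenkins build on tags" idea above boils down to a job that checks out a tag and runs a standard deploy goal; roughly (tag name hypothetical):

    # build from a tag, never from a moving branch
    git checkout kraken-0.2.1
    # compile, test, and upload the jars to the repository named in <distributionManagement>
    mvn clean deploy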
[20:15:01] ottomata: in the usual enterprise, Jenkins would build the jar and deploy it to a maven repo then you could fetch from that repo to update your software [20:15:03] this is the exact reason i don't like calling the system git-deploy :) [20:15:08] ottomata: at least that is how i understand java deployment. [20:15:25] yes we could publish the jar to nexus [20:15:49] Ryan_Lane: feel free to rename https://wikitech.wikimedia.org/wiki/Git-deploy to Sartoris [20:15:59] does nexus allow multiple versions of something to exist in it? [20:16:06] Ryan_Lane: I should have asked you before doing the rename. I thought earlier that Sartoris was abandoned. [20:16:22] hashar: sartoris isn't abandoned, I just haven't switched to it yet [20:16:38] I think faulkner still pushes in changes fairly often [20:16:57] Ryan_Lane: yes he does. I even have 3 patches pending there :] [20:17:05] Ryan_Lane: yes -- you can use maven to manage your versioning of the jars [20:17:15] yeah, nexus is like apt for jars [20:17:27] there are other things though too, I think manybubbles likes artifactory or something [20:17:40] don't fucking care [20:17:43] hah [20:17:45] so, we should figure out which one we want to use [20:17:49] Ryan_Lane: renamed back to Sartoris. Sorry :] [20:18:00] have jenkins automatically push into it [20:18:04] hm [20:18:21] I am not comfortable having Jenkins pushing jars to a repo from which we deploy in production [20:18:30] * Ryan_Lane nods [20:18:36] the deployment system can do that, then [20:18:36] at least not with the CI jenkins [20:18:36] artifactory had a nice feature in that you could convince it to pull only a whitelist of upstream jars and IIRC you could whitelist by name and hash [20:18:52] ok, here's a possible stumbling block.... [20:18:53] but we can probably set up another secured / internal Jenkins to do that. [20:19:00] does this allow two-phase deployment? [20:19:11] and still build the jar on the CI jenkins for testing out in labs. [20:19:13] RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Wed Oct 9 20:19:10 UTC 2013 [20:19:21] hashar: most folks do exactly that. they manually trigger release builds on jenkins though [20:19:23] it's not a hard requirement, but it sucks to not have it [20:19:42] manybubbles: Chad has a concept of 'promoting' builds [20:19:53] it's not necessarily an issue to inject things into the repo [20:19:57] even untrusted things [20:20:05] manybubbles: not sure what it does, but as I understand it that lets you manually tag a build as stable/ok for release. [20:20:05] assuming we explicitly tell the system what to use [20:20:28] hashar, in jenkins you mean? [20:20:36] ^demon ^^^ [20:20:50] hashar: in maven parlance you'd be turning a build from a SNAPSHOT to a release. it isn't normal. normally the release is its own thing. but that is just maven, I suppose. [20:21:54] what most folks do is have nexus host two repositories - one for SNAPSHOTs and one for releases. [20:22:14] and jenkins pushes into the SNAPSHOTs constantly. [20:22:23] and developers use the snapshot dependencies. [20:22:55] then, when you want to perform a release you freeze all the snapshots in your project, commit that, and have jenkins build that commit hash. [20:23:19] it tags it and all that happy junk and pushes the released artifact to the nexus release repository.
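The freeze-then-release flow just described is what the standard maven-release-plugin automates; a sketch of the usual two goals (generic maven practice, not necessarily what the kraken build does today):

    # fails if any dependency is still a SNAPSHOT, bumps x.y.z-SNAPSHOT to x.y.z, tags the commit
    mvn release:prepare
    # checks out that tag, rebuilds it, and deploys the released artifacts to the release repository
    mvn release:perform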
[23:23:30] normally only jenkins has the permission to do that [23:23:46] * hashar happily delegates to manybubbles who has clues about Java/Jenkins :-] [20:24:00] promoting a build has the issue that if it was built with SNAPSHOT dependencies then you'll never get that snapshot back. [20:24:20] because SNAPSHOTs aren't for that. they are for keeping up with the latest stuff. [20:24:31] that is why you have to freeze all your dependencies before you release. [20:24:41] it has its own internally consistent logic. [20:24:47] <^demon> Huh? [20:24:52] <^demon> ottomata: wha? [20:24:58] ok... so... [20:25:15] which may be crazy and doesn't come close to matching the debian packaging logic but they remind me of one another in their militancy [20:25:26] now I'll stop flooding the channel [20:25:43] manybubbles: it wasn't flooding, it gives a good idea of how things normally work :) [20:25:55] what we need to do is to map this into the deployment system [20:25:58] yeah very helpful [20:26:19] ^demon, we were talking about how to deploy/release jars and you have some experience? [20:26:21] normally that deployment build would build a deployable artifact [20:26:25] and we can modify the deployment system via config options to handle it better, if necessary [20:26:29] or something you'd be able to turn into a deployable artifact [20:26:45] but the latter is preferable. [20:27:06] right now the deployment system assumes it has a git repository that triggers a deployment based on a specific tag [20:27:11] qchris: are you around? do you have experience doing stuff like this too? [20:27:14] elasticsearch builds a deb package as part of its build process. it isn't compliant (to debian standards) but it follows the java model really well. [20:27:26] I am around, yes. [20:27:38] Let me read what you've been discussing. [20:27:40] it writes out a configuration file that the minions use to know which tag they are targeting [20:27:52] done much with CI / jar release deployment / maven/nexus/artifactory/whatever repos before? [20:28:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [20:28:13] Ryan_Lane: could the minions wget jars from the nexus release repository? [20:28:24] yes [20:28:27] not sure it's a great idea, but it is simple [20:28:48] it isn't totally necessary to have a two-phase deploy [20:29:08] it's preferable, though [20:29:13] RECOVERY - NTP on amssq52 is OK: NTP OK: Offset 0.007792949677 secs [20:29:23] manybubbles: how are nexus repos normally used? [20:30:04] Ryan_Lane: can you ask the question with other words? I think I've spewed a lot about how they are normally used but I've not covered your question. [20:30:22] ottomata: and it would be good to drop something on the wiki for other people to comment/read about :-] [20:30:33] they could also wget jars from jenkins [20:30:42] manybubbles: how does a client use a nexus repo? [20:30:50] for installation of something from the repo [20:30:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 24.012 second response time [20:30:58] https://integration.wikimedia.org/ci/view/Analytics/job/Kraken/21/org.wikimedia.analytics.kraken$kraken-generic/ [20:31:13] ideally we'd make the deployment system try to just be a wrapper around a normal process [20:31:19] ottomata: normally the release process actually pushes the jar to nexus from jenkins. it is normally built into maven.
[20:31:34] for git it just wraps a normal git process [20:31:50] yeah that would be better, pushing to nexus or whatever [20:32:04] Ryan_Lane, from what I've seen of nexus/maven [20:32:13] maven can be configured to point at nexus/artifactory or whatever [20:32:23] and pom.xml files inside of java projects specify dependencies [20:32:39] those deps are then downloaded from the nexus repo and manually managed inside of a ~/.m2 directory [20:32:51] but i'm not sure how one would manually download a jar from nexus normally [20:32:56] it's not like you do [20:32:58] maven install kraken [20:33:19] manybubbles: yeah, that is a good question [20:33:21] Ryan_Lane: normally you'd have the project build a single "deployable" artifact by squashing its dependencies into one jar or war or zip or tar [20:33:23] even if we get these jars into nexus [20:33:24] then what? [20:33:24] well, like I said, I'd prefer the deployment system just be a wrapper for a normal process [20:33:40] manybubbles: how do you actually deploy it, though? [20:33:44] maven install ? [20:33:49] or deb or rpm like elasticsearch does [20:34:02] Ryan_Lane: most people don't put maven on the production systems. [20:34:02] I'm fine using maven, assuming we can sanely wrap it [20:34:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [20:34:14] so, my use is a little extra complicated [20:34:17] Ryan_Lane: it doesn't do that well [20:34:20] ok [20:34:21] the jars need to be deployed to the analytics nodes, as well as into hdfs [20:34:28] which means extra fancy scripting to put them there [20:34:46] ottomata: the deployment system will let you run any custom code you need after the checkout phase [20:34:54] cool yeah i saw that [20:34:56] * hashar looks for ./fancy.sh [20:34:57] honestly it becomes a free for all once you get the "deployable" artifact into nexus [20:35:03] ok [20:35:06] we could use wget [20:35:07] that's why I was thinking of making it just dl the jars from jenkins (or nexus) [20:35:11] i wish it were nicer and more standard [20:35:14] or curl or whatever [20:35:25] that works well with the two-phase as well [20:35:37] ok, so two parts down [20:35:40] Ryan_Lane: i'm not sure i know what you mean by two-phase? [20:35:46] fetch, checkout [20:35:49] aha ha [20:35:49] ah [20:35:50] right ok [20:35:54] so... [20:35:57] jenkins -> repo [20:36:10] minions fetch from repo [20:36:19] now for the part of actually deploying.... [20:36:29] what do we do on tin? :) [20:36:44] (repo here is e.g. nexus?) [20:36:47] yeah [20:36:52] the repo should likely live on tin [20:37:01] Ryan_Lane: not normally [20:37:14] manybubbles: ? [20:37:21] keep in mind that it will also hold the snapshots [20:37:43] which could be updated frequently [20:37:46] * Ryan_Lane nods [20:37:56] maybe we'd have some form of repo cache there, then? [20:37:57] tin should fetch from the repo, but the repo is generally its own beast [20:38:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 28.661 second response time [20:38:25] the repo is, after all, a pretty big java web app [20:38:26] so, would tin fetch the snapshots and assemble them?
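The "extra fancy scripting" for the analytics case could be as small as a post-checkout hook on each node that pulls the released jar and copies it into HDFS. A sketch using the kraken-generic artifact from the Jenkins link above (the repository host, URL layout and HDFS path are illustrative, not anything that exists yet):

    VERSION=0.2.1
    JAR="kraken-generic-${VERSION}.jar"
    curl -fo "/srv/deployment/analytics/kraken/artifacts/${JAR}" \
        "http://nexus.wmflabs.org/releases/org/wikimedia/analytics/kraken/kraken-generic/${VERSION}/${JAR}"
    # released jars are immutable, so a plain put is enough
    hadoop fs -put "/srv/deployment/analytics/kraken/artifacts/${JAR}" "/libs/kraken/${JAR}"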
[20:38:34] it'd fetch the relases [20:38:35] releases [20:38:41] snapshots are a developer only thing [20:38:46] I see [20:38:53] ops stays away from anything with -SNAPSHOT in it [20:38:57] indeed [20:39:01] because it is tainted by randomness [20:39:09] as tomorrow you couldn't rebuild it exactly [20:39:36] so, maybe we could have a config file managed in a git repo on tin? [20:39:37] manybubbles: we could host a repo on tin that mirrors snapshots from nexus.wmflabs.org? [20:39:39] sorry [20:39:39] it'd make sense to have the minions download the released artifacts (jars, whatever) right from the repo [20:39:41] not snapshots [20:39:42] releases [20:39:43] that has all the hashes necessary [20:40:04] that would work too [20:40:12] Ryan_Lane: if it is from a source you trust all you need is name and version [20:40:27] as part of the deploy phase? [20:40:42] deploy kraken-0.2.x [20:40:44] whatever? [20:40:48] config file on tin inside a git repository that is synced to the minions [20:40:50] then that knows how to get the version from nexus? [20:41:08] maybe the repo on tin would just need to write a snapshot version? [20:41:17] curl nexus.wmfnet/releases/org/wmf/WHATEVER/9.0 [20:41:28] sorry, curl nexus.wmfnet/releases/org/wmf/WHATEVER/9.0/WHATEVER-9.0.jar [20:41:31] or maybe it could just host the configuration file? and the config file would be versioned? [20:42:06] Ryan_Lane: that. keep a config file with the name and version of what you want the minions to fetch from nexus [20:42:09] what's the config file? just the version to deploy? [20:42:16] yes [20:42:23] hmmmm [20:42:35] because nexus keeps released artifacts forever [20:42:48] well, until someone manually smashes them [20:42:58] then to rollback you'd just rollback to the older version [20:43:02] of the config gile [20:43:03] *file [20:43:06] yeah. [20:43:12] hmmmm [20:43:15] I think this sounds like a sane plan [20:43:16] so [20:43:27] we might want multiple versions of jars deployed at once... [20:43:39] ottomata: please describe [20:43:49] since analysis scripts depend on them, and it is risky to change the underlying code without verification [20:43:53] as reported numbers will change [20:43:55] for example [20:44:14] one of the kraken jars has logic to classify a webrequest line as a real pageview [20:44:34] the hadoop jobs that run load up the (versioned?) jar and then output numbers [20:44:47] there will be lots of jobs [20:45:00] wouldn't the versioned jars be in the release bundle? [20:45:01] and we probably won't want to automatically change the underlying logic for all jobs at once [20:45:03] Yes. We definitely have that. We'll need to have kraken-pig-0.1.0.jar and kraken-pig-0.1.1.jar both deployed on a machine. [20:45:48] I'm not sure how java releases are usually done, but I think we were thinking [20:45:55] the version is in the source [20:45:57] inside the maven poms [20:45:58] I'm not familiar enough with hadoop to know: how do you declare which of those jars you want to depend on in your script? [20:46:03] In that example kraken-pig is like the release. [20:46:03] the jar names get built based on that [20:46:15] manybubbles: Importing the jar in the classpath. 
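Putting the "config file with the name and version" idea into concrete terms: the fetch step could read a small list committed on tin and pull exactly those releases, so a rollback is nothing more than reverting the commit that bumped a version. A sketch (file name, format and URL layout are hypothetical):

    # artifacts.list holds one "<artifact> <version>" pair per line, e.g.
    #   kraken-generic 0.2.1
    #   kraken-pig 0.1.1
    while read artifact version; do
        curl -fo "${artifact}-${version}.jar" \
            "http://nexus.wmflabs.org/releases/org/wikimedia/analytics/kraken/${artifact}/${version}/${artifact}-${version}.jar"
    done < artifacts.list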
[20:46:21] if we released something we'd want to keep all previous released jars and 'symlink' the latest to a versionless name [20:46:35] manybubbles: just a file path [20:46:41] so, with version if you need to be specific [20:46:48] we can't actually do symlinks in hadoop [20:46:54] so we were thinking of just copying the latest version to a versionless name [20:47:03] and if you wanted your script to always use whatever latest version [20:47:12] you'd just load the versionless jar, rather than the versioned one [20:47:21] ottomata: fine by me [20:47:30] I prefer a symlink to a copy, but whatever [20:47:34] but, for lots of production analysis, changing the underlying logic has to be done very intentionally [20:48:08] your fancy deployment script could ensure that all listed jars are there and a particular one is the "default" unversioned one [20:48:10] make the config file a yaml file? have it define a hash of things to deploy? [20:48:21] that'd work great [20:48:41] then deploying a new version would just be a matter of adding the release to the config file [20:48:42] and deploying [20:48:47] and the deploy script would keep everything in sync [20:48:48] we're adding a new deployment method. we can do anything we want, for the most part ;) [20:49:43] ottomata: ls /usr/share/java [20:51:55] where? [20:52:08] any ubuntu machine should do [20:52:15] ok, symlinks? [20:53:12] would someone mind writing up a synopsis of our discussion here: https://wikitech.wikimedia.org/wiki/Sartoris/Design#Nexus ? [20:53:15] it's just that the debian maintainers have a surprisingly similar system [20:53:24] to what I'm describing, yeah [20:53:29] Ryan_Lane: can do [20:53:31] i think i got it [20:53:39] they are a bit more cavalier about dependencies than you want to be [20:53:41] but still [20:53:42] I'll follow up as well [20:53:52] but we need a relatively good functional spec to go off of [20:54:20] quick summary to make sure I understand: [20:55:54] - jenkins builds [20:55:54] - jenkins pushes to nexus/artifactory repo [20:55:55] - sartoris supports project config file for jar deployment. config file specifies versions to deploy [20:55:55] - sartoris command to run jar deployment, ... [20:55:58] actually i'm unclear on that last part [20:55:59] then what? [20:56:17] git deploy start [20:56:20] modify config file [20:56:23] git commit [20:56:25] git deploy sync [20:56:33] -> fetch [20:56:48] pulls the jars to the minions [20:56:53] -> checkout [20:56:56] makes them active [20:58:04] the pulling of the jars is done via a hook then? [20:58:15] fetch hook to curl the jars from nexus? [20:58:16] Ryan_Lane: "makes them active" means switching that symlink. nothing else is really required because jars should never change once they are in production. [20:58:21] not a hook, no [20:58:29] but they aren't actually in git, [20:58:36] so, we have a salt module [20:58:41] fetch currently does git calls [20:59:13] but we'll turn that method into something generic that says "what kind of repo is this? It's nexus, so call the nexus fetch function." [20:59:23] same with checkout [20:59:40] hm so we use something other than shared.py? that's how all that happens? [20:59:48] or modify shared.py to be smart [20:59:57] and that just tells salt to do the nexus thing rather than git [20:59:58] ? [21:00:03] oh, sorry.
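The workflow summarised just above, plus the "versionless default" idea, might look like this from the operator's side (the artifacts.list name is carried over from the hypothetical sketch earlier; HDFS has no symlinks, hence the copy):

    # on tin
    git deploy start
    $EDITOR artifacts.list        # add kraken-pig 0.1.1, keep 0.1.0 listed for older jobs
    git commit -am 'deploy kraken-pig 0.1.1'
    git deploy sync               # fetch pulls the listed jars, checkout makes them active

    # on an analytics node, promote the new release to the "default" unversioned name
    hadoop fs -rm /libs/kraken/kraken-pig.jar
    hadoop fs -cp /libs/kraken/kraken-pig-0.1.1.jar /libs/kraken/kraken-pig.jar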
that part is on the frontend [21:00:43] we'll also modify that to understand different repo types [21:01:00] shared currently does some git commands, really those should be in a hook [21:01:24] mostly what shared.py does is call the salt runners (fetch and checkout) [21:01:33] and show reporting info and such [21:01:49] almost all the logic is in the modules on the minions [21:02:37] the runner does basically nothing too. it's mostly just a protection layer [21:03:15] hm, ok i still only half understand (because I don't have a full grasp on sartoris yet), can I write the bit I know and have you fix and fill in the blanks? [21:03:27] yep [21:03:31] that was the plan :) [21:03:34] something high level would be goo [21:03:36] *good [21:03:59] when it comes time to implement, I can walk you through things as well, if you'd like to help [21:04:27] yeah totally [21:04:41] (03CR) 10Dr0ptp4kt: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [21:05:06] ^^awjr you got a minute to google hangout to discuss [21:05:09] ? [21:05:41] dr0ptp4kt: give me a minute to wrap something up and read backscroll [21:06:01] or did you just mean discuss that aprticular patchset/ [21:06:01] awjr, word. i'll be back in about 5 mins [21:06:05] cool :) [21:06:12] awjr, i meant just that patchset. be back in 5 [21:07:38] cool dr0ptp4kt, im done now so just ping me when you're back [21:07:39] ok. lunch :) [21:11:41] Ryan_Lane: https://wikitech.wikimedia.org/wiki/Sartoris/Design#JVM_Application_Deployment [21:12:18] looks good. I'll fill in the rest [21:14:15] <^d> Whee jvm deploys [21:16:00] awjr, back, ready when you are [21:16:11] thanks [21:17:31] ok dr0ptp4kt, ready [21:24:15] (03PS1) 10Ottomata: Adding ironholds to admins::restricted. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88883 [21:24:46] (03CR) 10Ottomata: [C: 032 V: 032] Adding ironholds to admins::restricted. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88883 (owner: 10Ottomata) [21:26:02] (03PS6) 10Dr0ptp4kt: Add an extra header for cache variance of W0 banners for proxies. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 [21:32:02] awjr, thx again [21:32:14] any time dr0ptp4kt [21:46:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [21:47:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 25.760 second response time [21:58:23] PROBLEM - Puppet freshness on cp4001 is CRITICAL: No successful Puppet run in the last 10 hours [21:59:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [21:59:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 24.175 second response time [22:04:52] (03PS1) 10Ori.livneh: Log Puppet run times to StatsD via custom Puppet reporter [operations/puppet] - 10https://gerrit.wikimedia.org/r/88888 [22:05:11] change 88888! [22:06:05] I wonder if I can get 100,000, so that I can have it in gerrit and svn [22:06:46] write a script to submit 11,112 patches [22:08:12] <^d> I should've had svn 100k. [22:08:17] <^d> Ryan's mean and stole it. 
[22:08:53] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:08:56] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:09:23] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 44%, RTA = 86.48 ms [22:09:33] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.47 ms [22:09:35] ok, why'd you kill the tubes ? [22:10:05] there was nothing good on [22:11:22] :) [22:18:23] PROBLEM - Puppet freshness on cp4014 is CRITICAL: No successful Puppet run in the last 10 hours [22:19:23] PROBLEM - Puppet freshness on cp4019 is CRITICAL: No successful Puppet run in the last 10 hours [22:20:23] PROBLEM - Puppet freshness on cp4015 is CRITICAL: No successful Puppet run in the last 10 hours [22:20:23] PROBLEM - Puppet freshness on cp4005 is CRITICAL: No successful Puppet run in the last 10 hours [22:22:23] PROBLEM - Puppet freshness on cp4017 is CRITICAL: No successful Puppet run in the last 10 hours [22:22:23] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: No successful Puppet run in the last 10 hours [22:30:57] (03PS1) 10Lcarr: adding ulsfo into snmp allow list [operations/puppet] - 10https://gerrit.wikimedia.org/r/88891 [22:31:15] (03PS2) 10Lcarr: adding ulsfo into snmp allow list [operations/puppet] - 10https://gerrit.wikimedia.org/r/88891 [22:32:29] Hm, can anyone explain to me again the difference between the 'shell' and 'ops' keywords on our Bugzilla? [22:33:05] https://bugzilla.wikimedia.org/show_bug.cgi?id=54828 needs someone to run a script on a server, would it make it an 'ops' bug? :-) [22:33:36] (03CR) 10Lcarr: [C: 032] adding ulsfo into snmp allow list [operations/puppet] - 10https://gerrit.wikimedia.org/r/88891 (owner: 10Lcarr) [22:33:50] (03CR) 10Lcarr: [V: 032] adding ulsfo into snmp allow list [operations/puppet] - 10https://gerrit.wikimedia.org/r/88891 (owner: 10Lcarr) [22:33:51] twkozlowski: shell means, usually, running something on the WMF cluster (like a maintenance script, or creating a new wiki, ie: things Reedy can do). "ops" is for things like DNS names or SSL certs or... other things that require 'root' privs [22:35:30] Thanks greg-g [22:37:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [22:37:07] Though, our DNS repo is now public... [22:37:33] Reedy: yeah, but actually making it happen needs opsen though, right? 
[22:38:27] Yup [22:38:27] <^d> Yeah [22:38:44] All but the same idea with mediawiki-config [22:38:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 18.228 second response time [22:38:56] * greg-g nods [22:39:17] Hm, I've always taken 'shell' to mean 'this has to be added manually to mediawiki-config', but your answer makes sense greg-g :) [22:39:32] whew [22:42:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 30 seconds [22:43:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 23.440 second response time [22:46:40] Importing Pressbilder_till_Ars_11,_reprofotografi_av_bild_från_resealbum_-_Hallwylska_museet_-_87785.tif.../usr/local/bin/mwscript: line 18: 11173 Segmentation fault php "$MW_COMMON_DIR_USE/multiversion/MWScript.php" "$@" [22:46:45] Consistently segfaulting :( [22:52:23] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: No successful Puppet run in the last 10 hours [22:57:51] Anyone around with root fancy helping me find out what's causing php to segfault on terbium when importing this specific image? [22:58:42] that's an 'ops' keyword request ;) [22:58:55] Or a stroopwaffel one [22:59:00] Depending on how you look at it [22:59:01] mmmmmm [22:59:47] * Reedy downloads the 127M file [23:03:29] I've a slight hunch it's metadata related [23:33:01] Stack overflow is seemingly more and more likely [23:34:40] wtf [23:34:45] Run it under valgrind and it completes [23:35:29] <^d> Maybe we should run it under valgrind on the cluster so it'll complete then ;-) [23:36:06] much more leisurely speed of life [23:36:18] ^d, reminds me how I once downloaded a torrent with Wireshark capturing:P [23:36:34] luckily, I have 16 gigs of RAM [23:37:21] Reedy: hey, I'm sort of around [23:38:01] paravoid: Poking around it seems getting coredumps from PHP running not under apache is more difficult, requiring php to have been compiled using --enable-debug or similar [23:38:30] AaronSchulz: all -public + all -deleted are synced now [23:38:32] Can replicate it locally, so got more access rights there obviously [23:38:43] oh, ok [23:38:54] Though, running it under valgrind it completes fine [23:39:09] how about gdb? [23:39:18] + zbacktrace? [23:39:30] ok [23:39:40] (03PS1) 10Ori.livneh: Tee sampled Special:BannerImpression stream to Hafnium for StatsD-ification [operations/puppet] - 10https://gerrit.wikimedia.org/r/88902 [23:39:42] http://p.defau.lt/?K_Q5SRntXxOZxq6XQq_ebw [23:39:44] I saw GETs were down, figured it finished [23:39:58] AaronSchulz: do you want me to attempt mediawiki-config? 
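On the core-dump point above: a cheaper route than an --enable-debug rebuild, when all you want is a core from CLI PHP, is to raise the core size limit before reproducing (whether and where the dump lands also depends on the kernel's core_pattern; paths here are examples):

    ulimit -c unlimited
    php /path/to/script.php       # reproduce the segfault
    gdb /usr/bin/php core         # dump file name depends on core_pattern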
[23:40:05] (03CR) 10jenkins-bot: [V: 04-1] Tee sampled Special:BannerImpression stream to Hafnium for StatsD-ification [operations/puppet] - 10https://gerrit.wikimedia.org/r/88902 (owner: 10Ori.livneh) [23:41:07] Duh [23:41:18] (03PS2) 10Ori.livneh: Tee sampled Special:BannerImpression stream to Hafnium for StatsD-ification [operations/puppet] - 10https://gerrit.wikimedia.org/r/88902 [23:41:19] paravoid: you can make a patch I guess [23:41:26] (if you want) [23:41:36] I'll fix privatesettings first [23:41:47] https://bugs.php.net/bugs-generating-backtrace.php [23:41:54] or, running from the commandline [23:41:54] gdb /home/user/dev/php-snaps/sapi/cli/php [23:41:54] (gdb) run /path/to/script.php [23:41:54] (gdb) bt [23:41:57] It's badly organised [23:42:29] Ahaaa [23:42:30] Reedy: zbacktrace :) [23:42:47] lots of complaints about debug info mismatch [23:42:48] BUT [23:42:48] Program received signal SIGSEGV, Segmentation fault. [23:42:49] php_ifd_get16u (value=0xfffffffff9f44320, motorola_intel=0) at /build/buildd/php5-5.4.9/ext/exif/exif.c:1095 [23:42:49] 1095 /build/buildd/php5-5.4.9/ext/exif/exif.c: No such file or directory. [23:42:57] [00:03:28] I've a slight hunch it's metadata related [23:43:06] exif, meet metadata [23:43:36] Reedy: "help directory" [23:43:42] and point to the php tree [23:43:47] so you can see the actual code [23:44:45] paravoid: in PS.php yeah [23:46:03] hm, you did the tempurl right [23:46:07] I remember nothing about it [23:48:07] E: Unable to locate package zbacktrace [23:48:09] apt says no [23:48:38] oh.. in gdb [23:50:05] paravoid: http://torgomatic.us/blog/2013/05/08/an-introduction-to-tempurl-in-openstack-swift/ [23:50:26] manually authenticating and posting...or using swift tools if it's a little faster [23:50:47] I saw that [23:50:56] we don't have the middleware in the pipeline though [23:51:27] never did [23:54:43] return (((uchar *)value)[1] << 8) | ((uchar *)value)[0]; [23:55:10] found zbacktrace? [23:55:14] nope... [23:55:18] https://github.com/php/php-src/blob/PHP-5.4.9/ext/exif/exif.c#L1095 [23:55:24] you have to source .gdbinit from the php source tree [23:55:25] iirc [23:55:33] then you have an extra gdb command [23:55:35] z stands for zend [23:55:39] so it's a bit more friendly [23:56:05] I saw a github link for .gdbinit before.. [23:56:07] so basically apt-get source php5, make sure it's the same version as the one you run [23:56:15] tell gdb to look there for the source [23:56:29] has to be the exact same version, since it's line numbers it references [23:56:50] so "directory /tmp/php5-5.4.9/" or whatever [23:57:00] then "source /tmp/php5-5.4.9/.gdbinit" [23:57:05] could be under some subdir, I don't remember [23:57:15] https://github.com/php/php-src/blob/master/.gdbinit [23:57:17] then instead of bt/bt full you just type zbacktrace
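Pulling the gdb recipe from the last few exchanges into one place (paths are examples; the unpacked source tree must match the exact running PHP version, 5.4.9 here, and .gdbinit sits at the top of the php source tree):

    apt-get source php5                        # in a scratch dir, e.g. yielding /tmp/php5-5.4.9
    gdb /usr/bin/php
    (gdb) directory /tmp/php5-5.4.9            # so frames show the real exif.c lines
    (gdb) source /tmp/php5-5.4.9/.gdbinit      # adds the zbacktrace command
    (gdb) run /path/to/script.php              # reproduce the crash
    (gdb) bt                                   # C-level backtrace (where php_ifd_get16u showed up)
    (gdb) zbacktrace                           # PHP-level backtrace of the userland call stack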