[00:03:34] (PS1) Dzahn: openstack - aliging and quoting [operations/puppet] - https://gerrit.wikimedia.org/r/143794
[00:06:17] (PS3) BryanDavis: [WIP] Allow puppetmaster to send logs to logstash [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690)
[00:07:23] (PS2) Dzahn: openstack - aliging and quoting [operations/puppet] - https://gerrit.wikimedia.org/r/143794
[00:16:14] (PS4) Dzahn: install-server-replace generic::systemuser [operations/puppet] - https://gerrit.wikimedia.org/r/138001 (owner: Rush)
[00:17:31] (CR) Dzahn: [C: 2] "PS4: also add system group" [operations/puppet] - https://gerrit.wikimedia.org/r/138001 (owner: Rush)
[00:20:43] (CR) Dzahn: "checked carbon: PASS (noop)" [operations/puppet] - https://gerrit.wikimedia.org/r/138001 (owner: Rush)
[00:24:11] (PS1) Dzahn: facilities (cams) - add system group [operations/puppet] - https://gerrit.wikimedia.org/r/143795
[00:26:20] (PS1) Dzahn: role/logging -set system => true for system group [operations/puppet] - https://gerrit.wikimedia.org/r/143796
[00:29:59] (PS1) Dzahn: misc/statistics - also add system group [operations/puppet] - https://gerrit.wikimedia.org/r/143798
[00:33:22] (PS2) Dzahn: replace deprecated erb template variable syntax [operations/puppet] - https://gerrit.wikimedia.org/r/143526
[00:35:17] (CR) Dzahn: [C: 2] deprecated syntax in icinga checkcommands.cfg.erb [operations/puppet] - https://gerrit.wikimedia.org/r/143527 (owner: Dzahn)
[00:41:29] (CR) Dzahn: [C: 2] role/logging -set system => true for system group [operations/puppet] - https://gerrit.wikimedia.org/r/143796 (owner: Dzahn)
[00:45:16] (CR) Dzahn: "checked erbium: PASS" [operations/puppet] - https://gerrit.wikimedia.org/r/143796 (owner: Dzahn)
[00:47:35] (CR) Dzahn: [C: 2] misc/statistics - also add system group [operations/puppet] - https://gerrit.wikimedia.org/r/143798 (owner: Dzahn)
[00:52:01] (CR) Dzahn: "checked stat1002: PASS" [operations/puppet] - https://gerrit.wikimedia.org/r/143798 (owner: Dzahn)
[00:52:44] (CR) Dzahn: [C: 2] facilities (cams) - add system group [operations/puppet] - https://gerrit.wikimedia.org/r/143795 (owner: Dzahn)
[00:54:00] jouncebot: next
[00:54:00] In 14 hour(s) and 5 minute(s): SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140703T1500)
[00:54:35] wow, we will deploy on week's last day?
[00:55:14] good point, Thursday is Friday
[00:55:25] jouncebot: skip
[00:55:27] :)
[00:55:48] (CR) Dzahn: [C: 2] deployment,replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137993 (owner: Rush)
[00:56:54] AaronSchulz: RBR will come, yes, or likely mixed. and links schema change is not yet complete
[00:57:56] i'm pushing out some partitioning at the same time, on [tp]l_namespace
[01:00:38] (CR) Dzahn: "checked tin: PASS" [operations/puppet] - https://gerrit.wikimedia.org/r/137993 (owner: Rush)
[01:08:24] (CR) Springle: [C: 1] modules/coredb_mysql/ sans systemuser [operations/puppet] - https://gerrit.wikimedia.org/r/137994 (owner: Rush)
[01:08:49] (CR) Dzahn: [C: 2] modules/coredb_mysql/ sans systemuser [operations/puppet] - https://gerrit.wikimedia.org/r/137994 (owner: Rush)
[01:09:31] (CR) Springle: [C: 1] deprecated syntax in mysql/generic_my.cnf.erb [operations/puppet] - https://gerrit.wikimedia.org/r/143529 (owner: Dzahn)
[01:15:24] (CR) Dzahn: "checked db1001: PASS (noop)" [operations/puppet] - https://gerrit.wikimedia.org/r/137994 (owner: Rush)
[01:17:47] (PS5) Dzahn: dataset-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138000 (owner: Rush)
[01:18:51] (PS6) Dzahn: dataset-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138000 (owner: Rush)
[01:19:30] (CR) Dzahn: [C: 1] dataset-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138000 (owner: Rush)
[01:20:08] (CR) Dzahn: [C: -1] "also add the group" [operations/puppet] - https://gerrit.wikimedia.org/r/138007 (owner: Rush)
[01:20:28] (CR) Dzahn: [C: -1] "set "system => true" for the group" [operations/puppet] - https://gerrit.wikimedia.org/r/137996 (owner: Rush)
[01:20:50] (CR) Dzahn: [C: -1] "system => true for the group" [operations/puppet] - https://gerrit.wikimedia.org/r/137997 (owner: Rush)
[01:21:31] (CR) Dzahn: [C: -1] "please also add the group" [operations/puppet] - https://gerrit.wikimedia.org/r/138003 (owner: Rush)
[01:22:29] (CR) Ori.livneh: [C: 2] asset-check: Use query hack to neither hardcode pagename nor redirect [operations/puppet] - https://gerrit.wikimedia.org/r/137239 (owner: Krinkle)
[01:22:41] (CR) Dzahn: [C: -1] "also add the groups" [operations/puppet] - https://gerrit.wikimedia.org/r/138002 (owner: Rush)
[01:23:22] (CR) Dzahn: [C: 2] bump Bugzilla's TTL back to regular 1H [operations/dns] - https://gerrit.wikimedia.org/r/143675 (owner: Dzahn)
[01:23:31] fyi, I'm going to be merging a bunch of patches by Krinkle to asset-check, which is a monitoring script. This deviates from the conditions under which I am normally allowed to merge changes, but Faidon has previously OK'd it for this series of patches explicitly.
[01:25:27] (CR) Dzahn: [C: -1] "this seems to change more than the message says" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139766 (https://bugzilla.wikimedia.org/66530) (owner: Withoutaname)
[01:26:05] (PS2) Ori.livneh: asset-check: Minor code clean up [operations/puppet] - https://gerrit.wikimedia.org/r/137240 (owner: Krinkle)
[01:26:49] (CR) Dzahn: "eh, nevermind, i guess it's just formatting, not a good reviewer for mediawiki-config though, it's more a platform thing" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139766 (https://bugzilla.wikimedia.org/66530) (owner: Withoutaname)
[01:26:55] (PS3) Ori.livneh: asset-check: Minor code clean up [operations/puppet] - https://gerrit.wikimedia.org/r/137240 (owner: Krinkle)
[01:27:06] (CR) Ori.livneh: [C: 2 V: 2] asset-check: Minor code clean up [operations/puppet] - https://gerrit.wikimedia.org/r/137240 (owner: Krinkle)
[01:29:17] (CR) Dzahn: "could you rebase and/or split it up into a couple smaller chunks? thanks" [operations/puppet] - https://gerrit.wikimedia.org/r/114736 (owner: Tim Landscheidt)
[01:33:57] (PS4) BryanDavis: [WIP] Allow puppetmaster to send logs to logstash [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690)
[01:36:56] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin
[01:37:06] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: Fetching origin
[01:39:49] oh, that's me
[01:39:52] i forgot about that alert
[01:40:56] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin
[01:41:06] RECOVERY - Unmerged changes on repository puppet on palladium is OK: Fetching origin
[01:43:03] (PS5) BryanDavis: [WIP] Allow puppetmaster to send logs to logstash [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690)
[01:44:44] (PS2) Dzahn: puppet module for a tor relay (WIP) [operations/puppet] - https://gerrit.wikimedia.org/r/140948
[01:47:17] that sounds pretty cool
[01:49:22] (CR) Faidon Liambotis: [C: -1] puppet module for a tor relay (WIP) (5 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/140948 (owner: Dzahn)
[01:53:06] legoktm: it does :)
[01:53:36] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:36] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:54:24] it would be cool if we could figure out a way to get TorBlock disabled (figure out a way to let tor users contribute, but keep out the bad ones)...but that has more social challenges than technical
[01:54:26] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[01:54:26] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[01:55:29] yeah, it's also not our choice
[01:55:43] (our = ops or our = foundation probably)
[01:59:54] mhm
[02:00:07] (CR) Withoutaname: "Slightly more, the bug also requested English "Book" as a namespace alias for "Carte"." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139766 (https://bugzilla.wikimedia.org/66530) (owner: Withoutaname)
[02:21:56] (PS6) BryanDavis: [WIP] Allow puppetmaster to send logs to logstash [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690)
[02:25:14] (PS3) Dzahn: puppet module for a tor relay (WIP) [operations/puppet] - https://gerrit.wikimedia.org/r/140948
[02:25:22] (CR) Dzahn: puppet module for a tor relay (WIP) (5 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/140948 (owner: Dzahn)
[02:26:59] (PS1) Rush: phlegal role banner [operations/puppet] - https://gerrit.wikimedia.org/r/143809
[02:27:14] (PS2) Rush: phlegal role banner [operations/puppet] - https://gerrit.wikimedia.org/r/143809
[02:32:30] (PS1) Rush: iridium needs trusty [operations/puppet] - https://gerrit.wikimedia.org/r/143811
[02:32:38] !log LocalisationUpdate completed (1.24wmf10) at 2014-07-03 02:31:35+00:00
[02:32:48] Logged the message, Master
[02:32:49] (PS2) Rush: iridium needs trusty [operations/puppet] - https://gerrit.wikimedia.org/r/143811
[02:34:50] (CR) BryanDavis: "Cherry-picked to deployment-salt and applied. Results can be seen in dashboard at https://logstash-beta.wmflabs.org/#/dashboard/elasticsea" [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: BryanDavis)
[02:36:03] (CR) Rush: [C: 2] "should be ok" [operations/puppet] - https://gerrit.wikimedia.org/r/143811 (owner: Rush)
[02:36:14] (PS3) Rush: phlegal role banner [operations/puppet] - https://gerrit.wikimedia.org/r/143809
[02:36:20] (CR) Rush: [C: 2 V: 2] phlegal role banner [operations/puppet] - https://gerrit.wikimedia.org/r/143809 (owner: Rush)
[03:00:26] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 00:59:44 UTC
[03:03:19] !log LocalisationUpdate completed (1.24wmf11) at 2014-07-03 03:02:15+00:00
[03:03:22] Logged the message, Master
[03:39:53] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 3 03:38:46 UTC 2014 (duration 38m 45s)
[03:39:58] Logged the message, Master
[04:13:00] (CR) Ori.livneh: [C: 1] "Very nice. A couple of nitpicks inline, but LGTM otherwise." (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: BryanDavis)
[04:40:33] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Thu Jul 3 04:40:31 UTC 2014
[04:47:03] (CR) Chad: "What happens if a node gets restarted accidentally in the meantime?" [operations/puppet] - https://gerrit.wikimedia.org/r/143754 (owner: Chad)
[05:13:25] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:14:15] RECOVERY - DPKG on fenari is OK: All packages OK
[05:24:08] <_joe_> yawn
[06:10:18] (PS1) Nemo bis: Disable local uploads where unused, per local consensus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143825 (https://bugzilla.wikimedia.org/67453)
[06:16:12] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:27:54] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:04] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:14] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:14] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:34] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:44] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:44] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:44] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:54] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:28:54] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:54] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:54] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:54] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:04] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:04] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:04] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:04] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:14] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:14] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:14] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:14] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:15] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:24] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:03] the log files were just rotated
[06:30:15] does the new alert work by grepping /var/log/puppet.log?
[06:30:37] <_joe_> ori: no idea
[06:30:42] <_joe_> lemme check something
[06:30:46] -rw-r--r-- 1 root root 0 Jul 3 06:25 /var/log/puppet.log
[06:30:46] -rw-r--r-- 1 root root 2625 Jul 3 06:25 /var/log/puppet.log.1.gz
[06:30:58] <_joe_> (good morning, btw)
[06:31:06] hey, good morning :)
[06:31:40] maybe it works by tailing the file and it doesn't handle SIGHUP well?
[06:31:53] <_joe_> Error 502 on SERVER
[06:32:13] <_joe_> ori: it's a puppetmaster problem, that we need to address soon
[06:32:14] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:32:33] <_joe_> akosiaris and I were debating this yesterday
[06:32:43] <_joe_> it's probably time to add a third puppetmaster
[06:33:07] <_joe_> and upgrade everything to trusty where we can use ruby 2 instead of ruby 1.8
[06:33:22] <_joe_> which was infamous for how much it sucked performance-wise
[06:34:00] <_joe_> (for instance, threading in ruby 1.8 is basically *fake*)
[06:35:14] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.003 second response time
[06:35:25] precise has 1.9, doesn't it?
[06:35:36] it's not the default ruby, but you can upgrade to it and set it as the default, iirc
[06:36:06] and even 1.9 is a vast improvement over 1.8
[06:36:23] <_joe_> ori: no, puppet on precise uses 1.8
[06:36:31] <_joe_> as recommended by puppet labs
[06:36:39] oh right, i remember now
[06:36:44] <_joe_> they specifically say 1.9 is not to be used
[06:36:53] <_joe_> because they're braindead, I suppose
[06:38:14] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.014 second response time
[06:44:44] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:44:54] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:04] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:04] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:04] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:04] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:45:14] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:45:54] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:45:54] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:55] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:46:05] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:46:14] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:46:14] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:46:14] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:46:14] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:46:15] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:46:34] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:46:34] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:46:54] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[06:46:54] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:47:14] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:54:15] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:39] <_joe_> as I told you
[06:57:09] (CR) Tim Landscheidt: "@mutante: I intend to do that, but I'm busy till Tuesday." [operations/puppet] - https://gerrit.wikimedia.org/r/114736 (owner: Tim Landscheidt)
[07:08:49] bd808|BUFFER: YAY!
[07:11:02] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:11:06] (PS1) Matanya: terbium: install python-mysqldb [operations/puppet] - https://gerrit.wikimedia.org/r/143831
[07:11:13] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[07:11:32] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:22] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[07:13:40] matanya: \o/ thanks!
[07:13:52] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[07:13:52] :)
[07:15:28] hey jenkins, what is up ?
[07:20:36] legoktm: now find an op to merge or -1 it
[07:22:53] matanya: any suggested ones? I don't do this very often (read: never) :P
[07:23:10] oh, you added a bunch of reviewers :)
[07:34:34] (PS1) Matanya: puppet monitoring: change lock file path [operations/puppet] - https://gerrit.wikimedia.org/r/143833
[07:35:22] (PS2) QChris: Feed logs from ssl terminators again into webstatscollector's filter [operations/puppet] - https://gerrit.wikimedia.org/r/143775 (https://bugzilla.wikimedia.org/67456)
[07:37:01] (PS5) Filippo Giunchedi: swift: rewrite middle integration test [operations/puppet] - https://gerrit.wikimedia.org/r/143611
[07:37:09] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: rewrite middle integration test [operations/puppet] - https://gerrit.wikimedia.org/r/143611 (owner: Filippo Giunchedi)
[07:41:24] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:42:24] PROBLEM - DPKG on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:43:15] RECOVERY - DPKG on fenari is OK: All packages OK
[07:43:15] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[07:59:53] good morning
[08:00:28] thank you, whoever provided the new Jenkins version on apt.wikimedia.org
[08:00:48] !log upgrading Jenkins (minor version bump 1.554.2 -> 1.554.3)
[08:00:52] Logged the message, Master
[08:03:53] hashar: g'morning, I'm going to bed now, and taking today/tomorrow off, so, see you Monday!
[08:04:12] greg-g: sounds like a good plan. Enjoy your extended weekend!
[08:04:21] will do, camping :)
[08:04:42] greg-g: yeah noticed that on your flickr stream :D kudos for bringing kids with you
[08:04:49] rush out to bed!
[08:04:50] :) :)
[08:04:54] indeed, g'night!
[08:05:10] * greg-g closes IRC and won't look until Sunday night
[08:05:43] (CR) Filippo Giunchedi: [C: -1] "mediawiki::users will need changing too I think?" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/143597 (owner: Giuseppe Lavagetto)
[08:07:13] !log Jenkins restarted
[08:07:19] Logged the message, Master
[08:15:06] hashar: good morning with Question :)
[08:15:28] kart_: sure!
[08:15:32] hashar: We need to have es and ca wikis on Beta Labs.
[08:15:45] /s/have/setup
[08:16:05] What are the requirements/procedures etc?
[08:16:55] with few articles from production is good :)
[08:18:27] why would you need them ? :-D
[08:19:28] hashar: because, we want to use es-ca pair for ContentTranslation.
[08:19:59] Tools like Machine Translation, Dictionary etc are made to work with it.
[08:28:49] (PS2) Nemo bis: Disable local uploads where unused, per local consensus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143825 (https://bugzilla.wikimedia.org/67453)
[08:43:14] random question, is there a way to get emails for every commit made to gerrit repos?
[08:43:38] yes
[08:43:42] yep
[08:43:47] i have that, although I don't really remember anymore how ;)
[08:43:50] must be a setting somewhere
[08:43:55] https://gerrit.wikimedia.org/r/#/settings/projects
[08:45:57] nice! I tried "all projects" which is what I meant, see if that works as expected
[08:46:10] https://www.mediawiki.org/wiki/Gerrit/watched_projects
[08:46:45] There's also https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits but only for PS1 and merge
[08:47:22] godog: Doubt that will work
[08:47:35] Why do you want emails for *all* changes, though? :P
[08:48:12] well everything that's submitted really, for my own curiosity mostly
[08:48:40] I should have been more specific, emails is just a convenient way for me, anything to the same effect would do (e.g. searchable)
[08:49:14] Nemo_bis: that's only for mw though?
[08:49:17] no
[08:49:27] IIRC
[08:50:19] okay I'll try that too
[08:50:52] more questions! is there a way to search the whole codebase? (i'm assuming gerrit == the whole codebase)
[08:51:26] ack-grep :P
[08:51:57] godog: your mail will not like that
[08:52:07] I and Yuvi maintain a checkout of all mediawiki.* repos https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Shared_files
[08:52:19] I did that in the past, and was heavy on mail load
[08:52:36] But most will use github search I think, there is a bug report with some tricks
[08:52:55] matanya: true, I'm sure gmail isn't particularly bothered by that email volume though
[08:53:09] oh, right. gmail
[08:53:43] godog: if you want your mail blocked, do that and subscribe to bugs-l, that will do :)
[08:54:11] hehehe
[08:54:28] it's a pretty low amount of email
[08:54:31] Nemo_bis: ah yeah github search is a good trick indeed, didn't think of that
[08:57:52] <_joe_> mark: matanya does not receive our cronspam
[08:58:11] <_joe_> that's why he thinks that is high-traffic :P
[08:59:24] kart_: yeah sorry lets do it
[08:59:54] kart_: the database might be there already there is a guide at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Add_a_wiki
[09:00:14] kart_: gotta add entries to all-labs.dblist and wikiversions-labs.dat
[09:00:38] the apache conf should be fine
[09:01:53] (PS4) Giuseppe Lavagetto: mediawiki: collect apc variables via diamond [operations/puppet] - https://gerrit.wikimedia.org/r/142250
[09:04:37] (PS1) Hashar: beta: create cawiki and eswiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143845
[09:04:49] wow. That's quick, hashar :)
[09:05:18] kart_: did you have a bug filed? Just wondering
[09:05:55] (CR) Hashar: [C: 2] beta: create cawiki and eswiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143845 (owner: Hashar)
[09:05:58] hashar: no.
[09:06:01] (Merged) jenkins-bot: beta: create cawiki and eswiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143845 (owner: Hashar)
[09:11:37] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: collect apc variables via diamond [operations/puppet] - https://gerrit.wikimedia.org/r/142250 (owner: Giuseppe Lavagetto)
[09:14:43] <_joe_> and I broke puppet on mw*! yay!
[09:14:48] <_joe_> fixing
[09:15:39] <_joe_> no just on my canary host, luckily enough.
[09:15:45] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly
[09:17:15] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Complete puppet failure
[09:18:55] (PS1) Giuseppe Lavagetto: mediawiki-monitoring: use the correct dependency [operations/puppet] - https://gerrit.wikimedia.org/r/143846
[09:19:55] (PS2) Giuseppe Lavagetto: mediawiki-monitoring: use the correct dependency [operations/puppet] - https://gerrit.wikimedia.org/r/143846
[09:25:23] (CR) Giuseppe Lavagetto: [C: 2] mediawiki-monitoring: use the correct dependency [operations/puppet] - https://gerrit.wikimedia.org/r/143846 (owner: Giuseppe Lavagetto)
[09:28:18] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:35:33] hoo: btw "all projects" seems to be valid as a watched project, not sure what will happen https://gerrit-review.googlesource.com/Documentation/user-notify.html#user
[09:36:18] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures
[09:39:23] ok :D Have fun with the spam...
[09:41:16] godog: it's valid but is/was restricted to some privileged users
[09:43:05] oh ok Nemo_bis !
[09:50:21] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[09:52:40] (PS1) Giuseppe Lavagetto: mediawiki-monitoring: fix mountpoints, python class name [operations/puppet] - https://gerrit.wikimedia.org/r/143849
[09:53:31] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 7219 seconds
[09:54:01] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 7255 seconds
[09:54:25] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 7219 seconds Sean Pringle schema change - The acknowledgement expires at: 2014-07-04 11:54:02.
[09:54:25] ACKNOWLEDGEMENT - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 7255 seconds Sean Pringle schema change - The acknowledgement expires at: 2014-07-04 11:54:02.
[09:54:41] (CR) JanZerebecki: puppet module for a tor relay (WIP) (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/140948 (owner: Dzahn)
[09:54:46] <_joe_> oh ok
[09:55:09] sorry for the bother
[09:55:30] didn't schedule quite enough time
[09:56:40] <_joe_> springle: don't worry
[09:56:55] <_joe_> I just wanted you not to wake up, in case :)
[09:57:06] :)
[09:59:20] PROBLEM - MySQL Replication Heartbeat on db72 is CRITICAL: CRIT replication delay 7323 seconds
[09:59:30] PROBLEM - MySQL Slave Delay on db72 is CRITICAL: CRIT replication delay 7332 seconds
[10:00:36] heh
[10:00:44] * springle whistles
[10:06:28] (CR) Giuseppe Lavagetto: [C: 2] mediawiki-monitoring: fix mountpoints, python class name [operations/puppet] - https://gerrit.wikimedia.org/r/143849 (owner: Giuseppe Lavagetto)
[10:10:40] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[10:10:50] https://commons.wikimedia.org/wiki/Special:AbuseLog/526644 brilliant :P
[10:11:47] Anyone doing shell uploads or is something else broken?
[10:19:54] (PS1) Giuseppe Lavagetto: mediawiki-monitoring: correct diamond naming, path [operations/puppet] - https://gerrit.wikimedia.org/r/143851
[10:20:25] <_joe_> hoo: mmmh why do you ask?
[10:20:34] <_joe_> oh, the link above
[10:21:42] <_joe_> interesting, I don't have permissions to see that log
[10:21:54] oO
[10:22:23] <_joe_> hoo: that's my fault for sure
[10:22:36] (CR) Giuseppe Lavagetto: [C: 2] mediawiki-monitoring: correct diamond naming, path [operations/puppet] - https://gerrit.wikimedia.org/r/143851 (owner: Giuseppe Lavagetto)
[10:23:03] _joe_: Do you have an official global account? We should set you the sysadmin bit
[10:23:26] <_joe_> hoo: I do have the official account and I forgot to remind someone to set that
[10:23:43] haha, yeah
[10:24:07] <_joe_> I'll ask mutante this evening :P
[10:24:14] I guess you have to drop Philippe an email or so
[10:24:29] he formally approves those
[10:29:37] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 7 below the confidence bounds
[10:29:57] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:30:28] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:30:28] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:30:28] PROBLEM - check configured eth on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:30:28] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:31:27] RECOVERY - check configured eth on tungsten is OK: NRPE: Unable to read output
[10:31:27] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[10:31:27] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[10:31:28] RECOVERY - DPKG on tungsten is OK: All packages OK
[10:32:47] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[10:33:28] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds
[10:50:50] (PS5) Hashar: zuul: split conf file for server and merger [operations/puppet] - https://gerrit.wikimedia.org/r/141572
[10:52:07] (CR) jenkins-bot: [V: -1] zuul: split conf file for server and merger [operations/puppet] - https://gerrit.wikimedia.org/r/141572 (owner: Hashar)
[10:54:16] (PS6) Hashar: zuul: split conf file for server and merger [operations/puppet] - https://gerrit.wikimedia.org/r/141572
[10:56:07] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[python-yaml] is already declared in file /etc/puppet/modules/diamond/manifests/collector/minimalpuppetagent.pp:13; cannot redeclare at /etc/puppet/modules/zuul/manifests/init.pp:65 on node i-00000311.eqiad.wmflabs
[10:56:10] HOLY SHiT PUPPET
[10:56:18] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures
[10:57:05] hashar: I asked Krinkle|detached about it, he said it was ok
[10:57:27] YuviPanda: puppet failing can't be ok :D
[10:57:39] hashar: I agree.
[10:58:04] hashar: two options - use ensure_packages in both the places
[10:58:23] hashar: hmm, actually that's the only option at this point :|
[10:58:41] another option would be to have python-yaml installed everywhere
[10:58:46] but yeah ensure_packages() would do
[11:00:18] hashar: yeah.
[11:00:25] hashar: we'll have to switch both
[11:00:29] hashar: wanna do it?
[11:01:25] YuviPanda: do it please :-D [11:01:32] though we have two define [11:01:38] ensure_packages (with a S) which comes from stdlib [11:01:44] and ensure_package which is a generic define we made [11:01:47] hashar: right [11:01:53] hashar: I guess we'll use ensure_packages [11:03:36] <_joe_> YuviPanda: ensure_packages did not work in 2.7 [11:03:45] ah [11:03:50] <_joe_> ensure_resource from stdlib should work now on 3 [11:05:03] (03PS7) 10Hashar: zuul: split conf file for server and merger [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 [11:06:17] (03CR) 10Hashar: "PS5 use a common template for both server and merger using $zuul_role to switch" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 (owner: 10Hashar) [11:07:28] _joe_: hmm, are we on puppet 3 everywhere? [11:07:31] in prod as well, I mean? [11:07:44] <_joe_> YuviPanda: yes [11:07:48] _joe_: ah, cool. [11:08:57] (03PS8) 10Hashar: zuul: split conf file for server and merger [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 [11:10:22] (03PS1) 10Yuvipanda: Use ensure_packages for things that include python-yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/143857 [11:10:27] hashar: _joe_ ^ [11:10:42] (03CR) 10jenkins-bot: [V: 04-1] Use ensure_packages for things that include python-yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/143857 (owner: 10Yuvipanda) [11:10:46] hah [11:10:55] (03CR) 10Hashar: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143857 (owner: 10Yuvipanda) [11:11:00] might be Gerrit [11:11:06] (03PS2) 10Yuvipanda: Use ensure_packages for things that include python-yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/143857 [11:11:13] hashar: needed a rebase [11:11:59] <_joe_> YuviPanda: I'm not a fan of ensure_resource btw [11:12:16] _joe_: since it'll still conflict with package resources? 
[11:12:20] <_joe_> I'll take a look though [11:12:42] <_joe_> YuviPanda: no, it's just a sign you made bad design decision, in 90% of cases [11:12:45] I don't know of any other way to do this other than having a packages::python::yaml and including it [11:13:16] _joe_: I did try to not do the package declaration in diamond, but I couldn't figure out how to do that. Would be happy to know of another way! [11:13:19] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [11:13:31] <_joe_> YuviPanda: I'll take a look in 5 [11:13:35] _joe_: ty [11:13:43] <_joe_> if I had to reload apache on all mw-servers [11:13:51] YuviPanda: packages::* is frowned upon for some reason I can't remember about [11:13:52] <_joe_> is there a sane way to do that [11:14:00] <_joe_> instead of using salt brutally? [11:14:10] we used to have an apache-gracefull-all script [11:14:26] hashar: yeah, I've heard that as well, probably because we're duplicating all of apt manually :) I'm not a fan of that [11:14:59] _joe_: modules/apachesync/files/apache-graceful-all that uses dsh [11:15:20] <_joe_> hashar: ok I'll give a look [11:16:04] * YuviPanda has been using pdsh instead, has been quite nice / faster [11:17:39] (03PS9) 10Hashar: zuul: split conf file for server and merger [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 [11:18:55] (03CR) 10Hashar: "Rebased on top of https://gerrit.wikimedia.org/r/#/c/143857/ "Use ensure_packages for things that include python-yaml"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 (owner: 10Hashar) [11:20:00] <_joe_> yeah my doubt was, do we use some sort of splay there? 
[11:20:26] (03PS5) 10Hashar: zuul: migrate statsd_host to zuul::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/141657 [11:21:18] !google define splay [11:23:41] (03PS1) 10Giuseppe Lavagetto: mediawiki-monitoring: deploy to all mw appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/143859 [11:27:59] (03CR) 10Hashar: [C: 031 V: 031] "Solved the duplicate python-yaml definition I had on integration-dev.eqiad.wmflabs . That instance has Zuul installed which has a package" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143857 (owner: 10Yuvipanda) [11:28:49] (03CR) 10Hashar: [C: 031 V: 032] "Works fine on labs integration-dev.eqiad.wmflabs . In prod, would need to double check on gallium.wikimedia.org that everything went fine" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 (owner: 10Hashar) [11:30:51] (03CR) 10Hashar: [C: 031 V: 031] "Works on lab. Will have to make sure statsd_host is properly set in production." [operations/puppet] - 10https://gerrit.wikimedia.org/r/141657 (owner: 10Hashar) [11:37:34] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 119 seconds [11:37:36] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 120 seconds [11:42:23] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [11:58:23] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [12:03:32] (03PS1) 10Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 [12:06:54] (03PS1) 10Springle: db1017 is no longer s5-analytics-slave [operations/puppet] - 10https://gerrit.wikimedia.org/r/143862 [12:09:45] (03CR) 10Springle: [C: 032] db1017 is no longer s5-analytics-slave [operations/puppet] - 10https://gerrit.wikimedia.org/r/143862 (owner: 10Springle) [12:10:24] (03PS2) 10Yuvipanda: diamond: Let diamond read the puppet state file 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 [12:13:16] (03PS3) 10Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 [12:16:16] (03PS4) 10Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 [12:17:55] (03PS5) 10Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 [12:23:04] (03PS6) 10Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 [12:25:55] hola _joe_ do you have sec to help a bit with a graphite check on puppet? [12:29:27] <_joe_> nuria: I'm at lunch, and I'm taking a longer pause today to work with people in SF [12:29:45] <_joe_> I didn't set myself away, in fact, sorry [12:29:46] ok, np, thanks _joe_ [12:45:05] (03PS1) 10Nuria: Graphite monitoring for NavigationTiming to Event Logging [operations/puppet] - 10https://gerrit.wikimedia.org/r/143865 (https://bugzilla.wikimedia.org/67073) [12:46:25] _joe_: any clue whether YuviPanda ensure_packages(['python-yaml']) has a chance to land? :-D [12:47:30] diamond::collector::minimalpuppetagent is only included in role::labs::instance so should be safe for prod :D [12:47:48] (ref https://gerrit.wikimedia.org/r/143857 ) [12:49:39] diamond::collector::minimalpuppetagent? 
[12:49:40] eewww [12:54:07] (03PS5) 10Krinkle: asset-check: Track POST requests, redirects, http4xx, and http5xx [operations/puppet] - 10https://gerrit.wikimedia.org/r/137248 [12:54:09] (03PS3) 10Krinkle: asset-check: Track whether requests are compressed with gzip [operations/puppet] - 10https://gerrit.wikimedia.org/r/137253 [12:54:11] (03PS4) 10Krinkle: asset-check: Use content-length header when response.bodySize is missing [operations/puppet] - 10https://gerrit.wikimedia.org/r/137252 [12:54:13] (03PS3) 10Krinkle: asset-check: Track uncaught exceptions in javascript [operations/puppet] - 10https://gerrit.wikimedia.org/r/137257 [12:54:15] (03PS2) 10Krinkle: asset-check: Use "response.stage" property to filter out duplicates [operations/puppet] - 10https://gerrit.wikimedia.org/r/137242 [12:54:17] (03PS3) 10Krinkle: asset-check: Track number of registered modules and their state [operations/puppet] - 10https://gerrit.wikimedia.org/r/137258 [12:54:19] (03PS2) 10Krinkle: asset-check: Implement --debug [operations/puppet] - 10https://gerrit.wikimedia.org/r/137241 [12:54:41] (03PS1) 10Nuria: Correcting text output of check_graphite script [operations/puppet] - 10https://gerrit.wikimedia.org/r/143868 [12:54:57] git checkout -b master -t origin/production [12:55:24] yay! Never have to deal with 'production' again, so I can finally share my brain again with other terminal tabs open on other git repos [13:05:36] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Complete puppet failure [13:07:38] Invalid parameter hive_server_host at /etc/puppet/manifests/role/analytics/hue.pp:53 on node analytics1027.eqiad.wmnet [13:09:49] sigh. cdh4 vs cdh submodule incompatibility [13:10:37] the gift that keeps on giving [13:10:38] :P [13:10:58] https://commons.wikimedia.org/wiki/File:Sting.ogg [13:13:31] (03CR) 10Faidon Liambotis: [C: 04-1] "Cool stuff!" 
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140948 (owner: 10Dzahn) [13:13:36] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [13:13:50] (03PS3) 10Faidon Liambotis: MX switch, part 4 [operations/dns] - 10https://gerrit.wikimedia.org/r/143560 [13:13:57] (03CR) 10Faidon Liambotis: [C: 032] MX switch, part 4 [operations/dns] - 10https://gerrit.wikimedia.org/r/143560 (owner: 10Faidon Liambotis) [13:23:31] (03PS1) 10Alexandros Kosiaris: Stabilize the output of stdlib's keys function [operations/puppet] - 10https://gerrit.wikimedia.org/r/143870 [13:28:08] (03PS1) 10Aude: Tweak wgPropertySuggesterMinProbability to 0.71, from 0.8 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143871 [13:29:41] (03CR) 10Ottomata: "Trailing whitespace :) Aside from that this is just a comment change, right?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143865 (https://bugzilla.wikimedia.org/67073) (owner: 10Nuria) [13:30:26] (03CR) 10Ottomata: [C: 031] "LGTM, Giuseppe should approve" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143868 (owner: 10Nuria) [13:35:23] (03PS2) 10Alexandros Kosiaris: Remove zfs monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/142569 [13:35:46] (03CR) 10Alexandros Kosiaris: [C: 032] Remove zfs monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/142569 (owner: 10Alexandros Kosiaris) [13:36:17] (03CR) 10Alexandros Kosiaris: [C: 032] Define special jobs priorities [operations/puppet] - 10https://gerrit.wikimedia.org/r/139841 (owner: 10Alexandros Kosiaris) [13:36:42] <_joe_> hashar: hey here I am [13:38:47] _joe_: mind? https://gerrit.wikimedia.org/r/143833 [13:38:58] _joe_: quick review on https://gerrit.wikimedia.org/r/143870 [13:39:01] (03CR) 10Giuseppe Lavagetto: [C: 032] "Thanks nuria! 
Now messages should be clearer" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143868 (owner: 10Nuria) [13:39:21] <_joe_> wow guys you were fast :P [13:39:36] i was before akosiaris :P [13:39:53] I got matanya's change [13:39:59] thanks :) [13:40:20] <_joe_> akosiaris: I'm merging your last change as well, it appears [13:40:30] 2 actually [13:40:39] thanks, I was about too [13:40:45] <_joe_> ok :) [13:42:18] (03PS2) 10Giuseppe Lavagetto: Graphite monitoring for NavigationTiming to Event Logging [operations/puppet] - 10https://gerrit.wikimedia.org/r/143865 (https://bugzilla.wikimedia.org/67073) (owner: 10Nuria) [13:42:28] (03CR) 10Giuseppe Lavagetto: [C: 032] Graphite monitoring for NavigationTiming to Event Logging [operations/puppet] - 10https://gerrit.wikimedia.org/r/143865 (https://bugzilla.wikimedia.org/67073) (owner: 10Nuria) [13:42:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Not the correct lock file. you need /var/lib/puppet/state/agent_disabled.lock. Also note that https://gerrit.wikimedia.org/r/#/c/142560/ s" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143833 (owner: 10Matanya) [13:43:46] (03CR) 10Giuseppe Lavagetto: [C: 031] "While bad in general, we want key ordering in hashes in puppet. LGTM." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143870 (owner: 10Alexandros Kosiaris) [13:44:12] _joe_: thanks [13:44:19] let's see if that makes the differ happier [13:44:27] <_joe_> akosiaris: it should! 
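A minimal sketch of why sorting hash keys steadies a diff, as in the "Stabilize the output of stdlib's keys function" change discussed above. It is Python for illustration only, not the Ruby of the actual patch; `render_stable` and the sample data are hypothetical.

```python
import json


def render_stable(settings):
    """Serialize a mapping with its keys in sorted order."""
    # sort_keys=True makes the output independent of insertion order,
    # so two runs over the same data produce byte-identical text.
    return json.dumps(settings, sort_keys=True)


# Same content, different insertion order (hypothetical sample values).
a = {"port": 8125, "host": "statsd.example"}
b = {"host": "statsd.example", "port": 8125}

# Both render to the same string, so a textual diff between them is empty.
same = render_stable(a) == render_stable(b)
```

Without the sort, a differ comparing two compiled outputs can report changes that are only reorderings of the same keys.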
[13:44:46] <_joe_> akosiaris: try to use the jenkins-integrated one [13:47:13] (03PS2) 10Matanya: puppet monitoring: change lock file path [operations/puppet] - 10https://gerrit.wikimedia.org/r/143833 [13:47:55] akosiaris: fixed [13:48:21] (03PS2) 10Giuseppe Lavagetto: Correcting text output of check_graphite script [operations/puppet] - 10https://gerrit.wikimedia.org/r/143868 (owner: 10Nuria) [13:48:42] (03CR) 10Alexandros Kosiaris: [C: 032] Stabilize the output of stdlib's keys function [operations/puppet] - 10https://gerrit.wikimedia.org/r/143870 (owner: 10Alexandros Kosiaris) [13:50:54] (03PS3) 10Alexandros Kosiaris: apt: minor lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/143271 (owner: 10Matanya) [13:53:21] (03PS3) 10Giuseppe Lavagetto: Correcting text output of check_graphite script [operations/puppet] - 10https://gerrit.wikimedia.org/r/143868 (owner: 10Nuria) [13:53:23] _joe_: I was wondering about YuviPanda patch " ensure_packages('python-yaml') [13:53:24] akosiaris: saw the keys function, didn't know if you were aware of https://github.com/wikimedia/operations-puppet/blob/production/modules/wmflib/lib/puppet/parser/functions/ordered_json.rb [13:53:34] <_joe_> hashar: next in line :) [13:53:34] I too solved that problem again :) and then found ori's [13:54:21] (03CR) 10Giuseppe Lavagetto: [V: 032] Correcting text output of check_graphite script [operations/puppet] - 10https://gerrit.wikimedia.org/r/143868 (owner: 10Nuria) [13:54:40] chasemp: I wasn't, thanks [13:55:21] I know that one is being used in a few places, might break out the sort and json parts to keep it all sane and consolidated, but anyways just a heads up [13:57:42] (03CR) 10Giuseppe Lavagetto: [C: 031] "I think this is one of the cases where usage of ensure_package is legit." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143857 (owner: 10Yuvipanda) [13:57:55] <_joe_> hashar: still not that satisfied, anyway [13:57:58] (03CR) 10Rush: [C: 04-1] "So I totally get the problem I think :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 (owner: 10Yuvipanda) [13:58:14] _joe_: lets migrate out of puppet? :D [13:58:50] _joe_: note we have two version: ensure_package which is a generic definition made by ourself, and ensure_packages that comes from stdlib. [13:59:13] <_joe_> hashar: we could settle on a set of python packages everyone would use and install them everywhere [13:59:36] or include packages::python::yaml *grin* [13:59:47] <_joe_> no, please :) [14:00:29] (03PS2) 10Giuseppe Lavagetto: mediawiki-monitoring: deploy to all mw appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/143859 [14:00:38] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki-monitoring: deploy to all mw appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/143859 (owner: 10Giuseppe Lavagetto) [14:00:59] <_joe_> Reedy, bd808|BUFFER : this change will make all hosts collect apc metrics [14:01:05] <_joe_> ping me for the details [14:01:37] <_joe_> (I'll send an email later anyway) [14:02:06] <_joe_> hashar: jenkins is lagging behind me [14:02:08] <_joe_> :) [14:03:14] <_joe_> mh that change actually removes it from the only host where it was installed, :P [14:04:57] _joe_: who else should review / merge the ensure_packages() thing? 
[14:05:23] <_joe_> hashar: oh sorry, you need a +2 [14:05:32] <_joe_> ok [14:05:36] <_joe_> give me 2 mins [14:06:25] (03PS1) 10Giuseppe Lavagetto: mediawiki: enable the monitoring vhost everywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/143875 [14:06:30] hehe, seems like the keys.sort trick worked :-) [14:06:53] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: enable the monitoring vhost everywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/143875 (owner: 10Giuseppe Lavagetto) [14:07:54] (03CR) 10Alexandros Kosiaris: [C: 032] puppet monitoring: change lock file path [operations/puppet] - 10https://gerrit.wikimedia.org/r/143833 (owner: 10Matanya) [14:08:39] (03PS3) 10Giuseppe Lavagetto: Use ensure_packages for things that include python-yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/143857 (owner: 10Yuvipanda) [14:08:47] (03CR) 10Giuseppe Lavagetto: [C: 032] Use ensure_packages for things that include python-yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/143857 (owner: 10Yuvipanda) [14:11:13] (03CR) 10Ottomata: [C: 031] "Thanks Christian!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143775 (https://bugzilla.wikimedia.org/67456) (owner: 10QChris) [14:12:47] ottomata: Invalid parameter hive_server_host at /etc/puppet/manifests/role/analytics/hue.pp:53 on node analytics1027.eqiad.wmnet [14:13:05] cdh4 vs cdh incompatibility me thinks [14:14:40] <_joe_> hashar: merged [14:14:57] oo thanks akosiaris [14:15:12] _joe_: great :] [14:17:57] _joe_: I reworked a bit my zuul patches. Could use a review then get them merged tomorrow or on monday. 
https://gerrit.wikimedia.org/r/141572 and https://gerrit.wikimedia.org/r/141657 I have added you as reviewer on both [14:18:36] <_joe_> hashar: I'll do my best [14:24:56] (03PS6) 10Hashar: zuul: patch of doom (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 [14:25:02] (03PS1) 10Ottomata: Fix hue error in production [operations/puppet] - 10https://gerrit.wikimedia.org/r/143878 [14:27:13] (03CR) 10Hashar: "Rebased / fixed conflict caused by zuul:config" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 (owner: 10Hashar) [14:28:04] (03CR) 10Ottomata: [C: 032 V: 032] Fix hue error in production [operations/puppet] - 10https://gerrit.wikimedia.org/r/143878 (owner: 10Ottomata) [14:33:01] (03CR) 10Hashar: "This is apparently a noop on my test instance which is expected;" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141663 (owner: 10Hashar) [14:33:22] see you tomorrow! [14:36:30] (03PS4) 10Krinkle: Add puppet module for a tor relay (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140948 (owner: 10Dzahn) [14:36:59] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [14:36:59] (03CR) 10Krinkle: "Moved dangling RT line from body to the footer" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140948 (owner: 10Dzahn) [14:52:28] Krinkle: Will you push that json2 revert? [14:52:51] chasemp: so the problem with https://gerrit.wikimedia.org/r/#/c/143861/ is that the other ones that use sudo just run a command, while in this one I've to figure a way out for a python process that's running as one user to be able to read one particular file as another user, so that's going to be a bit... ugly [14:53:19] so I'll have to open an external process that runs as that user and does something silly like cat that into stdout which I can then read.. [14:58:35] yuvi, in this case could we not allow the diamond to read that status file instead of doing that? 
[14:58:55] I know it at first seems weird as each collector is a piece of the overall, but in reality we are trusting it to do this either way [15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140703T1500) [15:00:35] Nothing to SWAT this morning [15:03:09] chasemp: I'm confused. I think setting the chmod is more secure, since anything with actual sensitive info is protected at the file level [15:03:31] chasemp: and this also makes it possible to actually have puppet freshness checks on labs [15:03:39] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 963.278549634 [15:08:24] YuviPanda: asking other people for some input, I'm not against this entirely, just not sure [15:11:04] chasemp: alright. I'm ok with writing the shell-out-to-read-file solution as well, if you want. Just is a bit unclean [15:11:26] yeah I agree, I don't have an affection for that either :) [15:11:59] chasemp: :) [15:17:22] hoo|away: I will not push that revert unless there is reason to believe it will fix the issue [15:17:36] hoo|away: And we'll have to revert other things first because modules depend on this now [15:20:22] (03PS5) 10Giuseppe Lavagetto: nutcracker: move config in puppet, work with trusty packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/143597 [15:26:09] (03PS1) 10Faidon Liambotis: Move "mail aliases to OIT" cron to role::mail::mx [operations/puppet] - 10https://gerrit.wikimedia.org/r/143886 [15:26:11] (03PS1) 10Faidon Liambotis: mail: remove secondary MX role from sodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/143887 [15:27:58] (03PS2) 10Faidon Liambotis: mail: remove secondary MX role from sodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/143887 [15:28:00] (03PS2) 10Faidon Liambotis: Move "mail aliases to OIT" cron to role::mail::mx 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143886 [15:29:17] (03CR) 10jenkins-bot: [V: 04-1] mail: remove secondary MX role from sodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/143887 (owner: 10Faidon Liambotis) [15:31:34] (03PS3) 10Faidon Liambotis: mail: remove secondary MX role from sodium [operations/puppet] - 10https://gerrit.wikimedia.org/r/143887 [15:33:31] (03CR) 10Faidon Liambotis: [C: 032] Move "mail aliases to OIT" cron to role::mail::mx [operations/puppet] - 10https://gerrit.wikimedia.org/r/143886 (owner: 10Faidon Liambotis) [15:44:05] ori: ping [15:48:56] ori: re DefaultType, I think we've run the experiment long enough. Today's grep is typical of what I've been seeing all along (only misc services stuff): http://paste.debian.net/hidden/b18af8b8/ [15:49:09] ori: go ahead and remove the varnish hack part now? [15:49:52] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:49:52] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:50:03] Do you have graphs for the load on the event-logging infrastructure? [15:50:09] (and, optionally, if anyone thinks those files really need explicit app/os content types, go update apache somewhere to be explicit for those in the log) [15:50:12] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:50:22] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:50:32] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:50:32] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:50:32] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:50:52] PROBLEM - check configured eth on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:51:02] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning. [15:51:12] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [15:51:22] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning. [15:51:22] RECOVERY - DPKG on tungsten is OK: All packages OK [15:51:22] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning. [15:51:42] RECOVERY - check configured eth on tungsten is OK: NRPE: Unable to read output [15:51:42] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1302 seconds ago with 0 failures [15:51:42] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical [15:51:48] rillke: https://ganglia.wikimedia.org/latest/?c=Analytics%20Kafka%20cluster%20eqiad&m=cpu_report&r=hour&s=descending&hc=4&mc=2 I think [15:52:09] <_joe_> who should I ask about a change in one of the scap tools? [15:52:19] _joe_: Me? [15:52:22] What's up? [15:52:23] <_joe_> I need to do it, in fact :) [15:53:32] <_joe_> bd808: I said "scap", I thought that triggered some alarm :P [15:53:47] So what happens if I pour coffee on the servers? [15:53:49] rillke: There are a couple... but there's nothing interesting going on AFAIS [15:54:26] _joe_: It pings me but hopefully doesn't set off warning alarms these days ;) [15:54:27] Bsadowski1: coffee.wikimedia.org [15:54:31] Bsadowski1: they might die... or not [15:55:14] friend of mine poured a whole pot of tea over two desktops and they stayed on [15:56:43] _joe_: Adding that script seems reasonable and easy. 
The ::mediawiki::sync puppet class will need to be changed too to add /usr/local/bin symlinks [15:57:11] (03CR) 10BBlack: [C: 031] retab wikipedia zone and fix aligning [operations/dns] - 10https://gerrit.wikimedia.org/r/143208 (owner: 10Dzahn) [15:57:45] <_joe_> bd808: I am on it [15:57:47] (03CR) 10BBlack: [C: 031] wikimediafoundation - align and tabs [operations/dns] - 10https://gerrit.wikimedia.org/r/143212 (owner: 10Dzahn) [15:58:22] _joe_: ::mediawiki:users has a sudo grant for the old script too [15:58:26] <_joe_> bd808: btw, once we graceful-restart apaches, you will see apc metrics in graphite [15:58:33] (03CR) 10Mark Bergsma: [C: 031] retab wikipedia zone and fix aligning [operations/dns] - 10https://gerrit.wikimedia.org/r/143208 (owner: 10Dzahn) [15:58:34] <_joe_> bd808: already covered [15:58:46] <_joe_> (and that should be moved elsewhere imo) [15:59:15] * bd808 nods [15:59:18] <_joe_> bd808: does an apache-graceful-all get issued with every code release right? [15:59:24] (03CR) 10Mark Bergsma: [C: 031] wikimediafoundation - align and tabs [operations/dns] - 10https://gerrit.wikimedia.org/r/143212 (owner: 10Dzahn) [15:59:34] Nope. We hardly ever restart an apache [15:59:42] <_joe_> eh ok [15:59:57] <_joe_> I need to do that for apc to be collectible [15:59:59] The graceful-all script is root only actually I think [16:00:07] <_joe_> I hope so :) [16:00:39] scap and syncing apache-config wasnt really related so far [16:00:47] Once upon a time there was a script that allowed a deployer to graceful a single apache but it doesn't exist any more [16:01:14] Sam was looking for it a few days ago when on apache was having apc thrash issues [16:01:18] s/on/one/ [16:02:21] _joe_: is puppet-compiler fixed for puppet3? 
[16:02:34] Build has been executing for 6 min 59 sec on puppet-compiler02.eqiad.wmflabs [16:02:37] I guess not :) [16:02:38] bd808: _joe_ : apache-graceful is on fenari [16:02:48] <_joe_> paravoid: it was, you broke it again :P [16:02:56] <_joe_> let me check [16:02:57] :( [16:03:04] bd808: _joe_: we can move that as well to tin, so far tin just had "apache-graceful-all" [16:03:27] (03PS6) 10BBlack: Make GeoIP lookup code safer [operations/puppet] - 10https://gerrit.wikimedia.org/r/136655 (https://bugzilla.wikimedia.org/64582) (owner: 10Ori.livneh) [16:03:38] <_joe_> paravoid: you somehow always hit this crappy bug of the htmldiff lib I cannibalized [16:03:49] (03CR) 10BBlack: "^ Just manual rebase to latest prod" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136655 (https://bugzilla.wikimedia.org/64582) (owner: 10Ori.livneh) [16:03:53] <_joe_> it's the second time it happens, the second time it happens to you [16:04:00] :P [16:04:07] robla: call? :) [16:04:48] <_joe_> paravoid: I should really use something better there [16:04:50] <_joe_> sigh [16:05:34] <_joe_> paravoid: http://puppet-compiler.wmflabs.org/114/change/143887/diff/ [16:05:45] :D [16:07:54] Reedy: Let me know when it would be safe to update scap on the cluster to deploy a new script to restart the next gen twemproxy. I don't think it's any rush. [16:08:22] Lydia_WMDE: sorry, I'm late. Also having problems joining the hangout [16:08:53] robla: will invite you. maybe that helps [16:08:55] robla: Will Chris work today? [16:09:23] hoo: Chris Steipp is on vacation [16:10:29] robla: Ah, pity... when can I expect him back? Got nothing really critical, but there are a couple of htings [16:11:15] hoo: I think he's back on Monday [16:11:24] Ok, that's good enough [16:11:32] s/good/early/ [16:14:39] yo godog, yt? 
[16:15:10] (03PS5) 10Dzahn: Add puppet module for a tor relay (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140948 [16:15:51] Lydia_WMDE: I don't know, we may have to try Skype or something....this is getting ridiculous [16:16:05] robla: ok lydia.pintscher [16:16:15] remembers Big Blue Button from that day when Google was down [16:17:49] (03PS7) 10BBlack: Make GeoIP lookup code safer [operations/puppet] - 10https://gerrit.wikimedia.org/r/136655 (https://bugzilla.wikimedia.org/64582) (owner: 10Ori.livneh) [16:18:34] (03PS2) 10Dzahn: retab wikipedia zone and fix aligning [operations/dns] - 10https://gerrit.wikimedia.org/r/143208 [16:18:43] (03CR) 10jenkins-bot: [V: 04-1] retab wikipedia zone and fix aligning [operations/dns] - 10https://gerrit.wikimedia.org/r/143208 (owner: 10Dzahn) [16:19:49] (03PS3) 10Dzahn: retab wikipedia zone and fix aligning [operations/dns] - 10https://gerrit.wikimedia.org/r/143208 [16:21:11] (03PS4) 10Dzahn: retab wikipedia zone and fix aligning [operations/dns] - 10https://gerrit.wikimedia.org/r/143208 [16:22:06] (03PS1) 10Alexandros Kosiaris: bacula-dir reads a copy of puppet's private key [operations/puppet] - 10https://gerrit.wikimedia.org/r/143901 [16:22:20] (03PS1) 10Chad: Fix Cirrus on 11 wikis in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143902 [16:22:49] (03CR) 10Dzahn: [C: 032] retab wikipedia zone and fix aligning [operations/dns] - 10https://gerrit.wikimedia.org/r/143208 (owner: 10Dzahn) [16:23:52] _joe_: list of templates to fix? 
[16:23:56] (03CR) 10Chad: [C: 032] Fix Cirrus on 11 wikis in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143902 (owner: 10Chad) [16:24:03] (03Merged) 10jenkins-bot: Fix Cirrus on 11 wikis in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143902 (owner: 10Chad) [16:24:07] (03PS6) 10Giuseppe Lavagetto: nutcracker: move config in puppet, work with new packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/143597 [16:24:07] <_joe_> matanya: ouch, you're right [16:24:15] <_joe_> matanya: will do today, promised [16:24:26] ottomata: yup [16:24:30] <_joe_> godog, paravoid I surrendered :) [16:24:37] <_joe_> 1 package to rule them all [16:24:46] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly [16:25:08] _joe_: ? [16:25:21] (03CR) 10jenkins-bot: [V: 04-1] nutcracker: move config in puppet, work with new packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/143597 (owner: 10Giuseppe Lavagetto) [16:25:24] <_joe_> paravoid: to unifying nutcracker and twemproxy :) [16:25:27] I just built precise packages for you [16:25:35] <_joe_> paravoid: <3 [16:26:07] !seen hashar [16:26:10] _joe_: nice! thanks for that [16:26:23] godog, it was you i was talking to about GC tuning before, right? [16:26:29] paravoid: the absense of tasks seems to make us lose s contributer [16:26:43] or more than one [16:26:47] ottomata: yes! found anything new? [16:27:17] well, maybe [16:27:18] http://mail-archives.apache.org/mod_mbox/kafka-users/201407.mbox/%3CCAFbh0Q2f71qgs5JDNFxkm7SSdZyYMH=ZpEOxotuEQfKqeXQHfw@mail.gmail.com%3E [16:27:23] matanya: lint openstack.pp ?:) [16:27:27] see my mail there yesterday and the reply [16:27:29] matanya: i started... 
[16:27:54] on my list mutante [16:28:18] sadly very little spare time lately [16:28:32] matanya: heh, that was a reply to "lack of tasks" though [16:28:59] not everybody is a lint crazy like me [16:29:18] people want "real" tasks [16:29:38] <_joe_> matanya: the work you did on puppet3 was a very real task [16:29:50] <_joe_> without your help we'd still be on puppet 2.7 [16:30:01] <_joe_> 'chore' != 'not a real task' [16:30:07] <_joe_> :) [16:30:18] matanya: "Finalize and deploy ganglia_new puppet module" ? [16:30:32] that is a good one [16:30:46] i can help review and guide that too, probably [16:30:48] RT #6883 [16:31:07] i'll pass it on to relevant parties [16:31:08] godog, any thoughts on that? [16:31:19] (the kafka gc thing?) [16:31:55] thanks. i'll stick to house keeping until i have more time [16:32:17] (03PS1) 10Giuseppe Lavagetto: mediawiki-monitoring: remove unused conditional [operations/puppet] - 10https://gerrit.wikimedia.org/r/143903 [16:32:29] James_F, what to deploy? [16:32:44] MaxSem: https://gerrit.wikimedia.org/r/143896 [16:32:49] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki-monitoring: remove unused conditional [operations/puppet] - 10https://gerrit.wikimedia.org/r/143903 (owner: 10Giuseppe Lavagetto) [16:33:22] greg-g, I'm doing an emergency deployment for James_F [16:33:23] speaking of ganglia/ganglia_new: anyone already have a handle on what's involved in relocating a ganglia_aggregator role to a new host? 
[16:33:32] ottomata: yeah that's curious, not sure why it'd be only on one broker though [16:33:53] (03PS2) 10Dzahn: wikimediafoundation - align and tabs [operations/dns] - 10https://gerrit.wikimedia.org/r/143212 [16:34:11] yeah [16:34:24] !log apt: uploading nutcracker backport for precise [16:34:30] Logged the message, Master [16:34:42] well, godog, jgage and I thought maybe it was because of cpufreq mismatch [16:35:03] an21 was configured to use Sytem Profile = Performance Per Wat or something [16:35:12] which was causing its cpu speed to be different than other brokers [16:35:14] (03CR) 10Dzahn: [C: 032] wikimediafoundation - align and tabs [operations/dns] - 10https://gerrit.wikimedia.org/r/143212 (owner: 10Dzahn) [16:35:18] but, jgage fixed that yesterday [16:35:26] was hoping the problem would go away, but it happened again a couple of hours ago [16:36:01] ottomata: only that today though? [16:37:20] (03CR) 10Yuvipanda: "Couple of reasons I didn't do this as previlage escalation:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 (owner: 10Yuvipanda) [16:37:31] (03PS7) 10Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 [16:37:55] chasemp: mind replying on the patchset (and/or/maybe) removing the -1? :) [16:38:47] MaxSem: Do you need to sync? [16:38:52] (03CR) 10Matanya: [C: 031] replace deprecated erb template variable syntax [operations/puppet] - 10https://gerrit.wikimedia.org/r/143526 (owner: 10Dzahn) [16:38:56] !log maxsem Synchronized php-1.24wmf11/extensions/EventLogging/: bug 67420 (duration: 00m 35s) [16:38:57] (03CR) 10Dzahn: [C: 031] delete anything 'toolserver' [operations/dns] - 10https://gerrit.wikimedia.org/r/143209 (owner: 10Dzahn) [16:38:58] James_F, please verify ^^^ [16:39:02] Logged the message, Master [16:39:02] Ha. [16:39:05] Yep sorry lunch, but the consensus seemed to be subprocess and sudo is least worst [16:39:23] godog? hm? 
[16:39:27] YuviPanda ^ [16:39:40] (03CR) 10Dzahn: [C: 032] replace deprecated erb template variable syntax [operations/puppet] - 10https://gerrit.wikimedia.org/r/143526 (owner: 10Dzahn) [16:39:40] well, godog, i only notice it when zookeeper actually times out [16:39:53] which means that kafka hasn't talked to it in like 6 seconds or something [16:40:05] chasemp: but wouldn't sudo let diamond be puppet, which lets it read *all* the things, rather than just the things puppet has let other people read? [16:40:17] and, i only notice it because that broker then gets demoted, and no longer is leader for any partitions [16:40:26] and when any given broker's incoming traffic drops [16:40:28] icinga alerts [16:40:30] chasemp: I'm also wondering if this permission thing is a side effect of our upgrade or something, since I see no similar complaints elsewhere [16:40:39] ottomata: oh ok, I meant "only that occurence today" as opposed to several a day [16:40:49] oh, hm, no, recently it has been serveral a day [16:40:57] i betcha if i promote this broker to leader again [16:40:58] (03PS2) 10Giuseppe Lavagetto: mediawiki-monitoring: remove unused conditional [operations/puppet] - 10https://gerrit.wikimedia.org/r/143903 [16:41:04] this would happen again sometime in a few hours [16:41:06] (03CR) 10Giuseppe Lavagetto: [V: 032] mediawiki-monitoring: remove unused conditional [operations/puppet] - 10https://gerrit.wikimedia.org/r/143903 (owner: 10Giuseppe Lavagetto) [16:42:03] ottomata: did it happen some time past 14.00 utc when the network out drops? [16:42:38] and, hmmm, godog, i'm grepping the gc log for what I posted yesterday [16:42:45] where it said real=11.47 secs [16:43:08] grepping for real=1 [16:43:10] or real=2 [16:43:11] etc. [16:43:17] don't see anything bigger than real=1.16 secs [16:43:27] YuviPanda will be back in 20 or so :) [16:43:30] chasemp: ok! 
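Editorial sketch of the GC-log search described above: rather than grepping for `real=1`, `real=2` and so on, one pass can pull out every pause above a threshold. This is a minimal sketch, assuming the log uses the HotSpot `[Times: user=... sys=..., real=N secs]` layout; the sample lines and the 6-second threshold (taken from the "like 6 seconds or something" remark) are illustrative, not copied from the real broker log.

```python
import re

# Matches the wall-clock part of a HotSpot "[Times: ...]" entry.
REAL_RE = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def long_pauses(lines, threshold_secs):
    """Yield (line_number, seconds) for GC entries at or above the threshold."""
    for number, line in enumerate(lines, start=1):
        for match in REAL_RE.finditer(line):
            seconds = float(match.group(1))
            if seconds >= threshold_secs:
                yield number, seconds


# Hypothetical sample entries, only to show the shape of the input.
sample = [
    "[Times: user=0.10 sys=0.00, real=1.16 secs]",
    "[Times: user=0.40 sys=0.02, real=11.47 secs]",
    "[Times: user=0.01 sys=0.00, real=0.02 secs]",
]

print(list(long_pauses(sample, 6.0)))  # prints [(2, 11.47)]
```

Comparing the line numbers against the timestamps of the ZooKeeper timeouts would show whether a long pause actually lines up with a broker demotion.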
[16:43:51] and none of those are close to this recent zk timeout [16:44:27] hmm, i think I could work around this problem by setting the zk server socket timeout to 15ish seconds... [16:44:50] whenver I see this on the broker, it takes the broker about 11 or 12 seconds to recover from whatever happened to it [16:44:56] and by that time zk has closed the socket [16:45:44] (03CR) 10Dzahn: [C: 031] terbium: install python-mysqldb [operations/puppet] - 10https://gerrit.wikimedia.org/r/143831 (owner: 10Matanya) [16:46:01] (03PS2) 10Matanya: terbium: install python-mysqldb [operations/puppet] - 10https://gerrit.wikimedia.org/r/143831 [16:50:45] ottomata: yeah that could be a stopgap, I'm checking a couple of things in the logs [16:51:15] (03PS7) 10Giuseppe Lavagetto: nutcracker: move config in puppet, work with new packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/143597 [16:51:29] <_joe_> bbiab [16:52:32] _joe_: not sure if you sae the !log, packages for precise are in apt now [16:54:44] (03PS7) 10BryanDavis: [WIP] Allow puppetmaster to send reports to logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) [16:56:12] wondering if/what keeps us from https://gerrit.wikimedia.org/r/#/c/143201/1/templates/wikimedia.org [16:56:14] <_joe_> paravoid: had some issues with TOR [16:56:24] <_joe_> so no, I didn't [16:56:50] <_joe_> paravoid: I need to do quite some tests before this goes live :) [16:59:58] (03PS8) 10BryanDavis: [WIP] Allow puppetmaster to send reports to logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) [17:00:54] jgage: something for you maybe ? "Forgive my naiveté, but does it make sense to work with Internet Archive on this, since they already have an initiative to cache pages and to guard against link rot [9]? 
They're also an organization that doesn't mind adding 24 terabytes in one shot" [17:01:33] lol [17:01:40] Nemo_bis too ^^ [17:03:23] 24 TB is definitely breadcrumbs for them [17:03:55] The only trick is not to try putting 2 TB in a single item as I did [17:04:11] "Visit their lobby in San Francisco and you'll see the remnants of hundreds of 4 TB external hard drive casings" [17:04:26] (03CR) 10Ori.livneh: nutcracker: move config in puppet, work with new packages (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143597 (owner: 10Giuseppe Lavagetto) [17:05:27] (03CR) 10Giuseppe Lavagetto: nutcracker: move config in puppet, work with new packages (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143597 (owner: 10Giuseppe Lavagetto) [17:06:14] mutante: I already put a photo of those on their article [17:06:23] Well, of the boxes [17:06:37] after or before the fire? [17:06:55] mutante: where did the quote come from? [17:06:57] lol [17:06:57] Before, but that doesn't matter, the fire didn't go anywhere near the servers [17:06:59] i still need to go, just saw it from outside after it burned [17:07:02] YuviPanda: https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/Newsroom/Suggestions#Reflinks_is_dead [17:07:09] ok, good [17:07:12] ah [17:07:20] YuviPanda: https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/Newsroom/Suggestions#Reflinks_is_dead [17:07:26] mutante: I didn't make any photo, SketchCow did https://en.wikipedia.org/wiki/File:Incoming_additional_storage_at_Internet_Archive.jpg [17:07:27] mutante: yeah, reading now [17:07:50] haha, nice [17:08:11] consumer grade though? [17:08:24] re: how cheap is 24TB really [17:10:54] mutante: they just buy external disks [17:11:09] Then do what with them? [17:11:27] Take the HDD out, put on servers :) [17:11:31] What servers? [17:11:40] Labour to do that? 
[17:11:42] petaboxes I suppose [17:11:49] For 24TB storage, you also need more than 24TB [17:11:53] They have employees [17:12:15] They keep two copies [17:12:25] So, 48TB [17:12:45] Plus some form of raid [17:12:51] few extra drives, spares etc [17:12:57] mounts their storage on a labs instance via sshfs [17:13:20] I'm not sure it's cost efficient paying full time employees to dismantle tens of external hard drives [17:13:27] I never saw RAIDs or extra drives [17:13:49] I'm sure Brewster Kahle will welcome your operational suggestions [17:13:51] https://archive.org/~tracey/mrtg/df.html [17:13:58] That sounds pretty scary [17:14:01] (He pays with his own money, so he cares) [17:14:14] * jgage coughs [17:14:22] hehee [17:14:28] jgage: tell us more ?:) [17:14:39] well i don't want to slander [17:14:48] i worked at IA, they do things on the cheap [17:14:52] to the extreme [17:15:05] <_joe_> jgage: which is "a good thing" [17:15:12] not the way they do it [17:15:18] (03CR) 10Andrew Bogott: [C: 032] openstack-unquoted file modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/143791 (owner: 10Dzahn) [17:15:18] <_joe_> given the scale of what they try to do [17:15:34] they have a lot less servers now than when i was there :) [17:15:39] yay increasing hdd sizes [17:16:02] heh [17:16:29] <_joe_> I guessed they had home-made multi-sata-controller servers with shitloads of them, a la google at the beginning [17:16:42] <_joe_> like 100 disks per server [17:16:43] man I don't want to be involved, but is this not all academic at this point? The tool for said storage that doesn't exist doesn't exist itself? 
[17:17:05] hehe consumer grade [17:17:12] when i got there, everything was desktop mini-towers [17:17:34] <_joe_> jgage: which nowadays are neither cost nor energy efficient [17:17:49] especially after a liebert failed and the room got up to 150F [17:17:49] <_joe_> If I had to run the IA, I'll use atoms probably [17:18:00] yeah we tried something like that [17:18:02] with via C3 [17:18:03] <_joe_> very small ram [17:18:09] <_joe_> a lot of disks [17:18:12] <_joe_> that's it [17:18:27] <_joe_> then if you want to be fast, you build some servers for indexing [17:18:31] sun thumpers worked out better [17:18:50] is this related to the 24TB blah blah thing? [17:18:55] Yup [17:19:05] <_joe_> robh: not really [17:19:07] OCR consumes a lot of RAM (but they have only a few dozens instances doing OCR) [17:19:08] cuz the cost of that is far more than consumer level disks, we dont use consumer grade disks in raid arrays, due to the fact they fail! [17:19:20] k [17:19:26] <_joe_> we turned that in an interesting architectural discussion [17:19:48] <_joe_> Nemo_bis: I was talking about the store and show part [17:20:01] <_joe_> I guess heritrix is running on faster boxes for sure [17:20:21] jgage: they have/had thumpers? 
wow [17:21:34] special nonprofit pricing i'm sure [17:22:12] (03PS1) 10Ottomata: Parameterize zookeeper.session.timeout.ms [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/143910 [17:22:22] jgage: ^ [17:22:58] * jgage looks [17:23:14] (03PS2) 10Ottomata: Parameterize zookeeper.session.timeout.ms [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/143910 [17:23:18] (03CR) 10Andrew Bogott: [C: 032] openstack - aliging and quoting [operations/puppet] - 10https://gerrit.wikimedia.org/r/143794 (owner: 10Dzahn) [17:24:29] ottomata: bah I can't find anything that jumps the eye, except that the ganglia page for the single box takes a while to load now :( [17:25:36] ha, yeah, lots of metrics there :/ [17:25:57] (03CR) 10Tim Landscheidt: [C: 04-1] "Only the user accounts (apart from Merlissimo's and OSM (?)) have been disabled; the servers itself are still in use! Do not delete the D" [operations/dns] - 10https://gerrit.wikimedia.org/r/143209 (owner: 10Dzahn) [17:26:53] (03CR) 10Aaron Schulz: "Sorry, I didn't get to look at this yesterday. Looks fine to me." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143611 (owner: 10Filippo Giunchedi) [17:27:32] jgage: next step would be to assemble the thumpers-like things from backblaze [17:28:05] yeah, i think they may have some of those [17:28:19] sadly everyone i knew at IA has left. which tells you something. [17:29:21] (03CR) 10Gage: [C: 032] "discussed on IRC" [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/143910 (owner: 10Ottomata) [17:29:23] YuviPanda: so a few things. 1. I do suspect that it might possibly be odd upgrade behavior, but the labs host I'm testing on, which is precise, is on 2.7 and has the odd perms, and seems bent on not upgrading. 2. that isn't actually how sudo works in this case, yes you can say allow user x to issue as user y, but generally sudo perms exist in a vacuum of "this one thing is privileged". Thus allowing the read of that file presents no risk of the diamond user jumping the shark and becoming or acting as puppet user, or anyone else 
[17:30:34] riddle me this: Why does gerrit highlight the first whitespace diff correctly but the second incorrectly? Or is the second somehow marked correctly in a way that eludes me? http://bogott.net/misc/gerritdiff.png [17:30:37] (03PS1) 10Ottomata: Parameterize tickTime, initLimit and syncLimit [operations/puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/143912 [17:30:43] jgage ^ :) [17:31:12] chasemp: aaah, right. forgot we can restrict sudo to one single command. [17:33:16] <^d> andrewbogott: Because gerrit's syntax highlighter on the old change screen is bad. [17:33:49] YuviPanda: I think I'm getting puppet to upgrade here and I will relay shortly the perms outcome [17:34:08] ^d: I don't understand… are you saying something more specific than 'because gerrit is broken'? [17:35:18] !next [17:36:05] (03CR) 10Gage: [C: 032] Parameterize tickTime, initLimit and syncLimit [operations/puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/143912 (owner: 10Ottomata) [17:36:11] YuviPanda: I upgraded a long puppet disabled host using just a normal run [17:36:22] perms: -rw-r--r-- 1 root root 610 Jul 3 17:35 /var/lib/puppet/state/last_run_summary.yaml [17:36:41] <^d> andrewbogott: No, just a generic "it sucks and doesn't surprise me :(" [17:36:46] so I've seen trusty hosts with correct perms, and precise hosts with correct perms [17:36:49] where are the bad perms? 
[17:36:54] ^d: ok :( [17:38:07] (03PS1) 10Reedy: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143914 [17:38:37] (03CR) 10Reedy: [C: 032] Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143914 (owner: 10Reedy) [17:38:43] (03Merged) 10jenkins-bot: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143914 (owner: 10Reedy) [17:38:50] jouncebot: next [17:38:50] In 0 hour(s) and 21 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140703T1800) [17:38:52] (03PS2) 10Reedy: Remove old static-stable [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143780 [17:38:59] (03CR) 10Reedy: [C: 032] Remove old static-stable [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143780 (owner: 10Reedy) [17:39:05] (03Merged) 10jenkins-bot: Remove old static-stable [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143780 (owner: 10Reedy) [17:40:45] panic [17:41:01] * aude thinks scribunto is broken [17:41:07] at least its tests [17:41:20] (03PS1) 10Reedy: testwiki to 1.24wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143915 [17:41:22] (03PS1) 10Reedy: Wikipedias to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143916 [17:41:24] (03PS1) 10Reedy: group0 to 1.24wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143917 [17:41:50] (03PS1) 10Ottomata: Set zookeeper kafka client timeout to 16 seconds [operations/puppet] - 10https://gerrit.wikimedia.org/r/143918 [17:41:57] (03CR) 10Reedy: [C: 032] testwiki to 1.24wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143915 (owner: 10Reedy) [17:42:17] (03Merged) 10jenkins-bot: testwiki to 1.24wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143915 (owner: 10Reedy) [17:42:37] !log reedy Started scap: testwiki to 1.24wmf12 and build l10n cache [17:42:42] Logged the 
message, Master [17:42:52] (03PS2) 10Ottomata: Set zookeeper kafka client timeout to 16 seconds [operations/puppet] - 10https://gerrit.wikimedia.org/r/143918 [17:42:52] jgage: ^ one more :) [17:43:18] aude: wmf12? [17:43:48] ottomata, looking now :) [17:44:32] (03CR) 10Gage: [C: 032] Set zookeeper kafka client timeout to 16 seconds [operations/puppet] - 10https://gerrit.wikimedia.org/r/143918 (owner: 10Ottomata) [17:44:54] (03CR) 10Rush: "So to go another direction, I just upgraded a long stale precise host and it did have bad perms and puppet 2.7 to begin with, it is now 3." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 (owner: 10Yuvipanda) [17:45:04] (03CR) 10Andrew Bogott: "Yeah, agreed -- this hardware won't be decommed for a few months yet." [operations/dns] - 10https://gerrit.wikimedia.org/r/143209 (owner: 10Dzahn) [17:45:27] (03PS3) 10Ottomata: Set zookeeper kafka client timeout to 16 seconds [operations/puppet] - 10https://gerrit.wikimedia.org/r/143918 [17:45:50] (03CR) 10Ottomata: [C: 032 V: 032] Set zookeeper kafka client timeout to 16 seconds [operations/puppet] - 10https://gerrit.wikimedia.org/r/143918 (owner: 10Ottomata) [17:46:24] !log reloading librenms, semi-broke it with a syslog search [17:46:30] Logged the message, Master [17:47:21] (03CR) 10Dzahn: [C: 04-2] "what Tim said, only after toolserver admins approve" [operations/dns] - 10https://gerrit.wikimedia.org/r/143209 (owner: 10Dzahn) [17:47:51] (03Abandoned) 10Dzahn: delete anything 'toolserver' [operations/dns] - 10https://gerrit.wikimedia.org/r/143209 (owner: 10Dzahn) [17:48:03] (03CR) 10Andrew Bogott: [C: 032] Remove some obsolete gluster scripts: [operations/puppet] - 10https://gerrit.wikimedia.org/r/143753 (owner: 10Andrew Bogott) [17:51:14] ottomata: just checked ge-2/0/19.0 on asw-a-eqiad which is where 1021 is connected, I see many "link up" events in sequence (e.g. 
yesterday 22:43 utc) [17:51:18] (03PS2) 10Andrew Bogott: Add a default logfile to manage-nfs-volumes-daemon. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143763 [17:51:26] (03PS3) 10Andrew Bogott: Store a list of orphaned project volumes for later cleanup. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143765 [17:51:34] godog, hm, what's that mean? [17:51:42] link went down and then came back up? [17:51:51] i did reboot it yesterday [17:51:52] twice [17:51:54] (03PS1) 10Ottomata: Should be 16000, not 160000 for kafka zk timeouts [operations/puppet] - 10https://gerrit.wikimedia.org/r/143924 [17:51:56] oh [17:52:02] is that normal? [17:52:06] oh [17:52:07] right [17:52:17] (03CR) 10jenkins-bot: [V: 04-1] Should be 16000, not 160000 for kafka zk timeouts [operations/puppet] - 10https://gerrit.wikimedia.org/r/143924 (owner: 10Ottomata) [17:52:17] (sorry didn't notice jgage was talking) [17:52:18] ah, nevermind [17:52:20] yeah, [17:52:35] (03PS2) 10Ottomata: Should be 16000, not 160000 for kafka zk timeouts [operations/puppet] - 10https://gerrit.wikimedia.org/r/143924 [17:52:48] PROBLEM - puppet last run on analytics1025 is CRITICAL: CRITICAL: Complete puppet failure [17:52:58] PROBLEM - puppet last run on analytics1023 is CRITICAL: CRITICAL: Complete puppet failure [17:53:08] on it... [17:53:18] PROBLEM - puppet last run on analytics1024 is CRITICAL: CRITICAL: Complete puppet failure [17:53:55] !log reloading librenms, semi-broke it with a syslog search (again) [17:53:59] Logged the message, Master [17:54:15] I accidentally librenms [17:54:30] (03CR) 10Andrew Bogott: [C: 032] Add a default logfile to manage-nfs-volumes-daemon. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143763 (owner: 10Andrew Bogott) [17:54:35] (03CR) 10Andrew Bogott: [C: 032] Store a list of orphaned project volumes for later cleanup. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/143765 (owner: 10Andrew Bogott) [17:54:48] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Permissions should be set either one-off with salt (please no) or with a file resource in puppet, *not* with this exec." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 (owner: 10Yuvipanda) [17:54:59] (03PS9) 10BryanDavis: Allow puppetmaster to send reports to logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) [17:55:37] (03PS3) 10Ottomata: Should be 16000, not 160000 for kafka zk timeouts, and proper new zk param is sync_limit, not sync_time [operations/puppet] - 10https://gerrit.wikimedia.org/r/143924 [17:55:58] (03CR) 10jenkins-bot: [V: 04-1] Should be 16000, not 160000 for kafka zk timeouts, and proper new zk param is sync_limit, not sync_time [operations/puppet] - 10https://gerrit.wikimedia.org/r/143924 (owner: 10Ottomata) [17:56:08] (03PS4) 10Ottomata: Should be 16000, not 160000 for kafka zk timeouts, and proper new zk param is sync_limit, not sync_time [operations/puppet] - 10https://gerrit.wikimedia.org/r/143924 [17:57:13] Reedy: yes [17:57:49] https://integration.wikimedia.org/ci/job/mwext-Wikidata-testextensions-master/566/console [17:58:10] tough time debugging this, but fails wikibase repo or client + scribunto [17:58:18] RECOVERY - puppet last run on analytics1024 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:58:21] if i comment out lua parser tests, it passes [17:58:36] (03CR) 10Ottomata: [C: 032 V: 032] Should be 16000, not 160000 for kafka zk timeouts, and proper new zk param is sync_limit, not sync_time [operations/puppet] - 10https://gerrit.wikimedia.org/r/143924 (owner: 10Ottomata) [17:58:57] anomie: About? Can you have a look at Scribunto issues in wmf12 please? See what aude said above... 
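Editorial sketch for the 16000-versus-160000 correction above. A ZooKeeper server does not accept an arbitrary client session timeout: by default it bounds the requested value between 2 and 20 times `tickTime`. The function below models that rule; the `tickTime` of 2000 ms is an assumption (the log does not state the value in use), and the function name is hypothetical.

```python
def negotiated_timeout_ms(requested_ms, tick_time_ms, min_ms=None, max_ms=None):
    """Model the server-side bounds on a requested session timeout.

    Defaults follow ZooKeeper's documented behaviour: the minimum is
    2 * tickTime and the maximum is 20 * tickTime unless overridden.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


# Assumed tickTime of 2000 ms; not taken from the log.
print(negotiated_timeout_ms(16000, 2000))   # prints 16000
print(negotiated_timeout_ms(160000, 2000))  # prints 40000
```

Under that assumption the intended 16 seconds fits inside the allowed window, while the mistyped 160000 would have been capped by the server rather than honoured as written.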
[17:59:18] i don't know if wikibase and scribunto are interacting in a problematic way or what [17:59:28] or memory problem or somehting [17:59:40] * anomie looks [17:59:42] RECOVERY - puppet last run on analytics1023 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:59:52] RECOVERY - puppet last run on analytics1025 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:59:59] can give paste in a moment [18:00:04] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140703T1800) [18:00:34] http://pastie.org/9351166 [18:03:42] (03PS1) 10Ottomata: zoo.cfg is rendered by zookeeper class, need new params there [operations/puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/143926 [18:04:25] (03PS2) 10Ottomata: zoo.cfg is rendered by zookeeper class, need new params there [operations/puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/143926 [18:04:57] (03CR) 10Ottomata: [C: 032 V: 032] zoo.cfg is rendered by zookeeper class, need new params there [operations/puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/143926 (owner: 10Ottomata) [18:06:28] (03PS1) 10Ottomata: sync_limit param is on zookeeper class [operations/puppet] - 10https://gerrit.wikimedia.org/r/143927 [18:06:39] (03PS2) 10Ottomata: sync_limit param is on zookeeper class [operations/puppet] - 10https://gerrit.wikimedia.org/r/143927 [18:09:47] (03CR) 10Ottomata: [C: 032 V: 032] sync_limit param is on zookeeper class [operations/puppet] - 10https://gerrit.wikimedia.org/r/143927 (owner: 10Ottomata) [18:10:04] !log reedy scap aborted: testwiki to 1.24wmf12 and build l10n cache (duration: 27m 26s) [18:10:08] Logged the message, Master [18:10:12] Damn it [18:10:17] thought my ssh session had hung [18:10:19] !log reedy Started scap: testwiki to 1.24wmf12 and build l10n cache [18:10:48] should just be l10n cache to build [18:12:31] who's greg today? 
[18:13:05] chrismcmahon I believe [18:13:11] k [18:13:27] greg-g: admit it, you scheduled the trains on Thursday just to stress out devs presenting at monthly metrics :) [18:15:27] if i can't figure out the test failure, then i think we'll sign up for a slot on monday to deploy to test wikidata [18:15:29] Reedy: aude I am supposed to be greg today. [18:15:50] greg-g: are you actually here? [18:15:53] some problem with lua parser tests possibly due to interaction with wkibase [18:15:57] he said not [18:16:10] marked /away too [18:16:38] 08:05 * greg-g closes IRC and won't look until Sunday night [18:17:30] Reedy, aude: Where does the testing get its LocalSettings.php? [18:17:53] for me, my local settings [18:18:07] on jenkins, default for test-extensions-master .... [18:18:41] i am trying all combinations [18:18:48] !log doing rolling restarts of zookeeper servers and kafka brokers to load up new zk timeout changes [18:18:53] Logged the message, Master [18:19:53] aude: Is Extension:SyntaxHighlight_GeSHi installed, and Scribunto configured to use it? [18:20:11] jenkins uses default settings [18:20:25] * aude checking my install [18:21:01] that could be the issue [18:23:14] what is weird is that https://gerrit.wikimedia.org/r/#/c/143864/ passed earlier today [18:23:24] we are using the same commit of wikibase for our deployment build [18:26:17] Is someone running a maintenance script to do uplaods to commons? [18:26:26] wRaelBot: User 127.0.0.1 triggered filter ("Prevent massive uploads by brand new users") with the action "upload" on [[File:DETAIL VIEW OF OPERATOR'S CABIN, LOOKING WEST-NORTHWEST - Sacramento River Bridge, Spanning Sacramento River at California State Highway 275, Sacramento, Sacramento County, CA HAER CAL,34-SAC,58-16.tif]]; actions: tag. Diff: [18:26:46] Lots of warnings and alarm bells going off everywhere in the patrolling channels right now [18:26:46] anomie: that fixes for me... 
i had commented out all my extensions except wikibase, scribunto but had $wgScribuntoUseGeSHi = trueu [18:26:55] on jenkins no idea [18:27:09] could be different problem [18:27:28] jouncebot: next [18:27:28] In 4 hour(s) and 32 minute(s): SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140703T2300) [18:27:33] jouncebot: reload [18:28:21] jouncebot: refresh [18:28:23] I refreshed my knowledge about deployments. [18:28:29] aude: I see the log for https://integration.wikimedia.org/ci/job/mwext-Wikidata-testextensions-master/566/ seems to have died on a different test? [18:28:34] jouncebot: next [18:28:34] In 1 hour(s) and 31 minute(s): scap updates (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140703T2000) [18:28:40] yes [18:29:11] ls [18:29:41] Reedy: [18:29:53] I'm not [18:30:03] The localhost uploads are atributed to Fae [18:30:05] https://commons.wikimedia.org/wiki/File:Historic_American_Buildings_Survey_Photographed_by_Henry_F._Withey_December_1936_NAVE_TOWARD_CHANCEL_-_Mission_San_Diego_de_Alcala,_Misson_Valley_Road,_San_Diego,_San_Diego_HABS_CAL,37-SANDI,1-10.tif [18:30:12] Something is happening alright [18:30:41] Or is our logging just broken that it emits an RC event saying it's from 127.0.0.1 [18:31:32] https://commons.wikimedia.org/w/index.php?title=Special:AbuseLog&wpSearchUser=127.0.0.1 [18:31:52] This happened once before in January [18:31:56] https://commons.wikimedia.org/wiki/Special:Contributions/127.0.0.1 [18:32:08] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2175.91655724 [18:32:11] gwtoolset bug [18:32:13] Regression in UploadWizard / upload pool logic? [18:32:49] top - 18:32:36 up 73 days, 7 min, 2 users, load average: 9.78, 10.03, 8.40 [18:32:52] fenari is busy [18:33:18] I'm too busy right now, I'll pretend I can't see this. 
have fun whoever feels like looking into this, commons will be eternally greatful. [18:34:37] chasemp: hey! sorry, was in a meeting [18:34:45] Krinkle: I'll take a look through the code to see if anything obvious is sticking out [18:35:10] _joe_: not setting it with a file {} resource because there's already a file {} with /var/lib/puppet in puppetmaster.pp (or something) [18:35:18] chasemp: bad perms in all of labs [18:39:48] chasemp: tools-dev.eqiad.wmflabs (or tools-dev.wmflabs.org) if you want a sample [18:40:05] is that on it's own puppetmaster? [18:40:30] not in that project seems to bounce me [18:40:43] PROBLEM - Kafka Broker Messages In on analytics1012 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [18:40:59] ottomata? [18:41:06] !log reedy Finished scap: testwiki to 1.24wmf12 and build l10n cache (duration: 30m 47s) [18:41:12] Logged the message, Master [18:41:17] ja i'm on it [18:41:21] (03CR) 10Yuvipanda: "This sets the perms in labs to be the same as in prod everywhere, so I don't see how this is a security issue? Hopefully I didn't set the " [operations/puppet] - 10https://gerrit.wikimedia.org/r/143861 (owner: 10Yuvipanda) [18:41:22] k just checkin [18:41:30] i'm doing rolling restarts, which means that I Have to wait for a broker to come back into all ISRs before I do an election [18:41:31] chasemp: no, not on own puppetmaster [18:41:33] i'm on the last one now... [18:41:49] that took about 18 minutes for me yesterday [18:41:52] chasemp: am pretty sure it's all of labs since I see no incoming stats in graphite for hosts other than the ones where I experimentally changed perms [18:42:01] yeah, the longer it is down the longer it takes [18:42:07] its taking about 7 or 8 for me right now [18:42:09] chasemp: ah, can you tell me your wikitech username so I can give you access? 
[18:42:11] but i'm not rebooting the hosts [18:42:12] just kafka [18:42:13] rush [18:42:43] RECOVERY - Kafka Broker Messages In on analytics1012 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 3235.06224566 [18:43:16] <_joe_> YuviPanda: I don't really care, abstract that resource in some way if you prefer, or rethink the way you do this check, or think if we need to collect that info via diamond (in prod, we don't). I will -2 an exec of chown via puppet anyways :) [18:43:46] _joe_: eat dinner! [18:43:49] heh [18:44:04] <_joe_> ori: eheh [18:44:14] <_joe_> YuviPanda: you know I'm right :) [18:44:45] _joe_: I'll just write the hacky puppet bits that spawn out a process that cats things out as 'puppet' from the diamond user instead [18:44:50] IMO much more hacky than this [18:44:58] and even weirder, since the permissions are 'correct' in prod [18:45:03] apparently, and only fucked up in labs. [18:45:16] I think teh right thing is to figure out why perms are bad in labs [18:45:24] I agree, I didn't know that they were ok in prod [18:45:45] <_joe_> ok so, where are those perms bad in lab? everywhere? 
[18:46:49] _joe_: yeah [18:47:13] <_joe_> mmmh [18:48:11] (03CR) 10Reedy: [C: 032] Wikipedias to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143916 (owner: 10Reedy) [18:48:21] <_joe_> I'd say they're wrong in prod :P [18:48:24] (03Merged) 10jenkins-bot: Wikipedias to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143916 (owner: 10Reedy) [18:48:34] _joe_: haha :) [18:48:52] (03PS2) 10Reedy: Remove extension-list-1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142265 [18:48:58] (03CR) 10Reedy: [C: 032] Remove extension-list-1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142265 (owner: 10Reedy) [18:49:10] (03Merged) 10jenkins-bot: Remove extension-list-1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142265 (owner: 10Reedy) [18:49:30] <_joe_> btw, on a random mw host in prod, the dir is unreadable from users not in the puppet group [18:49:40] _joe_: oh? chasemp just told me otherwise [18:49:43] <_joe_> YuviPanda: back to square one, though [18:49:45] is it... random?! [18:49:48] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedia to 1.24wmf11 [18:49:53] Logged the message, Master [18:49:58] _joe_: which host, I checked a few new trusty installs [18:50:00] <_joe_> YuviPanda: I don't think so, anyways [18:50:03] PROBLEM - Kafka Broker Messages In on analytics1018 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [18:50:06] <_joe_> chasemp: mw1150 [18:50:15] <_joe_> what do you want exactly to achieve? [18:50:22] <_joe_> YuviPanda: ^^ [18:50:33] <_joe_> meaning, what do you want to measure? [18:50:36] <_joe_> and why? [18:50:49] _joe_: most specifically, log last puppet run delta in graphite, since we've no way in labs of monitoring puppet freshness. 
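Editorial sketch of the "last puppet run delta" metric mentioned above. It assumes the summary file has already been parsed into a dictionary and that it carries a `time` section with a `last_run` epoch timestamp; that key layout and both function names are assumptions for illustration, not the collector from the patch under review.

```python
import time


def seconds_since_last_run(summary, now=None):
    """Return how many seconds have passed since the recorded puppet run."""
    last_run = float(summary["time"]["last_run"])
    current = time.time() if now is None else now
    return current - last_run


def is_stale(summary, max_age_secs, now=None):
    """True when the last run is older than the allowed age."""
    return seconds_since_last_run(summary, now) > max_age_secs


# Hypothetical parsed summary; the timestamp is made up.
example = {"time": {"last_run": 1404413100}}
print(seconds_since_last_run(example, now=1404413700))  # prints 600.0
```

Passing `now` explicitly keeps the calculation testable; a real collector would omit it and publish the delta as a gauge.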
[18:51:01] this I am doing because we're going to experiment with alerts based on graphite polling soon [18:51:03] RECOVERY - Kafka Broker Messages In on analytics1018 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1523.17718021 [18:51:06] so that's precise with puppet 3.4 [18:51:09] perms look ok? 4 -rw-r--r-- 1 root root 561 Jul 3 18:45 /var/lib/puppet/state/last_run_summary.yaml [18:51:11] <_joe_> oh ok so you want to know if puppet is not running on a specific host in labs [18:51:13] _joe_: ^ [18:51:14] <_joe_> right? [18:51:28] _joe_: pretty much, yeah. in graphite. [18:51:34] also they want to track run times [18:51:35] chasemp: oh, did you check /var/lib/puppet? [18:51:41] <_joe_> YuviPanda: why on earth in graphite? [18:51:47] yeah, run times is good as well, but that's not my primary motivation [18:51:59] <_joe_> chasemp: /var/lib/puppet [18:52:00] 541 PHP Warning: Parameter 4 to Language::sprintfDate() expected to be a reference, value given in /usr/local/apache/common-local/php-1.24wmf11/includes/StubObject [18:52:00] .php on line 105 [18:52:01] ah ok, yes bogus [18:52:04] got it [18:52:11] _joe_: 'alerts based on graphite polling'? Icinga on labs has too many issues to be useful. [18:52:14] I hate you puppet [18:52:14] ok, jgage + others, i'm going to start takign down the analyics cluster for cdh5 reinstall [18:52:24] i will decomission the hadoop nodes in icinga, etc. fist [18:52:27] first [18:52:34] toollabs has no monitoring at the moment, we hear of trivial failures (puppet not run, disk full, host not responding) from users on mailing lists and IRC. 
Shouldn't be the case [18:52:39] <_joe_> YuviPanda: there are other ways to be notified if a host is failing puppet [18:53:12] <_joe_> YuviPanda: so it's toollabs, not labs even [18:53:15] _joe_: right, but if I put in graphite I can use the same thing to alert for all of them (disk space, host not responding, 5xx rates, etc) [18:53:28] YuviPanda: did you say we are managing manually /var/lib/puppet already? [18:53:29] _joe_: well, toollabs and betalabs are the ones I'm concerned about, yes. [18:53:36] <_joe_> YuviPanda: put a big if $::realm == 'labs' around that [18:53:38] chasemp: yeah, but only on puppetmasters [18:53:42] ok [18:53:43] <_joe_> I will still not agree [18:53:58] <_joe_> I mean arount the whole check [18:54:04] _joe_: chasemp now that the prod / labs perms issue is a red herring, how about the older solution? [18:54:05] <_joe_> and whatever else you want to do [18:54:22] <_joe_> YuviPanda: I don't want to collect that metric in prod [18:54:24] let diamond sudo to puppet to be able to cat just that one particular file, and the collector can do that. [18:54:26] _joe_: is not runtimes for puppet in graphite useful? [18:54:26] <_joe_> it's wrong [18:54:44] _joe_: this class is included only in labs. [18:54:57] <_joe_> YuviPanda: including the chmod? [18:55:07] _joe_: yup, role/labs.pp [18:55:22] <_joe_> YuviPanda: however, let me show you the right way to do what you want :) [18:55:34] _joe_: <3 :) [18:55:53] _joe_: also ori has vowed to -2 any realm branching in things that aren't roles [18:56:20] and I agree with that as well [18:56:46] <_joe_> me too [18:57:13] <_joe_> but I can strongarm ori in exchange for some future +1s [18:57:21] <_joe_> :P [18:57:37] hehe, feels... wrong :) [18:57:46] thoughts on the sudo solution instead? [18:57:56] I also can't believe other people haven't run into this exact issue, nothing on google [18:59:48] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:59:48] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:48] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:58] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:59:58] PROBLEM - check configured eth on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:08] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:08] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:28] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:35] poor tungsten [19:00:48] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [19:00:58] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:01:28] RECOVERY - DPKG on tungsten is OK: All packages OK [19:01:58] RECOVERY - check configured eth on tungsten is OK: NRPE: Unable to read output [19:01:58] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient [19:02:03] <_joe_> YuviPanda: the fact that permissions are like that is expected. [19:02:52] _joe_: right. [19:02:57] <_joe_> I was trying to find a way to make those permissions more liberal, but I don't see how without introducing some form of security issue [19:02:58] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 765 seconds ago with 0 failures [19:02:58] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning. [19:03:24] <_joe_> YuviPanda: so, I'm trying to figure out a permissions set that should work and not screw everything up [19:03:38] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning. 
[19:03:38] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning. [19:03:38] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical [19:03:50] _joe_: don't the sensitive bits inside have their own restrictive perms? [19:04:08] <_joe_> YuviPanda: that's what I'm checking :) [19:04:30] _joe_: ah, ok :) [19:05:10] <_joe_> oh my [19:07:14] _joe_: another option, I can just add code that 1. lets diamond sudo to puppet for one command only (cat ) 2. have the python code shell out with sudo as puppet to just read that, instead of open() [19:07:17] but is... ugly. [19:08:00] (PS1) Ori.livneh: varnish: remove default Content-Type coercion [operations/puppet] - https://gerrit.wikimedia.org/r/143940 [19:08:13] <_joe_> YuviPanda: that is the other solution [19:08:29] indeed, but it's... ugly. the ssl stuff seems to be properly permissioned [19:08:30] <_joe_> which is klunky, nothing else [19:08:35] <_joe_> it is [19:08:54] and most directories inside lib/puppet seem to have +x set as well [19:09:26] (which is also why I initially suspected this to be an upgrade issue or something, not by design) [19:09:40] <_joe_> YuviPanda: so, lemme see, we want /var/lib/puppet to be 751 everywhere but on the puppetmasters, where we want it to be puppet:root, correct? [19:09:49] <_joe_> YuviPanda: I don't think so at all [19:09:59] yeah, I don't think it's an upgrade issue now. [19:10:10] <_joe_> ok the correct path of action would be [19:10:17] _joe_: 751 and owner:group are orthogonalish, no? [19:10:18] <_joe_> we change permissions in the deb package [19:10:20] (CR) jenkins-bot: [V: -1] varnish: remove default Content-Type coercion [operations/puppet] - https://gerrit.wikimedia.org/r/143940 (owner: Ori.livneh) [19:10:29] indeed. [19:10:37] I completely agree! :D [19:10:43] but do we have ensure => latest? 
[19:10:47] <_joe_> yes [19:10:48] <_joe_> but [19:10:59] <_joe_> on trusty, we use ubuntu's packages [19:11:06] <_joe_> on precise, I backported them [19:11:23] <_joe_> It's a bit of work [19:11:38] right. I can raise this upstream as well. [19:11:42] (CR) Ori.livneh: "Chris, Tim: Brandon suggested that we remove this code based on a review of the logs for the past ten days. I recall that you two were con" [operations/puppet] - https://gerrit.wikimedia.org/r/143940 (owner: Ori.livneh) [19:11:53] _joe_: is changing it for the backport a bit of work as well? [19:12:06] <_joe_> YuviPanda: eh. [19:12:13] err [19:12:13] I mean [19:12:16] just the backport [19:12:20] oh [19:12:23] nevermind, I'm an idiot [19:12:32] <_joe_> YuviPanda: backporting meant rebuilding some 10 pakages [19:12:40] !log reedy Synchronized php-1.24wmf11/languages/Language.php: I039547b867b2eab47692dcc018c95b89975bc65d (duration: 00m 40s) [19:12:46] Logged the message, Master [19:12:48] _joe_: yeah, that just hit me right before my 'oh' [19:12:48] <_joe_> hopefully, this time would just be the puppet one :) [19:12:59] :) [19:13:14] <_joe_> but I'm still convinced you can go the sudo path now [19:13:17] <_joe_> just in labs [19:13:28] _joe_: right, this role is included just in labs. [19:13:44] <_joe_> sorry to be such a PITA [19:14:00] <_joe_> but I really want to keep antipatterns from spewing in the code [19:14:09] _joe_: :) I understand, and frankly I don't think I'd learn much without people being PITAs [19:14:12] <_joe_> and a chmod exec from puppet is plainly wrong [19:14:34] (PS2) Reedy: group0 to 1.24wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143917 [19:14:36] _joe_: so, to summarize: 1. proper solution is to change this in the debs, 2. for now we will kludge python, 3. 
this is on labs only (which it is) [19:14:39] (CR) jenkins-bot: [V: -1] group0 to 1.24wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143917 (owner: Reedy) [19:14:40] <_joe_> YuviPanda: you could also sidetrack me and use an ensure_resource('file',...) [19:14:43] <_joe_> ;) [19:14:44] at this point is it less work to have puppet copy a user readable copy of it's status file somewhere we are ok with people querying at the end of the run [19:14:49] I mean for the meantime part [19:15:04] <_joe_> YuviPanda: also, labs puppetmasters do have some is_puppet_master variable somewhere [19:15:06] (CR) Reedy: [C: 2 V: 2] group0 to 1.24wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143917 (owner: Reedy) [19:15:15] <_joe_> chasemp: or that [19:15:27] <_joe_> chasemp: yeah we can do that at stage::last [19:15:34] I like making a shareable copy more than mucking up the existing perms [19:15:36] _joe_: yeah, but I guess I'll have to use ensure_resource in both the labs puppetmasters as well. [19:15:36] I guess [19:15:38] <_joe_> if we're sure it gets written to before [19:15:40] if that's how it has to go down [19:15:49] <_joe_> YuviPanda: you won't [19:15:57] oh? [19:16:07] <_joe_> YuviPanda: you define your puppetmaster class in labs to go before => diamond [19:16:09] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf12 [19:16:11] does ensure_resource(file) not conflict with file{} [19:16:14] Logged the message, Master [19:16:16] <_joe_> if that's possible [19:16:20] _joe_: but this is base puppetmaster, I don't think it's just for labs [19:16:36] let me verifyu [19:16:38] <_joe_> YuviPanda: yes, ok, that class will be included somewhere [19:16:40] <_joe_> right? [19:16:45] aaah [19:16:50] right, that makes sense. 
[19:17:01] <_joe_> anyway, you can either use the $is_puppet_master as a guard [19:17:16] <_joe_> or chain the puppet master class for labs to go before diamond [19:17:32] i think we need to go through the entire /h/w/bin on fenari and agree what can be deleted, what needs to be puppetized and what is already on tin [19:17:34] <_joe_> so you're sure the ensure_resource there works [19:17:48] (PS2) Reedy: Remove extension-list-1.24wmf11 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142266 [19:17:49] <_joe_> and now, I'm off for good for some time [19:17:52] (CR) Reedy: [C: 2] Remove extension-list-1.24wmf11 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142266 (owner: Reedy) [19:17:58] _joe_: :) ok [19:17:59] (Merged) jenkins-bot: Remove extension-list-1.24wmf11 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142266 (owner: Reedy) [19:18:17] <_joe_> YuviPanda: comment everything or in 3 months we'll bleed from the eyes when we see those dependencies :) [19:18:19] just some examples to give you an idea: ./bin/kill-stuck-apaches , ./bin/apache-restart-all-hard ./l10update2 ./update-special-pages-small ./wikiuser_pass_real :P [19:18:41] _joe_: oh yeah, comments large enough to send machines into swap... 
:) [19:19:54] (PS1) Reedy: Only load Nostalgia skin from $IP/skins/Nostalgia/Nostalgia.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144002 [19:20:02] /home/wikipedia/bin/fail/reboot wtf :) [19:20:17] (CR) Reedy: [C: 2] Only load Nostalgia skin from $IP/skins/Nostalgia/Nostalgia.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144002 (owner: Reedy) [19:20:19] _joe_ YuviPanda thanks for hashing that guys, sorry I flagged out there, got pulled to the side [19:20:22] _joe_: chasemp btw, puppetmaster/manifests/ssl.pp defines /var/lib/puppet to be 0751 [19:20:23] (Merged) jenkins-bot: Only load Nostalgia skin from $IP/skins/Nostalgia/Nostalgia.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144002 (owner: Reedy) [19:20:25] hopefully this is workable [19:20:28] (CR) Chad: "Long as it's on all wikis now :)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144002 (owner: Reedy) [19:21:05] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s) [19:21:09] Logged the message, Master [19:22:30] (CR) Reedy: "Yup :)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144002 (owner: Reedy) [19:24:53] (PS2) Reedy: Add VisualEditor to Wikimania 2015 wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143659 (owner: Withoutaname) [19:24:58] (CR) Reedy: [C: 2] Add VisualEditor to Wikimania 2015 wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143659 (owner: Withoutaname) [19:26:10] Yay. 
[19:26:16] (Merged) jenkins-bot: Add VisualEditor to Wikimania 2015 wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143659 (owner: Withoutaname) [19:26:46] (PS2) Reedy: Enable TemplateData GUI for English, French and Italian Wikipedias [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143360 (https://bugzilla.wikimedia.org/67376) (owner: Jforrester) [19:26:55] (CR) Reedy: [C: 2] Enable TemplateData GUI for English, French and Italian Wikipedias [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143360 (https://bugzilla.wikimedia.org/67376) (owner: Jforrester) [19:27:05] !log disabling puppet on hadoop related analytics nodes, preparing for reinstall [19:27:10] Logged the message, Master [19:27:41] (Merged) jenkins-bot: Enable TemplateData GUI for English, French and Italian Wikipedias [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143360 (https://bugzilla.wikimedia.org/67376) (owner: Jforrester) [19:27:55] (CR) Reedy: "11 and 12 as of today" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) (owner: Reedy) [19:28:25] (CR) Reedy: "is this waiting on anything in particular?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: Withoutaname) [19:30:03] (CR) Jforrester: "Go for it." 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: Withoutaname) [19:30:10] (PS3) Reedy: Kill $wgEnableNewpagesUserFilter [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141100 (https://bugzilla.wikimedia.org/58932) (owner: TTO) [19:30:16] (CR) Reedy: [C: 2] Kill $wgEnableNewpagesUserFilter [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141100 (https://bugzilla.wikimedia.org/58932) (owner: TTO) [19:30:56] (Merged) jenkins-bot: Kill $wgEnableNewpagesUserFilter [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141100 (https://bugzilla.wikimedia.org/58932) (owner: TTO) [19:30:58] Krinkle: https://gerrit.wikimedia.org/r/139569 Can that just go out? [19:30:59] (PS1) Matanya: platform: simplify hardware specific configuration [operations/puppet] - https://gerrit.wikimedia.org/r/144033 [19:31:12] Reedy: No [19:31:18] Definitely not any time soon (months) [19:31:24] Krinkle: On test wikis? [19:31:27] Why is that even being worked on? [19:31:34] :P [19:31:49] It needs -1/-2/abandoning then [19:31:52] (PS2) Reedy: Tweak wgPropertySuggesterMinProbability to 0.71, from 0.8 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143871 (owner: Aude) [19:31:55] James_F: Maybe on test2 wiki [19:31:57] (CR) Reedy: [C: 2] Tweak wgPropertySuggesterMinProbability to 0.71, from 0.8 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143871 (owner: Aude) [19:32:05] (Merged) jenkins-bot: Tweak wgPropertySuggesterMinProbability to 0.71, from 0.8 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143871 (owner: Aude) [19:32:15] but doing it mw.org is just irresponsible at this point. [19:32:27] user groups of global users will just flat out break. We haven't even began emitting deprecation notices for these [19:32:27] Krinkle: Phase 0 seems fine, surely. 
[19:32:27] Krinkle: How else will gadget authors find the issues? [19:32:40] Not this way, not now [19:32:44] (PS2) Reedy: Re-enable the anonymous signup invite experiment [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143614 (owner: Phuedx) [19:32:56] (CR) Reedy: [C: 2] Re-enable the anonymous signup invite experiment [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143614 (owner: Phuedx) [19:33:03] (CR) Jforrester: [C: -1] Disable $wgLegacyJavaScriptGlobals for "group0" wikis (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: Withoutaname) [19:33:05] (Merged) jenkins-bot: Re-enable the anonymous signup invite experiment [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143614 (owner: Phuedx) [19:33:05] Krinkle: OK. [19:34:38] (PS2) Reedy: Allow subpages in template namespace of frwikiversity [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142980 (https://bugzilla.wikimedia.org/57487) (owner: TTO) [19:34:42] (CR) Reedy: [C: 2] Allow subpages in template namespace of frwikiversity [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142980 (https://bugzilla.wikimedia.org/57487) (owner: TTO) [19:34:59] (Merged) jenkins-bot: Allow subpages in template namespace of frwikiversity [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142980 (https://bugzilla.wikimedia.org/57487) (owner: TTO) [19:35:14] (PS3) Reedy: Add suppressredirect user group for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142979 (https://bugzilla.wikimedia.org/67278) (owner: TTO) [19:35:18] (CR) Reedy: [C: 2] Add suppressredirect user group for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142979 (https://bugzilla.wikimedia.org/67278) (owner: TTO) [19:35:26] (Merged) jenkins-bot: Add suppressredirect user group 
for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142979 (https://bugzilla.wikimedia.org/67278) (owner: TTO) [19:35:53] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly [19:36:13] !log rebooting analytics1012 for bios change: cpufreq governor [19:36:15] (CR) Krinkle: [C: -1] "Gadget authors have not yet been given a fair chance to migrate their scripts. Disabling this now will needlessly upset users and break ac" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: Withoutaname) [19:36:19] Logged the message, Master [19:36:20] (CR) Krinkle: "Gadget authors have not yet been given a fair chance to migrate their scripts. Disabling this now will needlessly upset users and break ac" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: Withoutaname) [19:36:23] PROBLEM - Puppet freshness on cp1037 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 19:33:46 UTC [19:36:29] (CR) Krinkle: "https://bugzilla.wikimedia.org/show_bug.cgi?id=65011#c8" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: Withoutaname) [19:36:31] (CR) Reedy: [C: -1] "Needs rebasing" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/120171 (owner: Jforrester) [19:36:45] (PS1) Dzahn: firewall for phablegal [operations/puppet] - https://gerrit.wikimedia.org/r/144036 [19:37:06] (PS1) Mattflaschen: Revert "Re-enable the anonymous signup invite experiment" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144037 [19:37:19] (PS4) Reedy: New namespace "Carte" for rowiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139766 (https://bugzilla.wikimedia.org/66530) (owner: Withoutaname) [19:37:35] (CR) Reedy: [C: 2] New namespace "Carte" 
for rowiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139766 (https://bugzilla.wikimedia.org/66530) (owner: Withoutaname) [19:37:42] namespaces à la carte [19:37:43] (CR) Mattflaschen: "Sorry for the confusion, we were not aware that changes like this are at risk of being deployed routinely absent a -1." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143614 (owner: Phuedx) [19:37:47] James_F: Left details on the bug [19:38:02] (Merged) jenkins-bot: New namespace "Carte" for rowiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139766 (https://bugzilla.wikimedia.org/66530) (owner: Withoutaname) [19:38:23] PROBLEM - Puppet freshness on cp1037 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 19:33:46 UTC [19:38:25] (CR) Matanya: [C: 1] firewall for phablegal [operations/puppet] - https://gerrit.wikimedia.org/r/144036 (owner: Dzahn) [19:38:27] (CR) Mattflaschen: [C: 2] "Not ready for deployment." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144037 (owner: Mattflaschen) [19:38:36] (Merged) jenkins-bot: Revert "Re-enable the anonymous signup invite experiment" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144037 (owner: Mattflaschen) [19:38:39] (PS6) Reedy: Enable VisualEditor on OTRS_wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/100927 (owner: Jforrester) [19:38:45] (CR) Reedy: [C: 2] Enable VisualEditor on OTRS_wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/100927 (owner: Jforrester) [19:38:49] (PS5) Jforrester: Create a dblist for non-Beta Features wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/120171 [19:38:59] (Merged) jenkins-bot: Enable VisualEditor on OTRS_wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/100927 (owner: Jforrester) [19:39:01] Ta. [19:39:08] Reedy: 120171 is good to go too. 
[19:39:09] (CR) Rush: [C: 1] firewall for phablegal [operations/puppet] - https://gerrit.wikimedia.org/r/144036 (owner: Dzahn) [19:39:15] Reedy, is someone going to sync mediawiki-config soon? [19:39:28] I want to make sure the revert (https://gerrit.wikimedia.org/r/#/c/144037/) went out if the original commit did. [19:39:41] (PS3) Reedy: Restore defaults for nowikibooks bureaucrats [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139428 (https://bugzilla.wikimedia.org/42105) (owner: Withoutaname) [19:39:42] There wasn't any sync yet AFAICS [19:39:47] Nope [19:40:21] Thanks for confirming. [19:40:32] (CR) Reedy: [C: 2] Restore defaults for nowikibooks bureaucrats [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139428 (https://bugzilla.wikimedia.org/42105) (owner: Withoutaname) [19:40:40] (Merged) jenkins-bot: Restore defaults for nowikibooks bureaucrats [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139428 (https://bugzilla.wikimedia.org/42105) (owner: Withoutaname) [19:40:56] (PS6) Reedy: Create a dblist for non-Beta Features wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/120171 (owner: Jforrester) [19:40:59] (CR) Reedy: [C: 2] Create a dblist for non-Beta Features wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/120171 (owner: Jforrester) [19:41:09] (Merged) jenkins-bot: Create a dblist for non-Beta Features wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/120171 (owner: Jforrester) [19:41:27] Thanks Reedy. 
[19:41:35] (PS3) Reedy: Disable local uploads where unused, per local consensus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143825 (https://bugzilla.wikimedia.org/67453) (owner: Nemo bis) [19:41:40] (CR) Reedy: [C: 2] Disable local uploads where unused, per local consensus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143825 (https://bugzilla.wikimedia.org/67453) (owner: Nemo bis) [19:41:49] (Merged) jenkins-bot: Disable local uploads where unused, per local consensus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143825 (https://bugzilla.wikimedia.org/67453) (owner: Nemo bis) [19:42:02] Reedy: Think we should abandon https://gerrit.wikimedia.org/r/#/c/109458/ ? [19:42:54] (PS2) Reedy: remove unused configuration [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141601 (owner: Awight) [19:43:06] James_F: I think so yeah [19:43:13] If anywhere, it's going on labs as it stands.. [19:43:26] * James_F nods. [19:43:42] (Bah, I can't abandon as I'm not a deployer. Sigh.) 
[19:43:47] (CR) Reedy: [C: 2] remove unused configuration [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141601 (owner: Awight) [19:43:53] (Merged) jenkins-bot: remove unused configuration [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141601 (owner: Awight) [19:45:10] (PS1) Rush: don't require second approval for mediawiki accounts [operations/puppet] - https://gerrit.wikimedia.org/r/144045 [19:45:38] (CR) Reedy: [C: -1] Delete ve.wikimedia.org and leave redirect (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737) (owner: Withoutaname) [19:45:53] (Abandoned) Reedy: Set up wiki.toolserver.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/109458 (https://bugzilla.wikimedia.org/60222) (owner: Tim Landscheidt) [19:47:24] !log reedy Synchronized database lists: (no message) (duration: 00m 20s) [19:47:30] Logged the message, Master [19:47:50] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly [19:48:06] !log reedy Synchronized docroot and w: (no message) (duration: 00m 30s) [19:48:10] Logged the message, Master [19:48:34] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s) [19:48:39] (PS2) Jforrester: Organize BetaFeatures labs overrides cleanly [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/120168 (owner: Spage) [19:49:04] heh [19:49:30] (CR) Jforrester: [C: 1] Organize BetaFeatures labs overrides cleanly [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/120168 (owner: Spage) [19:50:30] (PS3) Reedy: Organize BetaFeatures labs overrides cleanly [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/120168 (owner: Spage) [19:50:35] (CR) Reedy: [C: 2] Organize BetaFeatures labs overrides cleanly [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/120168 (owner: Spage) 
[19:50:42] (Merged) jenkins-bot: Organize BetaFeatures labs overrides cleanly [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/120168 (owner: Spage) [19:51:22] (CR) Reedy: [C: -1] "Per not to be deployed yet" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143750 (owner: MarkTraceur) [19:51:42] (CR) Rush: [C: 2 V: 2] don't require second approval for mediawiki accounts [operations/puppet] - https://gerrit.wikimedia.org/r/144045 (owner: Rush) [19:52:31] (CR) Reedy: [C: -1] "Needs rebasing" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115880 (owner: Nemo bis) [19:52:55] I think that's about it... [19:53:10] RECOVERY - Puppet freshness on cp1037 is OK: puppet ran at Thu Jul 3 19:53:05 UTC 2014 [19:53:16] hey cmjohnson1, yt? [19:53:27] !log reedy Synchronized wmf-config/InitialiseSettings-labs.php: (no message) (duration: 00m 14s) [19:53:31] Logged the message, Master [19:53:37] * Nemo_bis tests [19:53:53] ottomata: i am [19:54:12] It worked [19:54:21] (PS1) Rush: update phab tag for legalpad [operations/puppet] - https://gerrit.wikimedia.org/r/144048 [19:54:30] i'm resinstalling the current hadoop boxes [19:54:34] getting this on analytics1020 [19:54:36] (CR) Rush: [C: 2 V: 2] update phab tag for legalpad [operations/puppet] - https://gerrit.wikimedia.org/r/144048 (owner: Rush) [19:54:38] Warning! Power management firmware initialization error. [19:54:38] Disconnect and reconnect system input power. [19:54:38] Strike the F1 key to continue, F2 to run the system setup program [19:56:54] reinstalling hadoop? good time to merge volunteer lint change ? https://gerrit.wikimedia.org/r/#/c/142543/ [19:57:07] ottomata...can you hit f1 for now I am in Tampa and will not be back until Monday to troubleshoot [19:57:12] we have 5 branches in operations/puppet [19:57:31] mutante: have time to review a patch ? 
[19:57:42] production, analytics/kraken, sandbox/hashar/zuulintegration, test and testlabs/jenkins [19:58:00] matanya: depends ?:) [19:58:19] https://gerrit.wikimedia.org/r/144033 [19:58:31] it seems simple enough to me [19:58:39] but i might be wrong [19:59:37] chrismcmalunch: https://wikitech.wikimedia.org/wiki/Deployments#Week_of_July_7th [20:00:05] bd808: The time is nigh to deploy scap updates (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140703T2000) [20:00:13] ok coool, i just wan't sure what that meant, but can do [20:00:27] analytics/kraken can certainly be scrapped, mutante [20:00:27] assume we figure out problem we want to deploy on monday [20:00:46] test2 and test wikidata [20:01:16] (CR) Withoutaname: Delete ve.wikimedia.org and leave redirect (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737) (owner: Withoutaname) [20:01:20] think i have some idea [20:01:46] ottomata: it means there is a problem with the firmware version or the power supply itself [20:02:05] ok cool, so node is fine, just power issue that shoudl be fixed [20:02:06] cool [20:02:47] matanya: sorry, i dont wanna touch that now (modules/base etc...) [20:02:57] fair enough [20:03:14] mutante: maybe https://gerrit.wikimedia.org/r/#/c/143831/ then? :) [20:03:16] breaking puppet before long vacation isn't smart [20:03:51] i have already given a +1, now find another to merge and you can call it consensus:) [20:03:57] legoktm: [20:03:58] ah, ok [20:05:28] mutante: i rebased it, if you don't mind +1 again [20:05:48] (CR) Withoutaname: "Just making sure, for this change we are only doing bug 65011." 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011) (owner: Withoutaname) [20:05:55] (CR) Dzahn: [C: 1] terbium: install python-mysqldb [operations/puppet] - https://gerrit.wikimedia.org/r/143831 (owner: Matanya) [20:06:03] yea, i know that part about gerrit can be annoying [20:06:15] (how votes disappear after rebase) [20:06:20] yes [20:06:21] but it could technically also be broken [20:07:01] i think the unwritten policy is that they still count [20:08:26] chasemp: sure the phabricator.base-uri should be http (not https or without protocol) ? [20:09:10] Reedy: All clear for me to update the scap scripts? [20:10:33] <^d> mutante: They should be copied on trivial rebase now. [20:10:47] <^d> If it was more complicated and had to be manually resolved, you're out of luck :\ [20:10:49] bd I think so [20:10:55] coolio [20:11:01] tab fail there [20:11:24] ^d: ah! well, that makes sense i'd say, if it was a manual rebase it can be broken [20:11:35] but if it was just hitting the UI button ...should be fine [20:11:43] <^d> Yeah it should copy then. [20:11:48] cool [20:11:50] <^d> Also on just updating commit summary, I think we set. 
[20:11:58] nice
[20:12:46] !log Updated scap to ff04431 (restart-nutcracker script)
[20:12:51] Logged the message, Master
[20:13:12] ori, _joe_: ^ restart-nutcracker is on the cluster now
[20:13:59] chasemp: ignore, i was referring to the unmerged change
[20:15:54] (PS2) Withoutaname: Disable $wgLegacyJavaScriptGlobals for test2wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011)
[20:19:39] (PS2) Rush: firewall for phablegal [operations/puppet] - https://gerrit.wikimedia.org/r/144036 (owner: Dzahn)
[20:19:46] (CR) Rush: [C: 2 V: 2] firewall for phablegal [operations/puppet] - https://gerrit.wikimedia.org/r/144036 (owner: Dzahn)
[20:21:38] (PS2) Dzahn: Phabricator for iridium [operations/puppet] - https://gerrit.wikimedia.org/r/142059 (owner: Rush)
[20:23:18] (PS3) Dzahn: Phabricator for iridium [operations/puppet] - https://gerrit.wikimedia.org/r/142059 (owner: Rush)
[20:30:37] (PS4) Dzahn: Phabricator for iridium [operations/puppet] - https://gerrit.wikimedia.org/r/142059 (owner: Rush)
[20:31:00] (CR) BryanDavis: "You should be able to apply this change right now via:" [operations/puppet] - https://gerrit.wikimedia.org/r/143754 (owner: Chad)
[20:31:52] (CR) Rush: Phabricator for iridium (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/142059 (owner: Rush)
[20:33:37] (CR) Dzahn: "updated mysql host/user/pass (but don't we want more than one, one user per phab instance?), updated mail server to polonium, fixed stmp t" [operations/puppet] - https://gerrit.wikimedia.org/r/142059 (owner: Rush)
[20:35:42] (PS3) Withoutaname: Disable $wgLegacyJavaScriptGlobals for test2wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139569 (https://bugzilla.wikimedia.org/65011)
[20:37:23] (CR) Chad: [C: 1] "Config already live on cluster per Bryan, this can go out anytime without a restart."
[operations/puppet] - https://gerrit.wikimedia.org/r/143754 (owner: Chad)
[20:37:38] (PS5) Dzahn: Phabricator for iridium [operations/puppet] - https://gerrit.wikimedia.org/r/142059 (owner: Rush)
[20:38:43] (PS6) Dzahn: Phabricator for iridium [operations/puppet] - https://gerrit.wikimedia.org/r/142059 (owner: Rush)
[20:39:55] <^d> ottomata: Could I get a merge on a puppet change for Elastic?
[20:40:56] (PS3) Ottomata: Prevent massively destructive actions against Elasticsearch [operations/puppet] - https://gerrit.wikimedia.org/r/143754 (owner: Chad)
[20:41:01] (CR) Ottomata: [C: 2 V: 2] Prevent massively destructive actions against Elasticsearch [operations/puppet] - https://gerrit.wikimedia.org/r/143754 (owner: Chad)
[20:41:01] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 20:38:28 UTC
[20:41:19] <^d> thx!
[20:41:22] yup!
[20:41:52] bblack: hey, I noticed some strange behavior withe Parsoid Varnishes & might need some help figuring out the reason for it
[20:42:19] can I steal some of your time?
[20:42:26] how to delete entire branch from operations/puppet ?
[20:43:01] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 20:38:28 UTC
[20:43:18] (if it's complicated or risky, not that important :p)
[20:43:26] <^d> mutante: Easiest way is via gerrit UI.
[20:43:50] ^d: 13:01 < ottomata> analytics/kraken can certainly be scrapped, mutante
[20:43:56] <^d> https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet,branches
[20:45:01] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 20:38:28 UTC
[20:45:24] !log deleted analytics/kraken branch from ops/puppet via gerrit ui, ack'ed by ottomata
[20:45:30] Logged the message, Master
[20:45:31] ^d: thanks!
worked
[20:45:31] (PS1) Matanya: vimrc: remove from puppet [operations/puppet] - https://gerrit.wikimedia.org/r/144059
[20:45:46] <^d> mutante: you're welcome
[20:47:01] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 20:38:28 UTC
[20:49:01] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 20:38:28 UTC
[20:51:01] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 20:38:28 UTC
[20:52:37] (PS12) Withoutaname: Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737)
[20:52:42] (CR) jenkins-bot: [V: -1] Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737) (owner: Withoutaname)
[20:53:01] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 20:38:28 UTC
[20:55:01] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 20:38:28 UTC
[20:55:52] (PS13) Withoutaname: Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737)
[20:56:04] <^d> icinga-wm: Why you say that? puppet logs on search1021 look fine.
[20:57:01] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 03 Jul 2014 20:38:28 UTC
[20:58:31] RECOVERY - Puppet freshness on search1021 is OK: puppet ran at Thu Jul 3 20:58:27 UTC 2014
[21:04:39] i'm trying to transwiki some pages from enwiki to testwiki, and i get " Import failed: Expected tag, got ".
since empty responses usually mean fatals, i suspect that's what's happening, but i have no clue why
[21:06:32] oblivian is doing a graceful restart of all apaches
[21:07:51] !log oblivian gracefulled all apaches
[21:07:55] Logged the message, Master
[21:15:15] where do those logmsgbot messages originate from?
[21:16:27] jgage: servers
[21:16:51] <_joe_> jgage: from tin in this case
[21:17:03] thanks.
[21:17:04] * jgage looks
[21:17:13] bblack: sent a mail to the ops list
[21:17:20] echo "$*" | nc -q0 neon.wikimedia.org 9200
[21:17:22] essentially
[21:18:23] _joe_: that seems to have worked :)
[21:19:05] <_joe_> mutante: it did
[21:19:10] <_joe_> thanks again
[21:28:29] (PS1) Andrew Bogott: Add archive-project-volumes [operations/puppet] - https://gerrit.wikimedia.org/r/144063
[21:29:25] (PS1) coren: Labs: add sync-exports to repo [operations/puppet] - https://gerrit.wikimedia.org/r/144064
[21:29:42] andrewbogott: ^^
[21:29:42] (PS2) Andrew Bogott: Add archive-project-volumes [operations/puppet] - https://gerrit.wikimedia.org/r/144063
[21:30:20] andrewbogott: Oh, wait. Don't +2 that yet
[21:30:30] ok
[21:30:42] Going to diff it with the tool that's actually running in prod?
[21:32:07] (PS2) coren: Labs: add sync-exports to repo [operations/puppet] - https://gerrit.wikimedia.org/r/144064
[21:32:56] No, I realized at the last minute that I hadn't filled in the source and from bits right from my template.
[21:33:02] (CR) Andrew Bogott: [C: 2] Labs: add sync-exports to repo [operations/puppet] - https://gerrit.wikimedia.org/r/144064 (owner: coren)
[21:33:10] oops too soon
[21:33:15] No, now it's okay.
[21:34:11] I always put a From: and Source: in the comments of stuff that's managed by puppet that I add. Easier to track them down.
[21:35:13] yep, I try to too but forget 70% of the time
[21:35:20] Anyway, merged, the only diff was in the ##headers
[21:38:02] (CR) Ori.livneh: [C: 2] "(trivial)" [operations/puppet] - https://gerrit.wikimedia.org/r/143599 (owner: Hoo man)
[21:39:02] (CR) coren: [C: 2] "I'm not a fan of single-host special casing as a rule, but since this is a temporary thing I suppose it's not horrid." [operations/puppet] - https://gerrit.wikimedia.org/r/143831 (owner: Matanya)
[21:39:08] - add base::firewall and ferm rules -> get iptables rules from puppet. manually flush them, run puppet again.. they dont come back ?!
[21:39:24] (PS3) coren: terbium: install python-mysqldb [operations/puppet] - https://gerrit.wikimedia.org/r/143831 (owner: Matanya)
[21:41:10] (CR) coren: [C: 2] "Still +2 after rebase." [operations/puppet] - https://gerrit.wikimedia.org/r/143831 (owner: Matanya)
[22:00:11] chrismcmahon: got my note?
[22:00:52] problem was http://tapestryjava.blogspot.de/2010/07/git-on-mac-os-x-dont-ignore-case.html and renaming of some of our classes :/
[22:00:54] aude: no,
[22:01:10] i signed up for slot on monday to update our code on test2 / test wikidata
[22:01:22] since we were not ready today due to that problem
[22:01:32] aude: ah, OK, thanks
[22:01:46] i can mail greg but don't want surprise
[22:01:48] aude: I did see that you postponed to Monday
[22:01:52] yep
[22:02:20] now on, i'll only make these on ubuntu
[22:02:27] aude: please do mail Greg, I'll be out myself on Monday
[22:02:32] ok
[22:02:46] i'm sur enot a problem
[22:16:25] (CR) Dzahn: [C: 1] "source says to wait for RT: 7618, that has been merged into RT: 2636 which is resolved.
also nowadays everybody can put their own .vimrc i" [operations/puppet] - https://gerrit.wikimedia.org/r/144059 (owner: Matanya)
[22:18:11] (PS2) Dzahn: vimrc: remove from puppet [operations/puppet] - https://gerrit.wikimedia.org/r/144059 (owner: Matanya)
[22:21:29] Commons is still getting flooded by warning sform 127.0.0.1 by the way
[22:21:33] wRaelBot: User 127.0.0.1 triggered filter ("Prevent massive uploads by brand new users") with the action "upload" on [[File:FOUNDATION PLAN AND FOOTING DETAILS OF PROSCENIUM ARCH - Whittier Theatre, 11602-11612 East Whittier Boulevard, Whittier, Los Angeles County, CA HABS CAL,19-WHIT,2-48.tif]]; actions: tag. Diff:
[22:21:43] Reedy: Did you figure out what it was? (not sure whether it was you looking at it)
[22:21:49] Is a regression from today
[22:21:56] Or at least recently and first triggered today
[22:22:06] https://commons.wikimedia.org/wiki/Special:Contributions/127.0.0.1
[22:22:07] I didn't look
[22:22:12] https://commons.wikimedia.org/w/index.php?title=Special:AbuseLog&wpSearchUser=127.0.0.1
[22:22:13] It's not a regression from today
[22:22:25] bawolff said it was a GWToolset bug
[22:23:00] Which sort of makes sense
[22:23:08] As the uploads happen from the job runners
[22:23:57] But obviously should probably be attributed to the instigator
[22:26:43] Filed https://bugzilla.wikimedia.org/show_bug.cgi?id=67504
[22:28:16] Krinkle: can you disable the abuse filter for localhost?
[22:29:46] Krinkle: you link to Special:Contributions doesn't work does it
[22:30:00] Nemo_bis: works fine?
[22:30:08] Shows:
[22:30:09] * (change visibility) 11:56, 7 May 2014 (diff | hist) . . (0) . . File:Bor-Nederlantsche-Oorloghen 9168.tif (Hansmuller uploaded a new version of File:Bor-Nederlantsche-Oorloghen 9168.tif)
[22:30:35] Well you prefixed it with "today"
[22:30:43] ?
[22:30:58] Today is not 7 May AFAIK
[22:31:11] Right
[22:31:16] It's only in recentchanges
[22:31:19] not in revision, so not in contributions
[22:32:48] Maybe it has user_id != 0 ?
[22:33:14] It's only for the upload action, not the creation of the file description page
[22:33:17] that's also weird
[22:33:28] they're separate transactions in mediawiki
[22:33:48] there used to be an exploit where toolserver could do the upload, but not the file page
[22:33:57] so users had to create that manually copy/pasting the suggested wikitext
[22:35:36] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures
[22:38:49] (PS8) Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - https://gerrit.wikimedia.org/r/143861
[22:48:19] (PS5) Dzahn: search - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137996 (owner: Rush)
[22:48:47] (CR) Dzahn: [C: 1] search - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137996 (owner: Rush)
[22:50:11] (PS5) Dzahn: parsoid - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137997 (owner: Rush)
[22:50:30] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[22:50:37] (CR) Dzahn: [C: 1] parsoid - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137997 (owner: Rush)
[22:51:00] (CR) Phe: [C: 1] Add pr_index table from Proofread Page extension [operations/software] - https://gerrit.wikimedia.org/r/143622 (owner: Reedy)
[22:52:40] (PS4) Dzahn: fundraising, replace generic::systemuser [operations/puppet] - https://gerrit.wikimedia.org/r/138007 (owner: Rush)
[22:53:06] (CR) Dzahn: [C: 1] fundraising, replace generic::systemuser [operations/puppet] - https://gerrit.wikimedia.org/r/138007 (owner: Rush)
[22:54:56] (PS4) Dzahn: nfs - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138003 (owner: Rush)
[22:55:26] (CR) Dzahn: [C: 1] nfs - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138003 (owner: Rush)
[22:56:18] (CR) Dzahn: [C: 1] Phabricator for iridium [operations/puppet] - https://gerrit.wikimedia.org/r/142059 (owner: Rush)
[22:56:22] (PS7) Dzahn: Phabricator for iridium [operations/puppet] - https://gerrit.wikimedia.org/r/142059 (owner: Rush)
[22:56:52] (CR) Dzahn: [C: 1] Phabricator for iridium [operations/puppet] - https://gerrit.wikimedia.org/r/142059 (owner: Rush)
[22:59:59] (PS5) Dzahn: openstack-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138002 (owner: Rush)
[23:00:05] RoanKattouw, mwalker, ori, MaxSem, RoanKattouw: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140703T2300)
[23:01:13] (CR) jenkins-bot: [V: -1] openstack-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138002 (owner: Rush)
[23:01:36] greg-g, are we doing swat today?
[23:01:40] if so I can take it
[23:01:41] but...
[23:01:45] if not; then
[23:02:03] greg-g isn't here
[23:02:11] There's one patch listed attributed to RoanKattouw...
[23:02:24] (PS6) Dzahn: openstack-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138002 (owner: Rush)
[23:02:36] yep; did you do the train deploys earlier?
[23:02:43] if so I'll assume I can push things out
[23:02:45] Yeah
[23:02:48] kk
[23:02:55] chrismcmahon is acting greg-g
[23:03:12] (CR) Dzahn: [C: 1] openstack-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138002 (owner: Rush)
[23:03:35] James_F, did you cherry pick https://gerrit.wikimedia.org/r/#/c/144077/ from somewhere
[23:03:41] or is that a totally new patch?
[23:04:43] mwalker: It's a cherry-pick from master
[23:05:11] mwalker Reedy I also assume you can push stuff today. I'm not really contributing much here.
[23:06:24] chrismcmahon: You are the traffic cop!
[23:06:57] bd808: I'm all like, y'all work it out, whatever.
[23:07:16] :) That's usually Greg's stance too.
[23:17:09] (CR) Dzahn: "what's worse, uncompressed logs piling up or that it would take a couple minutes to compress them every day? (asking)" [operations/puppet] - https://gerrit.wikimedia.org/r/125991 (https://bugzilla.wikimedia.org/63939) (owner: Hashar)
[23:17:18] (CR) Ori.livneh: [C: 1] "Brandon, +1 for the PS6/7 changes" [operations/puppet] - https://gerrit.wikimedia.org/r/136655 (https://bugzilla.wikimedia.org/64582) (owner: Ori.livneh)
[23:17:52] !log mwalker Synchronized php-1.24wmf12/extensions/VisualEditor/: Updating VisualEditor for {{gerrit|144081}} (duration: 00m 12s)
[23:17:56] Logged the message, Master
[23:18:19] RoanKattouw, James_F; if one of you would poke your change to see if it's gone out
[23:19:43] mwalker: Looking.
[23:22:11] mwalker: Yes, fixed. Thanks!
[23:23:11] awesome; I hearby declare this swat closed
[23:23:16] * mwalker goes back into hiding
[23:26:37] (PS9) Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - https://gerrit.wikimedia.org/r/143861
[23:28:02] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:28:22] PROBLEM - gdash.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:28:32] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:28:43] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:28:43] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:28:52] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:28:52] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:29:02] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:29:02] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:29:02] PROBLEM - check configured eth on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:29:12] RECOVERY - gdash.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 9055 bytes in 0.013 second response time
[23:29:22] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[23:29:32] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[23:29:37] (PS1) Dzahn: add system group for bugzilla reporter user [operations/puppet] - https://gerrit.wikimedia.org/r/144087
[23:29:42] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[23:29:42] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1203 seconds ago with 0 failures
[23:29:42] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning.
[23:29:52] RECOVERY - DPKG on tungsten is OK: All packages OK
[23:29:52] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[23:29:52] RECOVERY - check configured eth on tungsten is OK: NRPE: Unable to read output
[23:29:52] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[23:33:20] !log rhenium (pmacct / flow) Out of memory: Kill process 3123 (pmacctd) score 1 or sacrifice child
[23:33:23] Logged the message, Master
[23:33:28] mwalker: May have spoken too fast there, buddy
[23:33:35] damn
[23:33:43] the exception and fatal log looked good
[23:33:50] Oh, or someone knows what happen.
[23:34:22] did someone just restart tungston?
[23:34:27] *tungsten
[23:34:41] no, it happens on a regular basis
[23:34:52] it just gets too busy to answer to the monitoring checks anymore
[23:35:00] so they are all "socket timeout"
[23:35:04] Well then. No problem.
[23:35:04] not actual errors in the logs
[23:36:07] i wouldn't say "no", but not new or specific to this scap
[23:37:10] i think we can move mwprof to hafnium
[23:37:38] https://icinga.wikimedia.org/cgi-bin/icinga/histogram.cgi?host=tungsten&service=check+configured+eth&t1=1404344226&t2=1404430626&assumestateretention=yes
[23:40:35] !log osmium - libboost-dev : Depends: libboost1.54-dev but it is not going to be installed
[23:40:41] Logged the message, Master
[23:40:46] hm?
[23:41:04] just checked the puppet run, a bunch of dpkg fail
[23:41:09] it's the hhvm box
[23:41:34] ... install libboost-dev' returned 100:
[23:42:05] i thought we perma-silenced it?
[23:42:28] either way i can fix that
[23:47:29] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:48:19] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[23:49:26] (PS1) Dzahn: planet - add system group for system user [operations/puppet] - https://gerrit.wikimedia.org/r/144091
[23:51:01] (PS1) Dzahn: snmptt - add system group for system user [operations/puppet] - https://gerrit.wikimedia.org/r/144092
[23:53:08] (PS1) Dzahn: wikistats - add system group for system user [operations/puppet] - https://gerrit.wikimedia.org/r/144094
[23:54:54] (CR) Dzahn: [C: 2] wikistats - add system group for system user [operations/puppet] - https://gerrit.wikimedia.org/r/144094 (owner: Dzahn)
[23:55:40] (CR) Dzahn: [C: 2] planet - add system group for system user [operations/puppet] - https://gerrit.wikimedia.org/r/144091 (owner: Dzahn)
[23:58:15] (CR) Dzahn: [C: 2] add system group for bugzilla reporter user [operations/puppet] - https://gerrit.wikimedia.org/r/144087 (owner: Dzahn)
[23:59:23] (CR) Dzahn: [C: 1] bugzilla apache: Enable required modules for caching [operations/puppet] - https://gerrit.wikimedia.org/r/127254 (https://bugzilla.wikimedia.org/49720) (owner: JanZerebecki)