[00:31:33] (03CR) 10Andrew Bogott: [C: 032] labs_vagrant: Ensure that lvm volume is mounted first [operations/puppet] - 10https://gerrit.wikimedia.org/r/146632 (owner: 10BryanDavis) [00:43:32] PROBLEM - puppet last run on labsdb1002 is CRITICAL: CRITICAL: Puppet has 1 failures [00:44:12] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Puppet has 1 failures [00:44:22] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Puppet has 1 failures [00:44:42] PROBLEM - puppet last run on analytics1024 is CRITICAL: CRITICAL: Puppet has 1 failures [00:44:52] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Puppet has 1 failures [00:44:52] PROBLEM - puppet last run on db1019 is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:02] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:12] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:22] PROBLEM - puppet last run on db1058 is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:22] PROBLEM - puppet last run on elastic1016 is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:22] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:22] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:23] PROBLEM - puppet last run on tantalum is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:23] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:23] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:32] PROBLEM - puppet last run on nickel is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:33] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:42] PROBLEM - puppet last run on ssl1004 is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:42] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:02] PROBLEM - puppet last 
run on es1005 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:02] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:02] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:12] PROBLEM - puppet last run on search1021 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:22] PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:23] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:23] PROBLEM - puppet last run on search1009 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:32] PROBLEM - puppet last run on es7 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:32] PROBLEM - puppet last run on es4 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:33] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:42] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Puppet has 1 failures [00:46:52] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 1 failures [00:47:02] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: Puppet has 1 failures [00:47:02] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 1 failures [00:47:32] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [00:47:42] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [00:52:46] (03CR) 10Springle: "Sorry, my fault." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146527 (owner: 10Reedy) [00:58:02] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [00:58:22] RECOVERY - puppet last run on elastic1016 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [00:58:22] RECOVERY - puppet last run on tantalum is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [00:58:23] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [00:58:32] RECOVERY - puppet last run on labsdb1002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [00:58:32] RECOVERY - puppet last run on nickel is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [00:58:42] RECOVERY - puppet last run on ssl1004 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [00:58:42] RECOVERY - puppet last run on analytics1024 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [00:58:42] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [00:58:52] RECOVERY - puppet last run on db1019 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [00:59:02] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [00:59:02] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [00:59:12] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [00:59:12] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [00:59:12] RECOVERY - puppet last run on search1021 is OK: OK: Puppet is currently enabled, 
last run 52 seconds ago with 0 failures [00:59:27] RECOVERY - puppet last run on db1058 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [00:59:27] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [00:59:27] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [00:59:28] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [00:59:28] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [00:59:28] RECOVERY - puppet last run on search1009 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [00:59:28] RECOVERY - puppet last run on es4 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [00:59:28] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [00:59:37] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [00:59:37] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [00:59:47] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [00:59:47] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [00:59:47] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [00:59:57] RECOVERY - puppet last run on es1005 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [01:00:07] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [01:00:07] RECOVERY - puppet last run 
on cp4017 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [01:00:07] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 98 seconds ago with 0 failures [01:00:17] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [01:00:28] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [01:00:28] RECOVERY - puppet last run on es7 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [01:15:17] (PS4) Chmarkine: update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) [01:21:34] ^d: elastic1017 is doing it again [01:22:10] <^d> elastic1017 isn't healthy :\ [01:24:51] <^d> Can we depool it? [01:26:45] <^d> Let's at least take it out of LVS rotation. [01:27:15] ^d: LVS wouldn't really help - requests will get sent there anyway [01:27:26] <^d> Yeah, but I think we should fully depool it. [01:27:34] <^d> Step 1 is LVS so we stop sending it requests directly. [01:27:34] I've bounced Elasticsearch on it - should cause a minor shit storm but at least it'll be unstuck [01:27:49] ^d: lvs will depool it on its own if we want to shut it down [01:27:58] <^d> True. [01:28:02] the thing is - this machine has done this three times in the past 24 hours [01:28:19] other machines have done it too - but this one three times. [01:28:25] it's like something is messed up with it [01:28:26] I dunno [01:29:04] <^d> wio is going up on 18 now to compensate. [01:30:58] <^d> And 1018 is down in ganglia now. Can we take 17-19 out? [01:31:06] <^d> I think that ram is becoming a major issue. 
[01:31:49] ^d: we can [01:31:56] just shut down elasticsearch and shut down puppet [01:32:03] I can do that - let me do it slowly [01:32:59] <^d> We can bring them back up tomorrow and see if we can get them working better. But I don't want to babysit them now, it's late :) [01:33:14] <^d> And the other 16 will be fine with random gone. [01:34:06] !log moving shards off of elastic101[789] [01:34:14] Logged the message, Master [01:36:04] ^d: once the shards are off of elastic101[789] I'll turn off puppet and take them out of the rotation [01:36:12] rather, shut them down [01:38:59] ^d: honestly, we can leave them on after we've asked the shards to move off [01:39:38] <^d> Nobody will care about them anymore. [01:39:44] <^d> All they'll be doing is routing some traffic. [01:47:21] ahh help anyone around? [01:49:22] <^d> manybubbles: I'm sitting down to dinner. I'm within earshot so ping if you need me. [01:49:46] i did two commits and then tried to do a git review [01:49:55] but keep getting an error about how the change-id is not at the bottom [01:50:31] i added it manually but still get the error [01:51:02] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [01:56:02] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [02:13:14] (PS1) Scottlee: Fixed base module Puppet 3 lint issues. [operations/puppet] - https://gerrit.wikimedia.org/r/146675 [02:15:35] !log LocalisationUpdate completed (1.24wmf12) at 2014-07-16 02:14:32+00:00 [02:15:43] Logged the message, Master [02:22:55] (PS1) Scottlee: Fixed Puppet 3 lint issues relating to git ipython ishmael and coredb_mysql modules. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/146685 [02:27:16] !log LocalisationUpdate completed (1.24wmf13) at 2014-07-16 02:26:12+00:00 [02:27:20] Logged the message, Master [02:36:57] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:41:49] PROBLEM - RAID on es1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [02:53:11] ACKNOWLEDGEMENT - RAID on es1005 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Sean Pringle RT 7892 [03:04:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 16 03:03:41 UTC 2014 (duration 3m 40s) [03:04:52] Logged the message, Master [04:10:11] anyone awake? https://git.wikimedia.org/ does not work [04:13:40] hmm [04:13:48] antimony [04:20:23] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53288 bytes in 0.168 second response time [04:22:57] !log restarted gitblit on antimony [04:23:00] aude: ^ [04:23:01] Logged the message, Master [04:24:48] thanks! [05:42:51] I want to import articles on betalabs (eswiki), which machine I should use? Yuvi asked me to use deployment-salt, but it seems php isn't there so can't use importdump.php [05:42:57] any other quick way? [05:43:11] (or I'm wrong from scratch :)) [05:54:20] !log resuming page content model schema changes, osc_host.sh processes on terbium ok to kill in emergency [05:54:25] Logged the message, Master [05:59:22] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "We don't need this IMO:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146497 (owner: 10Ori.livneh) [05:59:57] <_joe_> good morning springle [06:00:10] hey _joe_ [06:00:59] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0366666666667 [06:01:30] (03CR) 10Giuseppe Lavagetto: "I don't like copytruncate a lot, you lose (almost certainly) some log lines in the process. Today I'll come up with a better solution." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [06:01:45] <_joe_> search is aching again [06:29:01] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:21] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:21] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:22] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:22] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:31] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:41] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:41] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:41] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:43] ah that time of day [06:29:51] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:51] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:52] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:52] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:01] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:31] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:41] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:08] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "One small correction, LGTM otherwise" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146526 (owner: 10Matanya) [06:38:02] (03CR) 10Matanya: mailman: monitor number of running processes (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146526 (owner: 10Matanya) [06:45:31] 
RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:31] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:45:41] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:45:41] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:45:51] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:45:51] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:45:52] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:46:01] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:46:11] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:46:41] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:41] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:46:41] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:46:51] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:46:52] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:46:52] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:47:01] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, 
last run 9 seconds ago with 0 failures [07:02:48] (03CR) 10Giuseppe Lavagetto: [C: 032] mailman: monitor number of running processes (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146526 (owner: 10Matanya) [07:04:04] <_joe_> matanya: thanks a lot [07:04:16] <_joe_> I would've never had the time to follow up on this now [07:04:54] (03PS3) 10Giuseppe Lavagetto: mediawiki: move File['/usr/local/apache'] from web.pp -> sync.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/146142 (owner: 10Ori.livneh) [07:05:40] sure _joe_, i'll do the rest of request on the tickets later today [07:13:13] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 05:12:56 UTC [07:13:23] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Wed Jul 16 07:13:17 UTC 2014 [07:18:13] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 05:17:41 UTC [07:19:01] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The /usr/local/apache dir has currently 0755 rights to it, we should not change that." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/146142 (owner: 10Ori.livneh) [07:23:03] (03PS4) 10Giuseppe Lavagetto: mediawiki: move File['/usr/local/apache'] from web.pp -> sync.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/146142 (owner: 10Ori.livneh) [07:24:16] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: move File['/usr/local/apache'] from web.pp -> sync.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/146142 (owner: 10Ori.livneh) [07:24:54] (03PS2) 10Giuseppe Lavagetto: Fold mediawiki::web::config back into mediawiki::web [operations/puppet] - 10https://gerrit.wikimedia.org/r/146495 (owner: 10Ori.livneh) [07:29:54] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Complete puppet failure [07:30:14] (03CR) 10Giuseppe Lavagetto: [C: 032] Fold mediawiki::web::config back into mediawiki::web [operations/puppet] - 10https://gerrit.wikimedia.org/r/146495 (owner: 10Ori.livneh) [07:32:17] <_joe_> bbiab [08:35:31] (03PS3) 10Giuseppe Lavagetto: Add logrotation for /var/log/mediawiki/* [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [08:36:16] <_joe_> godog: ^^ [08:36:22] <_joe_> can you take a look? [08:37:59] 146607 ? 
[08:39:11] <_joe_> godog: yes [08:39:27] yep in some minutes, finishing up the swift upgrade [08:39:36] !log depool ms-fe1002 for swift upgrade [08:39:42] Logged the message, Master [08:39:54] <_joe_> godog: just to give you context, we don't handle logs and rotation with upstart because we want to log to /var/log/mediawiki [08:39:57] <_joe_> ok sorry [08:43:25] no worries, it shouldn't take long [08:46:03] !log repool ms-fe1002 and depool ms-fe1003 [08:46:08] Logged the message, Master [08:51:56] !log repool ms-fe1003 and depool ms-fe1004 [08:52:01] Logged the message, Master [08:54:07] (03PS6) 10Giuseppe Lavagetto: Add init and upstart scripts [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 [08:54:17] !log repool ms-fe1004 [08:54:22] Logged the message, Master [08:54:51] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add init and upstart scripts [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 (owner: 10Giuseppe Lavagetto) [08:57:22] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Wed Jul 16 08:57:17 UTC 2014 [09:00:54] (03CR) 10Giuseppe Lavagetto: [C: 032] Redirect wikimedia.org/research to survey [operations/apache-config] - 10https://gerrit.wikimedia.org/r/146334 (owner: 10Reedy) [09:15:58] (03PS1) 10Giuseppe Lavagetto: Release a new apache config [operations/puppet] - 10https://gerrit.wikimedia.org/r/146731 [09:22:40] hello [09:25:08] <_joe_> akosiaris: hello my friend [09:25:18] <_joe_> did you enjoy your long weekend? [09:26:00] yes :-). Very very much. I managed to go about without a laptop or internet for 5 days. Sweet :-) [09:31:47] hi akosiaris glad to have you back [09:32:27] <_joe_> akosiaris: so now you're back to the real fun and you feel relieved, nice [09:33:01] 530 emails ? [09:33:09] <_joe_> ;) [09:33:25] I was away for 5 days!!! 2 of which were a weekend... [09:33:28] sigh [09:35:52] (03CR) 10Filippo Giunchedi: [C: 031] "looks good to me!" 
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [09:39:33] hey akosiaris [09:40:25] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 07:39:42 UTC [09:41:35] hey godog [09:42:20] and we even had some delays with mailman! [09:42:34] <_joe_> "delays" [09:42:40] (03PS4) 10Giuseppe Lavagetto: Add logrotation for /var/log/mediawiki/* [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [09:42:51] <_joe_> :) [09:43:37] (03CR) 10Filippo Giunchedi: [C: 031] Add logrotation for /var/log/mediawiki/* [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [09:44:07] (03PS5) 10Giuseppe Lavagetto: Add logrotation for /var/log/mediawiki/* [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [09:44:13] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add logrotation for /var/log/mediawiki/* [operations/puppet] - 10https://gerrit.wikimedia.org/r/146607 (owner: 10Ori.livneh) [09:56:42] (03CR) 10Filippo Giunchedi: "I looked at trusty's apache2 init script and it seems that inside do_stop if configtest fails apache is forcibly stopped though, so we sti" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146497 (owner: 10Ori.livneh) [09:59:07] <_joe_> godog: ouch, how can they be so braindead? [09:59:22] <_joe_> it has worked like that in debian _forever_ [10:00:10] <_joe_> unbelievable [10:00:18] <_joe_> the new init script is pure garbage [10:00:26] <_joe_> (reading it now) [10:02:04] heh a man's bug is another man's feature :)) [10:02:11] <_joe_> "The apache2$DIR_SUFFIX configtest failed, so we are trying to kill it manually. This is almost certainly suboptimal, so please make sure your system is working as you'd expect now!" 
[10:02:35] <_joe_> godog: come on this does not make sense in almost _any_ situation [10:02:43] <_joe_> let me check the reload part as well then [10:03:05] <_joe_> ok reload still works as expected. [10:04:21] part of the problem is that restart = stop + start, so it can't tell whether you want to stop or restart, in case you want to stop the above makes sense [10:04:38] <_joe_> godog: exactly [10:04:55] <_joe_> godog: that's why the configtest used to be done _before_ stopping [10:05:19] <_joe_> I may even say we should patch the init script [10:07:26] if apache is managed by puppet it seems more explicit and robust to check there instead of relying on a patched init script to DTRT [10:08:36] <_joe_> godog: mmmmh I don't agree - we want the init script to do "The right thing" even when not triggered by puppet [10:09:49] if it's the case that apache config might get changed by something other than puppet then yes I agree [10:10:13] <_joe_> it should not, on mediawikis [10:10:28] <_joe_> anyway, I'm 99% sure that change doesn't do what ori wanted [10:10:41] <_joe_> and this is -again- puppet being horrible [10:10:45] <_joe_> but let me check that [10:16:24] <_joe_> oh wow, with puppet 3 it behaves correctly! [10:17:29] (CR) Giuseppe Lavagetto: [C: 1] "Amazed by how braindead this change in the init script is, I change my vote then." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/146497 (owner: 10Ori.livneh) [10:20:37] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Wed Jul 16 10:20:29 UTC 2014 [10:28:37] (03PS2) 10Giuseppe Lavagetto: Release a new apache config [operations/puppet] - 10https://gerrit.wikimedia.org/r/146731 [10:29:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Release a new apache config [operations/puppet] - 10https://gerrit.wikimedia.org/r/146731 (owner: 10Giuseppe Lavagetto) [10:32:04] <_joe_> !log releasing a new apache config to all mediawikis [10:32:10] Logged the message, Master [10:32:28] (03PS1) 10Tim Landscheidt: ldaplist: Remove servicegroups.old [operations/puppet] - 10https://gerrit.wikimedia.org/r/146747 [10:41:19] (03PS1) 10Tim Landscheidt: Tools: Install pdf2svg [operations/puppet] - 10https://gerrit.wikimedia.org/r/146748 (https://bugzilla.wikimedia.org/68092) [11:05:54] (03PS1) 10QChris: Fix documentation of default path for wikimetrics base [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/146753 [11:05:57] <_joe_> lunch. bbiab [11:23:43] (03PS1) 10Manybubbles: Turn cirrus off on more wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146755 [11:23:55] (03PS1) 10Matanya: mailman: monitor queue size [operations/puppet] - 10https://gerrit.wikimedia.org/r/146756 [11:24:02] anyone around? 
I want to do a quick mediawiki config deploy [11:24:10] (03CR) 10Manybubbles: [C: 032] Turn cirrus off on more wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146755 (owner: 10Manybubbles) [11:24:23] (03Merged) 10jenkins-bot: Turn cirrus off on more wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146755 (owner: 10Manybubbles) [11:25:16] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: Take Cirrus as default from more wikis while we figure out load issues (duration: 00m 06s) [11:25:21] Logged the message, Master [11:27:46] (03PS2) 10Matanya: mailman: monitor queue size [operations/puppet] - 10https://gerrit.wikimedia.org/r/146756 [11:56:01] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [12:01:25] <_joe_> !log removed stale files from /etc/apache2/conf-enabled on all mw hosts [12:05:19] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:06:10] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 6.137 second response time [12:09:19] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:09] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 2.099 second response time [12:10:22] (03Abandoned) 10QChris: Reflect move of refinery script to drop partitions [operations/puppet] - 10https://gerrit.wikimedia.org/r/145980 (owner: 10QChris) [12:13:42] oblivian is doing a graceful restart of all apaches [12:14:57] !log oblivian gracefulled all apaches [12:15:03] Logged the message, Master [12:30:10] (03PS1) 10Giuseppe Lavagetto: Make all.conf read from the new path [operations/apache-config] - 10https://gerrit.wikimedia.org/r/146760 [12:30:57] (03CR) 10Reedy: [C: 031] Make all.conf read from the new path [operations/apache-config] - 
10https://gerrit.wikimedia.org/r/146760 (owner: 10Giuseppe Lavagetto) [12:31:18] (03CR) 10Giuseppe Lavagetto: [C: 032] Make all.conf read from the new path [operations/apache-config] - 10https://gerrit.wikimedia.org/r/146760 (owner: 10Giuseppe Lavagetto) [12:33:33] (03PS1) 10Giuseppe Lavagetto: mediawiki: read from the correct path for the apache config files [operations/puppet] - 10https://gerrit.wikimedia.org/r/146761 [12:35:08] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: read from the correct path for the apache config files [operations/puppet] - 10https://gerrit.wikimedia.org/r/146761 (owner: 10Giuseppe Lavagetto) [12:44:52] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:52:07] <_joe_> this is puppet-merge failing to sync submodules [12:52:29] <_joe_> shit [12:53:52] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [12:54:22] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:13] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 4.930 second response time [12:59:22] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:04] K4-713: The time is nigh to deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140716T1300) [13:00:22] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 9.021 second response time [13:01:52] PROBLEM - puppetmaster backend https on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:02] (03PS1) 10Matanya: mailman: monitor web and archive access [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 [13:02:46] RECOVERY - puppetmaster backend https on strontium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 
0.028 second response time [13:03:54] (03CR) 10Andrew Bogott: [C: 032] ldaplist: Remove servicegroups.old [operations/puppet] - 10https://gerrit.wikimedia.org/r/146747 (owner: 10Tim Landscheidt) [13:04:54] oblivian is doing a graceful restart of all apaches [13:06:11] !log oblivian gracefulled all apaches [13:06:39] (03CR) 10Andrew Bogott: [C: 032] Tools: Install pdf2svg [operations/puppet] - 10https://gerrit.wikimedia.org/r/146748 (https://bugzilla.wikimedia.org/68092) (owner: 10Tim Landscheidt) [13:20:35] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 13:18:19 UTC [13:22:35] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 13:18:19 UTC [13:24:35] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 13:18:19 UTC [13:25:32] (03CR) 10Filippo Giunchedi: [C: 031] apache module: test config before attempting restart [operations/puppet] - 10https://gerrit.wikimedia.org/r/146497 (owner: 10Ori.livneh) [13:26:35] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 13:18:19 UTC [13:27:26] (03PS3) 10Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (WiP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146082 [13:28:28] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: manage single configs via apache::site (WiP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146082 (owner: 10Giuseppe Lavagetto) [13:28:35] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 13:18:19 UTC [13:30:35] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 13:18:19 UTC [13:32:35] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 13:18:19 UTC [13:34:35] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last 
successful Puppet run was Wed 16 Jul 2014 13:18:19 UTC [13:36:35] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 13:18:19 UTC [13:36:52] (03CR) 10Filippo Giunchedi: mailman: monitor web and archive access (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 (owner: 10Matanya) [13:37:45] RECOVERY - Puppet freshness on elastic1016 is OK: puppet ran at Wed Jul 16 13:37:36 UTC 2014 [13:39:35] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 13:37:36 UTC [13:40:16] (03PS2) 10Matanya: mailman: monitor web and archive access [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 [13:40:48] RECOVERY - Puppet freshness on elastic1016 is OK: puppet ran at Wed Jul 16 13:40:43 UTC 2014 [13:43:16] (03PS4) 10Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (WiP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146082 [13:49:20] (03CR) 10Filippo Giunchedi: mailman: monitor web and archive access (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 (owner: 10Matanya) [13:49:34] matanya: almost there :) thanks for the code review! [13:50:17] thank you for looking at it [13:51:30] (03PS5) 10Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (WiP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146082 [13:51:35] * YuviPanda pokes Coren about the graphite server RT [13:53:45] (03PS3) 10Matanya: mailman: monitor web and archive access [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 [13:55:42] YuviPanda: Lemme to go see what's up. [13:55:54] Coren: ty! 
:)
[13:58:10] (PS6) Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (WiP) [operations/puppet] - https://gerrit.wikimedia.org/r/146082
[14:05:51] (PS7) Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (WiP) [operations/puppet] - https://gerrit.wikimedia.org/r/146082
[14:11:26] matanya: one last thing, /pipermail/ is served only on http, listinfo only on https (only modulo redirects that is)
[14:19:17] YuviPanda: I stand by my reluctance to use SSD for that; but I'm not going to block progress over it. Just be aware that I'll expect only the graphite data goes on SSD; I'm certainly not going to put the alerting stuff there. :-)
[14:20:00] do anyone know if gerrit could be fixed to avoid following mail spam? http://i.imgur.com/e5NH2wV.png
[14:20:27] Coren: heh, yeah, it's got spinning disks as well :)
[14:20:31] YuviPanda: Though I suppose we should be able to configure the kernel to be really agressive about writebehind delays and reduce the writes.
[14:20:37] Coren: yeah
[14:20:41] Coren: and also backup :)
[14:21:24] Coren: so I guess we'll put the OS, etc on spinning disks and setup /srv to be the full SSD, unstriped
[14:21:57] YuviPanda: You still want to stripe it to equalize the write load.
[14:22:05] ah, right. I meant any redundancy
[14:22:07] in the SSDs
[14:22:27] (CR) Ottomata: [C: 2 V: 2] Fix documentation of default path for wikimetrics base [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/146753 (owner: QChris)
[14:22:43] manybubbles: I shoudl merge this, yes?
[14:22:43] https://gerrit.wikimedia.org/r/#/c/146475/
[14:25:43] * AzaToth ties down YuviPanda under a rock and demands an answer
[14:26:11] AzaToth: sadly, out of my knowledge levels. you want qchris
[14:26:14] ottomata: sure. it won't actually do anything other than make sure the settings are applied if we bounce the whole cluster
[14:26:16] okai
[14:26:28] Nonononono.
[14:26:32] Coren: btw, we *can* mirror the SSDs, but then we'll get only 300G of data. That should still be enough for a year or more.
[14:26:48] qchris: I suddenly got 5 copies of a gerrit mesage; any issues? see http://i.imgur.com/e5NH2wV.png
[14:26:51] (PS5) Ottomata: Make permanent some Elasticsearch config [operations/puppet] - https://gerrit.wikimedia.org/r/146475 (owner: Manybubbles)
[14:26:57] (CR) Ottomata: [C: 2 V: 2] Make permanent some Elasticsearch config [operations/puppet] - https://gerrit.wikimedia.org/r/146475 (owner: Manybubbles)
[14:26:59] YuviPanda: The question really is "how precious is that data"
[14:27:15] right, I definitely consider it 'precious' enough to have backups
[14:27:25] but that's my general attitude to *any* data...
[14:27:39] AzaToth: Hard to say without seeing the real emails and what they refer to :-/
[14:28:43] qchris: http://paste.debian.net/110095/
[14:28:58] they all are the "same"
[14:29:02] only the name differs
[14:30:03] qchris: seems I'm CC on all messages
[14:30:13] dunno why
[14:30:17] AzaToth: I guess you are a reviewer of the change?
[14:30:45] qchris: yes, but why whould a reviewer get all individual messages?
[14:31:12] AzaToth: Because the reviewers were added individually.
[14:31:40] qchris: makes no sense to make such connection
[14:31:40] AzaToth: When the first was added, gerrit did not know that another will get added. So it cannot batch them up
[14:31:52] AzaToth: without jumping through hoops.
[14:31:55] uh?
[14:32:17] when you add a reviewer, why whould it need to send out mail to every other reviwer?
[14:33:09] makes no senste to me
[14:33:12] sense*
[14:33:34] AzaToth: I never looked, because it never hurt me. Let me look it up a bit.
[14:34:09] I've never noticed it before
[14:34:24] dunno if it has happend before even
[14:34:50] <_joe_> !log moving the stale conf-enabled directory away on jobrunners, or when we upgrade to trusty all hell will break loose
[14:34:55] Logged the message, Master
[14:35:26] _joe_: always nice with some hell now and then, it gets the community together
[14:35:56] <_joe_> AzaToth: not if you're the one that will be chasing daemons afterwards :P
[14:35:57] heh
[14:37:09] * YuviPanda files RT to randomly disrupt servers once in a while
[14:37:16] qchris: the first mail was to me directly, not CC, the others where CC
[14:38:53] bd808|BUFFER: ping please when available :)
[14:47:02] (PS8) Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (WiP) [operations/puppet] - https://gerrit.wikimedia.org/r/146082
[14:47:17] AzaToth: Skimming throught the code it seems you're getting CCed becouse you voted on the change.
[14:48:44] AzaToth: But I could not find a way to turn such emails off.
[14:49:15] qchris: I've not voted on that change
[14:49:40] oh,
[14:49:43] AzaToth: ? ... It shows a CR+1 for you.
[14:49:47] I reviewd it
[14:49:57] Ok :-D
[14:50:09] heh, thought about bugzilla votes
[14:50:47] qchris: I made that review after I got my mail, and then I got all the others
[14:51:15] (PS9) Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (WiP) [operations/puppet] - https://gerrit.wikimedia.org/r/146082
[14:51:16] I assume that makes sense for someone, but not for me
[14:52:11] or it could be a bug, in that it had all the mails queued, and when I made my review, it accidentally inserted me as CC on all those mails still in the queue
[14:52:12] AzaToth: The reviewers got added after you've done your review. By reviewing, gerrit takes you as interested in that change.
[14:52:21] hmm
[14:52:45] I remember seeing all the reviewers in the list when I visited the page before I "voted"
[14:53:07] You voted on 14:16 ...
the Email you pasted is from 14:17.
[14:53:21] hmm, perhaps he added some more
[14:53:32] but still, why should I get a copy of their mails?
[14:53:39] Also the code says "// CC anyone else who has posted an approval mark on this change"
[14:53:48] ah
[14:53:51] So I guess it's not a bug, but on purpose.
[14:54:00] but why?
[14:54:23] I can't figure out any purpose of such CC
[14:55:22] can you?
[14:56:00] AzaToth: Let me track it down in the repo history. Maybe that has an explicit example ...
[15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140716T1500)
[15:00:52] * anomie sees no patches for SWAT this morning
[15:03:53] qchris: git log -S'CC anyone else who has posted an approval mark on this change' doesn't give much meat
[15:11:02] (PS10) Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (WiP) [operations/puppet] - https://gerrit.wikimedia.org/r/146082
[15:11:28] AzaToth: bf1f5c69c541f20ca3a10fcd2a250edb0107dc76 is your friend.
[15:11:42] AzaToth: To me, it's not a bug, but on purpose.
[15:14:02] AzaToth: One could turn sending such emails into a configuration option ... but gerrit does not get much wmf resources.
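The CC behavior qchris traced can be pictured with a toy model. This is a minimal sketch in Python, not Gerrit's actual code (Gerrit is written in Java); the function name and the reviewer names alice, bob and carol are invented for illustration, and only the quoted comment comes from the log. It assumes one mail is sent per added reviewer, with every earlier voter CCed.

```python
def recipients_for_reviewer_added(new_reviewer, approvers):
    """Return (to, cc) for a hypothetical 'reviewer added' notification."""
    to = [new_reviewer]
    # Mirrors the comment quoted in the log:
    # "CC anyone else who has posted an approval mark on this change"
    cc = sorted(a for a in approvers if a != new_reviewer)
    return to, cc


# AzaToth voted CR+1 before the other reviewers were added.
approvers = {"AzaToth"}
# Hypothetical reviewers, added one at a time, so one mail each.
added = ["alice", "bob", "carol"]

mails = [recipients_for_reviewer_added(r, approvers) for r in added]
cc_count = sum("AzaToth" in cc for _, cc in mails)
```

Under these assumptions the earlier voter receives one CC per added reviewer, which would match the burst of near-identical mails AzaToth describes.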
[15:28:17] (PS11) Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (WiP) [operations/puppet] - https://gerrit.wikimedia.org/r/146082
[15:35:59] (PS12) Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (one server) [operations/puppet] - https://gerrit.wikimedia.org/r/146082
[15:36:21] (CR) jenkins-bot: [V: -1] mediawiki: manage single configs via apache::site (one server) [operations/puppet] - https://gerrit.wikimedia.org/r/146082 (owner: Giuseppe Lavagetto)
[15:40:21] (PS13) Giuseppe Lavagetto: mediawiki: manage single configs via apache::site (one server)) [operations/puppet] - https://gerrit.wikimedia.org/r/146082
[15:48:16] (PS1) Dzahn: enhance mailman monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/146796
[15:51:56] (CR) Giuseppe Lavagetto: [C: 1] enhance mailman monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/146796 (owner: Dzahn)
[15:52:04] (PS2) Dzahn: enhance mailman monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/146796
[15:52:25] i broke this too?
[15:52:45] i should take a break :D
[15:52:46] <_joe_> matanya: no, we both assumed we didn't have a check
[15:52:52] <_joe_> we had a broken one instead :)
[15:53:04] matanya: no, we had an existing check all the time, but it did not work
[15:53:04] <_joe_> we should never assume _anything_
[15:53:36] yes. no assumption
[15:54:05] matanya: if we have other proc checks that just use 'args' and then the lower value is 1...
[15:54:09] then i think they are also broken
[15:54:24] (CR) Giuseppe Lavagetto: "Puppet compiler results seem good:" [operations/puppet] - https://gerrit.wikimedia.org/r/146082 (owner: Giuseppe Lavagetto)
[15:54:44] i think we do mutante
[15:54:48] we had this happen before (icinga check sees itself)
[15:54:56] it made me use --ereg-argument-array
[15:57:11] jenkins?
[16:00:47] (Abandoned) Dzahn: replace literal tabs in role/cache.pp [operations/puppet] - https://gerrit.wikimedia.org/r/146112 (owner: Dzahn)
[16:02:47] even after a "git submodule update" i STILL have modified: modules/nginx in my unrelated patch? sssighhh
[16:06:01] mutante: yeah, I usually cd into the submodule and do a git reset --hard...
[16:06:44] git reset --hard origin/production
[16:06:44] fatal: ambiguous argument 'origin/production': unknown revision or path not in the working tree.
[16:06:51] ah, master ?:P
[16:07:18] YuviPanda: now it's not 'modules/nginx' being modified but just 'nginx' :p
[16:07:30] yeah, master.
[16:08:27] that's becoming the part I hate most about the submodules. I go do a git pull on my clean checkout, make a change, git add . + git commit, and I've got some random submodule sha1 in my commit because I didn't notice to do git submodule update
[16:08:56] bblack: exactly
[16:09:23] and if a change has been sitting in gerrit for a bit and you want to amend
[16:10:01] really, git's submodule feature just kinda sucks. I hate to say that because the rest of git is pretty awesome, but submodules are hacky
[16:10:12] and all of a sudden i'm in ' Merge "Do not install debug symbols""
[16:10:35] but git add . is dangerous anyway.
You should use git commit -p or git add -u
[16:11:07] even if I'm explicit, it's annoying that it clutters my "git status" output, etc
[16:11:21] yeah, that's true
[16:11:29] and it will still get in the way when you're doing things like stashing temporary changes, flipping between branches, etc
[16:11:30] git subtree is a newer replacement, but haven't used it yet
[16:11:39] !log Killed jenkins
[16:11:42] so what is this: modified: nginx (new commits)
[16:11:44] Logged the message, Master
[16:11:50] as opposed to modules/nginx
[16:12:08] !log Restarted jenkins
[16:12:14] Logged the message, Master
[16:12:28] Reedy: aaah, thanks
[16:12:33] hmm I hadn't seen subtrees yet, it's not in the git ervs I have on my boxes. reading up now :)
[16:14:08] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Wed 16 Jul 2014 14:13:32 UTC
[16:16:44] greg-g, can i switch my window to now?
[16:17:17] it seems like there is noone scheduled for 9-10
[16:17:30] yurikJerusalem: sure
[16:17:33] greg-g, thx
[16:17:44] oh, you're in Jerusalem?
[16:18:00] (CR) JanZerebecki: [C: 1] update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: Chmarkine)
[16:18:20] come say hello yurikJerusalem
[16:18:35] yeah!
[16:18:42] greg-g, yep!
[16:18:50] its awesome here! :D
[16:18:51] YuviPanda: eh, now it's completely messed up. git review -d .., changing to modules/nginx, doing a hard --reset.. and ..HEAD is now at 0f6af76 Merge "Do not install debug symbols"
[16:19:04] abandons
[16:19:15] yurikJerusalem: pm if you want to meet :)
[16:19:33] matanya, thx! i'm still trying to work out my schedule
[16:19:35] crazy :)
[16:20:03] a completely different thing by coren...
[16:20:17] (PS3) Dzahn: bugzilla - remove DHE ciphers [operations/puppet] - https://gerrit.wikimedia.org/r/146461
[16:20:32] mutante: Eh, wut?
[16:21:45] Coren: i have issues with submodules...i don't know, i just get your change when all i want is amend to mine
[16:22:01] mutante: What change is that?
[16:22:31] Coren: what i want is make 146461 NOT touch modules/nginx
[16:23:09] Coren: then when i do a git reset --hard origin/master in there... for some reason i get to: 0f6af76 Merge "Do not install debug symbols
[16:23:57] and after that i don't have "modified: modules/nginx" but instead 'modified: nginx'
[16:24:17] (CR) Yuvipanda: [C: 1] Add line to collect Puppet failures. [operations/puppet] - https://gerrit.wikimedia.org/r/144737 (owner: Scottlee)
[16:24:30] (CR) jenkins-bot: [V: -1] bugzilla - remove DHE ciphers [operations/puppet] - https://gerrit.wikimedia.org/r/146461 (owner: Dzahn)
[16:24:59] (CR) Dzahn: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/146796 (owner: Dzahn)
[16:25:26] mutante: I... have no idea how that happens.
[16:25:44] is jenkins dead?
[16:26:33] mutante: I've fixed it locally... Let me just set you back to being the author :P
[16:26:35] yurikJerusalem: why you think so?
[16:26:40] YuviPanda: ok I read up on the subtree method, it actually does seem a lot better for e.g. what we do with puppet. Although who knows until someone tries it in practice.
[16:26:41] (PS4) Yuvipanda: bugzilla - remove DHE ciphers [operations/puppet] - https://gerrit.wikimedia.org/r/146461 (owner: Dzahn)
[16:26:42] mutante: ^
[16:26:57] Reedy: YuviPanda: thanks !
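The stray submodule pointer bblack and mutante keep hitting can be spotted before committing. A minimal sketch in Python, assuming the input text comes from `git diff --cached --raw`, where submodule pointers (gitlinks) are listed with file mode 160000. The helper name, sample lines and hashes are invented for illustration; this is not something anyone in the channel is described as using.

```python
GITLINK_MODE = "160000"  # file mode git records for submodule pointers


def staged_submodules(raw_diff):
    """Return paths of staged submodule pointer changes in --raw diff text."""
    paths = []
    for line in raw_diff.splitlines():
        if not line.startswith(":"):
            continue  # not a raw diff entry
        # Entry layout: ":<old mode> <new mode> <old sha> <new sha> <status>\t<path>"
        meta, _, path = line.partition("\t")
        old_mode, new_mode = meta[1:].split()[:2]
        if GITLINK_MODE in (old_mode, new_mode):
            paths.append(path)
    return paths


# Invented sample: one ordinary file and one submodule pointer staged together.
SAMPLE = (
    ":100644 100644 1111111 2222222 M\tmanifests/site.pp\n"
    ":160000 160000 3333333 4444444 M\tmodules/nginx\n"
)
flagged = staged_submodules(SAMPLE)
```

A pre-commit hook could refuse the commit when the list is non-empty, which is roughly the safety net that reviewing hunks with `git commit -p` gives by hand.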
[16:27:11] andre__, it was 40 min since +2 of https://gerrit.wikimedia.org/r/#/c/146633/
[16:27:14] it would be nice from the pov of automated checkouts on the puppetmasters and such too, because they will see it as a simple repo with no strange complexities
[16:27:24] mutante: so I just did 'git reset HEAD^' and it unstaged my changes, then 'git commit -p' to just pick the ones I want
[16:27:36] yurikJerusalem: it was stuck, Reedy kicked it
[16:27:44] good thx
[16:28:24] (PS5) Yuvipanda: bugzilla - remove DHE ciphers [operations/puppet] - https://gerrit.wikimedia.org/r/146461 (owner: Dzahn)
[16:28:25] mutante: updated author back to you ^
[16:28:38] in general I find git commit -p to be good practice
[16:29:16] YuviPanda: ok, thank you. just that i've had the same thing all the time.. and could solve it by just updating the submodules and amending
[16:29:29] hmm, right
[16:29:45] git commit -p also forces you to look at the diff before commiting, which I find nice
[16:30:04] bblack: yeah, true. git subtree has been in stable git releases for a while now too, I think. Not much usage tho
[16:30:21] yeah I just was looking at the wrong manpages :)
[16:30:31] * YuviPanda has a slightly weird git workflow that relies heavily on git commit -p and git reflog
[16:30:58] I only really use reflog when I screw something up
[16:31:32] yeah, so I don't use git review, so I'll be on an arbit unnamed commit working on something, and then pick up another change with fetch + checkout (copy pasted), and then if I want to I'll use reflog to go back
[16:31:50] I should start using branches as I was using before, but getting patches someone else has amended kinda ruins it a bit...
[16:32:00] or I've to use git review, which I'm not too much of a fan of
[16:34:59] Reedy, i think jenkins might need another kick :(
[16:35:11] you fixed it for a minute
[16:35:15] but it's gone again?
[16:35:31] slaves look to be doing stuff at least
[16:36:29] some of 'em yeah
[16:40:13] 'untracked files: modules/cdh4' arrg
[16:42:56] re: nginx module.. cant we use it for actual production SSL ?
[16:43:15] i mean, all this hassle with it but then we actually edit ./puppet/templates/nginx instead of the module
[16:46:56] (PS1) Dzahn: nginx - remove cipher kEDH+AESGCM [operations/puppet] - https://gerrit.wikimedia.org/r/146806
[16:47:29] (PS2) Dzahn: nginx - remove cipher kEDH+AESGCM [operations/puppet] - https://gerrit.wikimedia.org/r/146806
[16:49:01] !log Restarted jenkins again
[16:49:06] Logged the message, Master
[16:49:47] (PS6) Dzahn: bugzilla - remove DHE ciphers [operations/puppet] - https://gerrit.wikimedia.org/r/146461
[16:51:14] (PS3) Dzahn: gerrit - remove DHE ciphers [operations/puppet] - https://gerrit.wikimedia.org/r/146464
[16:51:15] gallium is processing again...
[16:52:02] (PS2) Dzahn: dynamicproxy - remove DHE ciphers [operations/puppet] - https://gerrit.wikimedia.org/r/146466
[16:54:36] (CR) Dzahn: [C: 2] update SSL ciphers for contacts.wm.org to support PFS [operations/puppet] - https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: Chmarkine)
[16:54:38] (CR) CSteipp: [C: 1] nginx - remove cipher kEDH+AESGCM [operations/puppet] - https://gerrit.wikimedia.org/r/146806 (owner: Dzahn)
[16:56:17] (CR) Dzahn: [V: 2] "contacts doesn't have many users, but yea, consistency is great" [operations/puppet] - https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: Chmarkine)
[16:56:30] greg-g, jenkins is kinda slow - still waiting for it to merge my wmf/1.24wmf12 & 13...
[16:57:02] dogeydogey: can you tell me about <%= @mw_primary %>? Is mw_primary an array? Or does that @ mean something else?
[16:57:12] * andrewbogott not very familiar with .erb syntax
[16:57:24] it's just in regards to this https://etherpad.wikimedia.org/p/Puppet3
[16:57:33] I updated wherever there was a lint error
[16:57:34] * andrewbogott not very familiar with the verb 'to be'
[16:57:40] andrewbogott: when puppet renders an erb, it passes it a context object
[16:57:47] * andrewbogott talk like Cookie Monster
[16:57:47] all local-scoped vars are instance vars on that object
[16:57:55] '@' is ruby's syntax for accessing an instance var
[16:58:03] manybubbles, ^d, i'm still waiting for jenkins to merge core :(
[16:58:03] andrewbogott: it's a puppet 3 thing https://projects.puppetlabs.com/issues/19058
[16:58:11] running a bit late because of it :(
[16:58:14] puppet tolerates you omitting it, but it's deprecated by puppet 3, and will be removed in some future version
[16:58:33] yurikJerusalem: did you mean to ping me?
[16:58:35] andrewbogott: it's explainde here: http://docs.puppetlabs.com/guides/templating.html#referencing-variables
[16:58:37] oh, yeah, its ok
[16:58:37] So basically for any <%= foo %> it should correctly be <%= @foo %> ?
[16:58:40] we're canceling that deploy
[16:58:44] andrewbogott: yes
[16:58:45] Regardless of type?
[16:58:45] because of the brownout yesterday
[16:58:46] * andrewbogott reads
[16:58:49] nod
[16:58:57] you can have my time as far as I'm concerned
[16:59:24] manybubbles, thx :)
[16:59:31] greg-g, ^
[16:59:32] andrewbogott: yea, pretty much https://gerrit.wikimedia.org/r/#/c/145493/3/templates/apache/sites/wikitech.wikimedia.org.erb
[17:00:04] manybubbles, ^d: The time is nigh to deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140716T1700)
[17:01:09] yurikJerusalem: kk
[17:01:32] (CR) Dzahn: "Chmarkine, thanks for the patch, that being said..
contacts.wm currently just uses a self-signed cert" [operations/puppet] - https://gerrit.wikimedia.org/r/146510 (https://bugzilla.wikimedia.org/53259) (owner: Chmarkine)
[17:02:13] Krinkle: around? thoughts about Jenkins? Reedy already kicked it once in the past hour ish, but it seems like the gate-and-summit queue isn't clearing
[17:02:50] greg-g: in a meeting, will check out in 10 min
[17:03:45] thanks
[17:04:01] also, I was talking about a summit in another channel, obviously gate-and-submit ;)
[17:06:51] (PS2) Ori.livneh: apache module: test config before attempting restart [operations/puppet] - https://gerrit.wikimedia.org/r/146497
[17:07:09] (CR) Ori.livneh: [C: 2] apache module: test config before attempting restart [operations/puppet] - https://gerrit.wikimedia.org/r/146497 (owner: Ori.livneh)
[17:07:40] greg-g: we should probably have an alert when the the amount of time a CR+2'd patch has been waiting on jenkins exceeds some threshold
[17:09:00] ori: agreed....
[17:10:10] (CR) Dzahn: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/146796 (owner: Dzahn)
[17:11:00] (PS1) Ori.livneh: jobrunner: whitespace / formatting tweaks [operations/puppet] - https://gerrit.wikimedia.org/r/146816
[17:11:14] (CR) Dzahn: "1h 20m after upload, 2 attempts to recheck, 3 restarts of jenkins, no vote" [operations/puppet] - https://gerrit.wikimedia.org/r/146796 (owner: Dzahn)
[17:13:05] (CR) Ori.livneh: [C: 2 V: 2] "trivial; whitespace only change" [operations/puppet] - https://gerrit.wikimedia.org/r/146816 (owner: Ori.livneh)
[17:13:15] greg-g, Reedy grrr... takes foreva
[17:13:27] still hasn't merged it ;9
[17:13:29] :(
[17:13:39] nope, not getting jenkins votes either
[17:15:37] mutante: seems like you need to restart zuul
[17:15:48] probably?
[17:15:55] The head of the gate-and-submit zuul queue is a patch that was manually merged...
[17:16:00] Maybe just a recheck?
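The alert ori suggests at 17:07:40 reduces to comparing approval times against a threshold. A minimal sketch in Python, assuming the approval timestamps are already available from somewhere (no Gerrit or Zuul API is called here). The 30-minute threshold and the 15:47 approval time for change 146633 are illustrative guesses, the latter based on yurik's "40 min since +2" remark.

```python
from datetime import datetime, timedelta


def overdue_changes(pending, now, threshold):
    """Return change numbers whose CR+2 is older than `threshold`."""
    return [change for change, approved_at in pending
            if now - approved_at > threshold]


now = datetime(2014, 7, 16, 17, 7, 40)
pending = [
    # (change number, time of CR+2); the first time is an illustrative guess.
    (146633, datetime(2014, 7, 16, 15, 47, 0)),
    (146497, datetime(2014, 7, 16, 17, 7, 9)),
]
late = overdue_changes(pending, now, threshold=timedelta(minutes=30))
```

A monitoring check built on this would go critical whenever the returned list is non-empty.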
[17:16:17] greg-g, i'm running too late, can't deploy any more :(
[17:16:31] any way to cancel the +2 of core merge?
[17:17:11] guh, revert?
[17:17:12] (CR) JanZerebecki: [C: 1] nginx - remove cipher kEDH+AESGCM [operations/puppet] - https://gerrit.wikimedia.org/r/146806 (owner: Dzahn)
[17:18:10] greg-g, revert of something that hasn't been merged yet? :)
[17:18:15] (CR) JanZerebecki: [C: 1] gerrit - remove DHE ciphers [operations/puppet] - https://gerrit.wikimedia.org/r/146464 (owner: Dzahn)
[17:18:16] Remove the +2 on gerrit
[17:20:34] Reedy: should we kick Jenkins again, or zuul?
[17:21:02] Reedy, removed, thx
[17:21:08] pls don't merge it
[17:21:17] And/or abandon
[17:21:20] or -2
[17:22:42] -2ed
[17:29:33] Jenkins is doing gate-and-submit now...
[17:30:56] (PS3) Dzahn: enhance mailman monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/146796
[17:31:11] (CR) Andrew Bogott: [C: -1] "This looks great, just one misplaced space." (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/146685 (owner: Scottlee)
[17:32:07] (CR) Andrew Bogott: [C: 2] Add line to collect Puppet failures. [operations/puppet] - https://gerrit.wikimedia.org/r/144737 (owner: Scottlee)
[17:34:03] (CR) Andrew Bogott: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/144737 (owner: Scottlee)
[17:35:24] merge break
[17:39:27] win 1
[17:40:25] (PS1) Cmjohnson: removing osm-db1001/2 entries changing mgmt to labsdb1006/7 [operations/dns] - https://gerrit.wikimedia.org/r/146822
[17:40:30] mary, hi fellow irssi user
[17:41:32] halu mutante
[17:42:36] (PS3) Andrew Bogott: dynamicproxy: Pass through existing XFF data too [operations/puppet] - https://gerrit.wikimedia.org/r/106907 (owner: Stwalkerster)
[17:46:13] (CR) Andrew Bogott: [C: 2] "Sorry for the delay in merging!"
[operations/puppet] - https://gerrit.wikimedia.org/r/106907 (owner: Stwalkerster)
[17:47:11] (PS1) Dzahn: put contacts.wm.org behind misc. varnish [operations/puppet] - https://gerrit.wikimedia.org/r/146823
[17:49:48] (PS2) Dzahn: put contacts.wm.org behind misc. varnish [operations/puppet] - https://gerrit.wikimedia.org/r/146823
[17:50:58] (PS2) Cmjohnson: removing osm-db1001/2 entries changing all to labsdb1006/7 [operations/dns] - https://gerrit.wikimedia.org/r/146822
[17:52:00] (CR) Dzahn: [C: 2] enhance mailman monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/146796 (owner: Dzahn)
[17:52:27] PROBLEM - DPKG on osm-db1001 is CRITICAL: Timeout while attempting connection
[17:52:27] PROBLEM - RAID on osm-db1001 is CRITICAL: Timeout while attempting connection
[17:52:28] PROBLEM - check configured eth on osm-db1001 is CRITICAL: Timeout while attempting connection
[17:52:28] PROBLEM - puppet disabled on osm-db1001 is CRITICAL: Timeout while attempting connection
[17:52:37] PROBLEM - check if dhclient is running on osm-db1001 is CRITICAL: Timeout while attempting connection
[17:52:37] PROBLEM - puppet last run on osm-db1001 is CRITICAL: Timeout while attempting connection
[17:52:58] my merge should now also fix puppet fail on sodium
[17:54:07] PROBLEM - Host osm-db1001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:54:08] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[17:55:17] PROBLEM - puppet last run on osm-db1002 is CRITICAL: Timeout while attempting connection
[17:55:28] PROBLEM - RAID on osm-db1002 is CRITICAL: Timeout while attempting connection
[17:55:28] PROBLEM - check configured eth on osm-db1002 is CRITICAL: Timeout while attempting connection
[17:55:37] PROBLEM - Disk space on osm-db1002 is CRITICAL: Timeout while attempting connection
[17:55:37] PROBLEM - check if dhclient is running on osm-db1002 is CRITICAL: Timeout while attempting
connection
[17:55:37] PROBLEM - DPKG on osm-db1002 is CRITICAL: Timeout while attempting connection
[17:55:37] PROBLEM - puppet disabled on osm-db1002 is CRITICAL: Timeout while attempting connection
[17:55:54] cleaning these now
[17:56:27] mutante: FWIW pondering switch to weechat
[17:56:57] PROBLEM - Host osm-db1002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:57:06] mary: i see.. never tried.. didn't feel the need to change things
[17:57:07] !log clean puppet stored config database for osm-db100{1,2}.eqiad.wmnet, updating icinga
[17:57:11] Logged the message, Master
[18:00:04] yurik: The time is nigh to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140716T1800)
[18:01:37] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 2 processes with regex args /mailman/bin/mailmanctl
[18:01:47] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 9 processes with regex args /mailman/bin/qrunner
[18:01:47] mutante: ^
[18:01:58] matanya: wow you are fast
[18:02:45] akosiaris: the vacation made you forget that? :P
[18:02:47] RECOVERY - RAID on db1019 is OK: OK: optimal, 1 logical, 2 physical
[18:03:17] It made me forget a lot of things :)
[18:03:22] but I am catching up
[18:03:34] same issue, the check finds itself
[18:03:39] it's actually always -1
[18:03:46] i assumed so
[18:03:48] 1 and 8 , not 2 and 9
[18:04:12] we should just use grep and grep -v
[18:04:18] that never fails
[18:04:40] nah, i just need to start with ^
[18:04:46] and put the entire cmdline in there
[18:04:49] _joe_: did the logging location for the job runner change?
[18:05:06] akosiaris: i pilled your mailbox, feel free to ignore, or review.
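The off-by-one in the mailman checks ("1 and 8, not 2 and 9") can be reproduced offline. A minimal sketch in Python, assuming three made-up command lines; nothing here runs the real check_procs plugin, and the full interpreter path is a guess. An unanchored regex on process arguments also matches the checker's own command line, because the pattern appears in its arguments, while a pattern anchored with ^ and spelling out the whole command line does not.

```python
import re

# Hypothetical process table. The last entry stands in for the monitoring
# check itself, whose own arguments contain the pattern being searched for.
PROCESSES = [
    "/usr/bin/python /var/lib/mailman/bin/mailmanctl -s start",
    "/usr/sbin/apache2 -k start",
    "check_procs -c 1:1 --ereg-argument-array=/mailman/bin/mailmanctl",
]


def count_matches(pattern, processes):
    """Count command lines that the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return sum(1 for cmdline in processes if regex.search(cmdline))


# Unanchored: matches the daemon and the check that is looking for it.
unanchored = count_matches(r"/mailman/bin/mailmanctl", PROCESSES)
# Anchored at the start with the full command line: matches the daemon only.
anchored = count_matches(
    r"^/usr/bin/python /var/lib/mailman/bin/mailmanctl", PROCESSES
)
```

The change merged later in the log takes a different route and restricts the check to processes owned by the list user, which excludes the checker for the same reason.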
[18:05:22] mutante: also see -u argument, it might help
[18:05:27] <_joe_> AaronSchulz: /var/log/mediawiki/
[18:05:31] right, just found it
[18:05:37] matanya: yeah, I am witnessing that right now :-)
[18:05:44] <_joe_> AaronSchulz: sorry I merged it this morning
[18:06:24] the "Runner loop 0 process in slot 0 timed out:" errors look bogus...otherwise fine
[18:06:40] <_joe_> yes saw those
[18:06:42] (PS1) Filippo Giunchedi: releases: add reprepro repository [operations/puppet] - https://gerrit.wikimedia.org/r/146826
[18:06:52] akosiaris: yea, cool, -u list is a good one too
[18:08:29] (PS2) Scottlee: Fixed Puppet 3 lint issues relating to git ipython ishmael and coredb_mysql modules. [operations/puppet] - https://gerrit.wikimedia.org/r/146685
[18:09:03] jouncebot: refresh
[18:09:05] I refreshed my knowledge about deployments.
[18:09:17] jouncebot: yeah, cuz you missed that I moved the zero up to 9am
[18:09:57] greg-g: talking to a bot is not a kind of a turing test
[18:10:25] matanya: he always just ignores me, typical
[18:10:45] bots, what you expect
[18:11:06] he is not even telling people to deploy in a nice request
[18:11:23] (CR) Andrew Bogott: [C: 2] Fixed Puppet 3 lint issues relating to git ipython ishmael and coredb_mysql modules. [operations/puppet] - https://gerrit.wikimedia.org/r/146685 (owner: Scottlee)
[18:11:48] (CR) Filippo Giunchedi: "the basic functionality is there (i.e. signed uploads and signed repository) let me know what you think!" [operations/puppet] - https://gerrit.wikimedia.org/r/146826 (owner: Filippo Giunchedi)
[18:12:35] greg-g: "The time is nigh" is not friendly.
can't he say: sir x, please, your highness, deploy y, at your service jounce
[18:12:57] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Wed Jul 16 18:12:47 UTC 2014
[18:13:10] (CR) Andrew Bogott: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/146685 (owner: Scottlee)
[18:13:16] matanya: the deploy butler
[18:13:21] (CR) Ori.livneh: "This looks great. Just one small suggestion: do what we did with the apache-config migration. That is: include sites.pp everywhere, and le" [operations/puppet] - https://gerrit.wikimedia.org/r/146082 (owner: Giuseppe Lavagetto)
[18:13:24] ^ _joe_
[18:13:44] matanya: It's not even correct it usually only pokes after the window started :D
[18:13:57] It should have a snooze function, btw :D
[18:14:21] that would be great greg-g. if he can look like jenkins, even better. we should send him to James_Fto learn some british manners
[18:14:30] :)
[18:14:33] * James_F grins.
[18:14:42] Manners are bred not taught.
[18:14:43] jouncebot: set honorary_title matanya your_highness
[18:14:44] ;-)
[18:26:04] (PS1) Dzahn: mailman monitor-only check procs by list user [operations/puppet] - https://gerrit.wikimedia.org/r/146835
[18:32:12] _joe_: are you up for deploying the apache sites change?
[18:32:19] i think it's good apart from the small thing i pointed out
[18:32:40] <_joe_> ori: in a few
[18:32:44] cool
[18:33:05] (CR) Dzahn: [C: 2] mailman monitor-only check procs by list user [operations/puppet] - https://gerrit.wikimedia.org/r/146835 (owner: Dzahn)
[18:33:22] (PS1) Aaron Schulz: Set the statsd server for the jobrunners.
[operations/puppet] - https://gerrit.wikimedia.org/r/146837
[18:37:24] (PS5) Mwalker: New Variables for OCG Service [operations/puppet] - https://gerrit.wikimedia.org/r/144610
[18:41:26] (PS6) Mwalker: New Variables for OCG Service [operations/puppet] - https://gerrit.wikimedia.org/r/144610
[18:41:42] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[18:42:42] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl
[18:43:01] (CR) Giuseppe Lavagetto: "Including mediawiki::web::sites would trigger an apache reload everywhere, which is not a problem in general but not during such a delicat" [operations/puppet] - https://gerrit.wikimedia.org/r/146082 (owner: Giuseppe Lavagetto)
[18:45:01] hey q for some .deb folks
[18:45:08] maybe godog
[18:45:12] or akosiaris
[18:45:18] (CR) Ori.livneh: [C: 1] "@_joe_: OK, makes sense then." [operations/puppet] - https://gerrit.wikimedia.org/r/146082 (owner: Giuseppe Lavagetto)
[18:45:25] I need a .deb package of libjs-extjs 2.2
[18:45:33] for a CDH web gui to work
[18:46:02] mary has backported the 3.x packaging to work with 2.2
[18:46:11] we've got a .deb that works for us
[18:46:31] 3.x is available in ubuntu/debian already
[18:46:41] can we just build this .deb and then add it to our apt repo?
[18:46:58] or do we need to have a operations/debs/libjs-extjs repository with this in it, complete with git-buildpackage, etc.?
[18:49:04] ottomata: is it something that we are going to maintain/change? or used elsewhere?
anyway I'd say if it is a "one off" we can upload the .deb and .dsc
[18:49:22] its pretty mucha one off, its just so this oozie web GUI will work
[18:49:38] unless Oozie upgrades sometime in the future, and we upgrade oozie, then it will likely stay the same
[18:49:49] i don't think much work is going into the oozie web console, its more of just a nice way to see jobs in oozie
[18:50:37] http://dom.as/2012/06/26/memsql-rage/
[18:50:41] "no I just switched to a real database." ... lol
[18:52:19] ottomata: ye I think it is fine, bear mind though that if something else needs it our version will be preferred over the ubuntu one
[18:52:54] ottomata: perhaps we should rename it ?
[18:53:05] ottomata: oozie-extjs (or some such)
[18:53:23] let's jsut put the version name in the package
[18:53:24] (PS3) Cmjohnson: removing osm-db1001/2 entries changing all to labsdb1006/7 [operations/dns] - https://gerrit.wikimedia.org/r/146822
[18:53:25] s'ok?
[18:53:32] libjs-extjs2
[18:53:32] ?
[18:53:37] wfm
[18:53:39] k
[18:54:02] cool, ja, mary, rebuild, mabye see if you can find the default css theme somewhere and just add it
[18:54:03] * gwicke is following the 'how to get a deb into production' discussion with keen interest
[18:54:06] haha
[18:54:08] k
[18:54:09] uh oh!
[18:54:12] :p
[18:54:22] ;)
[18:54:27] gwicke, note that akosiaris has not chimed in, and faidon is not here
[18:54:28] haha
[18:54:32] (PS4) Cmjohnson: removing osm-db1001/2 entries changing all to labsdb1006/7 [operations/dns] - https://gerrit.wikimedia.org/r/146822
[18:55:03] also note that this is a simple one of crappy deb for something that will probably never change
[18:55:10] well, its not even crappy, its a backport!
[18:55:18] wait, that is not the proper term...
[18:55:24] yeah, I'm interested in similar stuff [18:55:26] like node 0.11 [18:55:51] node is a server and set of libs that many folks could use [18:55:53] (03CR) 10Cmjohnson: [C: 032] removing osm-db1001/2 entries changing all to labsdb1006/7 [operations/dns] - 10https://gerrit.wikimedia.org/r/146822 (owner: 10Cmjohnson) [18:56:40] ottomata: yeah, although if packaged similar to python with a node-0.11 executable it'd be something that can coexist with regular node [18:57:35] anyway, back to lurking ;) [19:07:30] (03CR) 10Jgreen: [C: 032 V: 031] New Variables for OCG Service [operations/puppet] - 10https://gerrit.wikimedia.org/r/144610 (owner: 10Mwalker) [19:09:36] (03PS1) 10Aaron Schulz: Move more runners over to the new job loop [operations/puppet] - 10https://gerrit.wikimedia.org/r/146847 [19:13:55] (03CR) 10GWicke: "> the "processincoming" cron is commented out, I think it'd be better to have people run it via sudo after the upload, on the ground that " [operations/puppet] - 10https://gerrit.wikimedia.org/r/146826 (owner: 10Filippo Giunchedi) [19:14:16] (03CR) 10GWicke: [C: 031] releases: add reprepro repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/146826 (owner: 10Filippo Giunchedi) [19:21:28] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [19:21:43] !log temp disabling puppet on analytics1027 [19:21:46] Logged the message, Master [19:22:31] (03PS1) 10Bsitu: Enable job queue to process notification in test/test2 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146849 [19:25:07] (03PS1) 10coren: Tool Labs: first draft of the bigbrother daemon [operations/puppet] - 10https://gerrit.wikimedia.org/r/146851 [19:25:29] (03PS2) 10Giuseppe Lavagetto: Move more runners over to the new job loop [operations/puppet] - 10https://gerrit.wikimedia.org/r/146847 (owner: 10Aaron Schulz) [19:25:57] <_joe_> AaronSchulz: ok to merge ^ [19:26:46] (03CR) 10coren: [C: 032] "Noop at worse if it fails" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/146851 (owner: 10coren) [19:28:24] <^d> Coren: "bigbrother daemon?" [19:28:27] <^d> NSA side project? [19:28:29] <^d> ;-) [19:29:40] (03PS3) 10Giuseppe Lavagetto: Move more runners over to the new job loop [operations/puppet] - 10https://gerrit.wikimedia.org/r/146847 (owner: 10Aaron Schulz) [19:29:54] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146847 (owner: 10Aaron Schulz) [19:30:41] <_joe_> come on jenkins [19:38:18] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures [19:39:25] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:43:49] <_joe_> tin is not happy [19:45:01] <^d> What's up? [19:45:11] puppet [19:49:42] (03PS1) 10BBlack: Add ocg service IP in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/146858 [19:50:08] (03CR) 10BBlack: [C: 032] Add ocg service IP in eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/146858 (owner: 10BBlack) [19:50:13] (03PS1) 10Rush: route incoming mail for phabricator.wm.org to iridium [operations/puppet] - 10https://gerrit.wikimedia.org/r/146859 [19:51:13] (03PS1) 10BBlack: Offline Content Generator LVS config [operations/puppet] - 10https://gerrit.wikimedia.org/r/146860 [19:53:35] (03CR) 10Tim Landscheidt: Tool Labs: first draft of the bigbrother daemon (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146851 (owner: 10coren) [19:54:13] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:54:49] (03CR) 10coren: Tool Labs: first draft of the bigbrother daemon (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146851 (owner: 10coren) [19:56:51] !log reenabling puppet on analytics1027 [19:56:57] Logged the message, Master [19:59:33] (03PS2) 10BBlack: Offline Content Generator LVS config [operations/puppet] - 
10https://gerrit.wikimedia.org/r/146860 [20:00:04] gwicke, subbu, cscott: The time is nigh to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140716T2000) [20:07:17] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures [20:08:51] (03PS2) 10Aaron Schulz: Set the statsd server for the jobrunners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146837 [20:09:13] (03CR) 10jenkins-bot: [V: 04-1] Set the statsd server for the jobrunners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146837 (owner: 10Aaron Schulz) [20:09:20] (03PS3) 10BBlack: Offline Content Generator LVS config [operations/puppet] - 10https://gerrit.wikimedia.org/r/146860 [20:11:41] (03PS4) 10BBlack: Offline Content Generator LVS config [operations/puppet] - 10https://gerrit.wikimedia.org/r/146860 [20:13:18] can someone in ops help me with parsoid deploy? [20:14:40] andrewbogott, bd808 ? [20:14:51] only 23 of 24 minions fetched and checked out code with git deploy [20:15:14] subbu: I can log in and look at things for you, but I don't know much about what git deploy is supposed to do. [20:15:17] subbu: you could try 'r' for retry [20:15:21] (03CR) 10BBlack: [C: 032] Offline Content Generator LVS config [operations/puppet] - 10https://gerrit.wikimedia.org/r/146860 (owner: 10BBlack) [20:15:36] i tried and it didn't work .. let me resync and see if that changes anything. [20:15:44] don't know which one failed. [20:15:46] detailed report should show error code [20:16:21] ah, ok, let me try. [20:17:50] after hitting 'r' a few times, it went through. 
[20:22:04] !log deploy parsoid 060dcb54 [20:22:10] Logged the message, Master [20:27:54] (03PS3) 10Aaron Schulz: Set the statsd server for the jobrunners [operations/puppet] - 10https://gerrit.wikimedia.org/r/146837 [20:31:20] (03PS1) 10Ottomata: Wrap ganglia::plugin::python { 'diskstat': } in if !defined in hadoop and kafka roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/146905 [20:33:08] gwicke, cscott, subbu, are you deploying now? [20:33:26] all done. [20:33:56] (03CR) 10Ottomata: [C: 032 V: 032] Wrap ganglia::plugin::python { 'diskstat': } in if !defined in hadoop and kafka roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/146905 (owner: 10Ottomata) [20:34:38] greg-g, do you think i could get our stuff out now? [20:34:51] parsoid seems to be done [20:35:18] yurikJerusalem: yeah, godspeed [20:35:42] jenkins looks better [20:36:03] hope so ) [20:36:57] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: (null) [20:37:22] <_joe_> ocg? [20:37:24] (03PS1) 10Ottomata: Use hadoop config class in oozie role to find oozie host in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/146913 [20:37:25] wha? [20:37:43] subbu: you were deploying from tin? [20:37:47] <_joe_> jgage: do you know what that is? [20:37:55] andrewbogott, yes. [20:38:04] _joe_ no, looking.. [20:38:11] I had the same question, "what is it" [20:38:17] <_joe_> it's parsoid-related I guess [20:38:23] <_joe_> given we've just had a release [20:38:26] Hm, puppet is failing on tin due to "salt-call saltutil.sync_all" [20:38:48] it times out during "Loading fresh modules for state activity" [20:39:06] andrewbogott, anything i should be concerned about? [20:39:17] subbu: no, I think the issue predates your deploy [20:39:28] k [20:39:34] I just wonder what it's about. And I'm a little surprised your deploy didn't hang in the same way [20:39:46] <_joe_> andrewbogott: can you look into that? 
[20:39:49] maybe that explains the syncing issues i had a few times before it went through? [20:40:00] _joe_: yep, looking now [20:40:18] no, ocg is me [20:40:28] (03CR) 10Ottomata: [C: 032 V: 032] Use hadoop config class in oozie role to find oozie host in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/146913 (owner: 10Ottomata) [20:40:30] ignore it, it's brand-new and in testing, and apparently the icinga check is broken [20:40:49] ok [20:40:59] what's ocg stand for? [20:40:59] <_joe_> bblack: oh man :) [20:41:05] <_joe_> we got paged :P [20:41:12] yeah I saw :) [20:41:25] too bad new services aren't automatically downtimed for a hour! [20:42:38] got paged about OCG [20:42:46] ok, that :) [20:42:55] (03PS1) 10Ottomata: Use hadoop config class in hive and hue roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/146914 [20:42:59] ACKNOWLEDGEMENT - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: (null) Brandon Black New service, not prod-critical just yet, and icinga check is somehow broken :P [20:42:59] jgage: offline content generator [20:43:14] thanks mutante [20:43:22] (03CR) 10Ottomata: [C: 032 V: 032] Use hadoop config class in hive and hue roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/146914 (owner: 10Ottomata) [20:43:32] * apergos ignores the page that just arrived too [20:43:34] jgage: "the new pdf" [20:43:47] yeah ocg stands for Offline Content Generator [20:44:07] ok good to know [20:44:10] its not quite the new pdf [20:44:13] and also, the service check should have worked, it's kind of annoying that I can't yet see what I did wrong [20:44:18] it was a similar service though [20:44:23] dunno if thats changeD? 
[20:44:40] it will be the new pdf [20:44:43] ahh, cool [20:44:44] it's not production yet [20:44:45] gtk [20:45:23] (but close enough that they want prod infrastructure like LVS figured out and ready) [20:49:08] (03PS1) 10Alexandros Kosiaris: 0.8.0-1 debianization release [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/146918 [20:49:10] (03PS1) 10Alexandros Kosiaris: Upping revision to 0.8.0-2 with support for setting ulimit open files [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/146919 [20:49:12] (03PS1) 10Alexandros Kosiaris: Update to 0.8.1.1 [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/146920 [20:49:26] there are also some other smaller services that we could piggy-back on the pdf machines [20:49:46] like citoid & mathoid [20:52:07] RobH, bblack ^ [20:52:10] (03PS1) 10Dzahn: checkcommand to test for strings on https URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/146923 [20:53:13] afaik it's supposed to create other formats besides PDF that are for offline reading.. i think EPUB (and more?).. hence it has a generic name [20:53:35] yup [20:54:00] (03CR) 10Alexandros Kosiaris: "cherry-picked from debian-0.8.0 as it was missing on the debian branch. Actual changes from Andrew Otto" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/146918 (owner: 10Alexandros Kosiaris) [20:54:23] (03CR) 10Alexandros Kosiaris: "cherry-picked from debian-0.8.0 as it was missing on the debian branch. 
Actual changes from Andrew Otto" [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/146919 (owner: 10Alexandros Kosiaris) [20:55:13] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:55:37] (03PS2) 10Dzahn: checkcommand to test for strings on https URLs [operations/puppet] - 10https://gerrit.wikimedia.org/r/146923 [20:56:01] !log yurik Synchronized php-1.24wmf12/extensions/: update to JsonConfig, ZeroBanner, ZeroPortal (duration: 04m 54s) [20:56:06] Logged the message, Master [20:58:23] (03PS1) 10Mwalker: Enable Petition Extension on Labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146925 [20:58:39] (03CR) 10Dzahn: [C: 04-1] "with these commands it doesn't make a difference if you check "pipermail/wikimedia-l" or "pipermail/nonexistent-foo", they are both OK. Ju" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 (owner: 10Matanya) [20:59:04] greg-g, two questions; 1) has the petition extension passed all its reviews; and 2) if it has, can this go out in the swat: https://gerrit.wikimedia.org/r/146925 [20:59:35] (03PS1) 10Rush: mysql @var template linting [operations/puppet] - 10https://gerrit.wikimedia.org/r/146927 [21:00:09] (03CR) 10Dzahn: [C: 032] "for add. 
mailman monitoring of archive/listinfo URL" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146923 (owner: 10Dzahn) [21:00:59] mutante: fix my patch as well, if you are at it already [21:01:44] matanya: yes, on it [21:01:54] !log yurik Synchronized php-1.24wmf13/extensions/: update to JsonConfig, ZeroBanner, ZeroPortal (duration: 04m 53s) [21:01:59] Logged the message, Master [21:02:10] thanks mutante [21:03:14] (03PS1) 10BBlack: fix semicolon escaping in ocg cleanup commands [operations/puppet] - 10https://gerrit.wikimedia.org/r/146928 [21:04:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Looks nice, minor inline comments" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/145018 (owner: 10ArielGlenn) [21:07:27] (03PS3) 10BBlack: Kill $project-lb.$site.wikimedia.org and free IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/140149 (owner: 10Faidon Liambotis) [21:07:43] (03PS4) 10Dzahn: mailman: monitor web and archive access [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 (owner: 10Matanya) [21:07:59] (03CR) 10BBlack: [C: 032] fix semicolon escaping in ocg cleanup commands [operations/puppet] - 10https://gerrit.wikimedia.org/r/146928 (owner: 10BBlack) [21:08:13] !log clearing Magog the Ogre's watchlist on enwp per request (173668 entries) [21:08:18] Logged the message, Master [21:08:55] (03CR) 10Chad: [C: 031] retab role/gerrit.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/146087 (owner: 10Dzahn) [21:10:28] (03PS1) 10Rush: phabricator inbound email handler [operations/puppet] - 10https://gerrit.wikimedia.org/r/146932 [21:10:33] mwalker: 1) just asked on the security review one https://bugzilla.wikimedia.org/show_bug.cgi?id=65850 [21:10:40] 2) if Chris says yes, doit [21:14:10] (03CR) 10jenkins-bot: [V: 04-1] phabricator inbound email handler [operations/puppet] - 10https://gerrit.wikimedia.org/r/146932 (owner: 10Rush) [21:17:16] (03PS1) 10Mwalker: Start OCG from a single location / script [operations/puppet] 
- 10https://gerrit.wikimedia.org/r/146934 [21:17:23] (03CR) 10Dzahn: "this translates to actual command lines like:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 (owner: 10Matanya) [21:18:29] matanya: ugh, so now it works for the archives, but not for /pipermail/ [21:18:47] matanya: because.. pipermail links actively redirect from https to http [21:18:57] not the other way around [21:19:04] that should be fixed too [21:19:18] arg, i meant "works for the listinfo page but not for the arcives" [21:19:52] matanya: you know, there is a patch ..just happens to be there .heh [21:20:11] we can split this patch [21:20:19] merge the part that is working [21:20:20] because of: https://gerrit.wikimedia.org/r/#/c/145616/ [21:20:27] he wanted that for STS [21:20:42] which is good [21:20:45] before we even had the incident [21:20:47] yea [21:21:03] but mailman cgi doesn't do STS, yeah ? [21:21:06] (03PS2) 10Mwalker: Start OCG from a single location / script [operations/puppet] - 10https://gerrit.wikimedia.org/r/146934 [21:21:48] matanya: jzerebecki's patch is a prerequisite for https://gerrit.wikimedia.org/r/#/c/145500/ [21:25:06] I see. limbo. [21:25:39] (03PS5) 10Dzahn: mailman: monitor listinfo page (of wikimedia-l) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 (owner: 10Matanya) [21:27:39] (03CR) 10Dzahn: [C: 032] mailman: monitor listinfo page (of wikimedia-l) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 (owner: 10Matanya) [21:28:02] thanks a lot [21:30:04] matanya: :) i said it wrong _again_, haha, in the commit message [21:30:13] https->http [21:30:16] that is what it does [21:30:33] Coren: RobH right, so I guess we wait for mark to say 'yeah, ok' on the RT thread before the actual provisioning starts? 
[21:30:37] yeah, people get it awyway [21:30:43] https://lists.wikimedia.org/pipermail/wikimedia-l/ [21:30:45] hi AnnaKoval [21:32:36] (03PS2) 10Rush: phabricator inbound email handler [operations/puppet] - 10https://gerrit.wikimedia.org/r/146932 [21:33:07] (03CR) 10Dzahn: "works at https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=sodium&service=mailman+list+info" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146762 (owner: 10Matanya) [21:34:19] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures [21:34:50] CUSTOM - mailman list info on sodium is OK: HTTP OK: HTTP/1.1 200 OK - 10533 bytes in 0.092 second response time [21:35:18] (03CR) 10BBlack: [C: 032] "I've sniffed for related DNS and HTTP traffic on the authservers and LVSes, respectively. There is some traffic for these hostnames and I" [operations/dns] - 10https://gerrit.wikimedia.org/r/140149 (owner: 10Faidon Liambotis) [21:38:08] (03PS2) 10Cmjohnson: Repurposing osm-db100{1,2} as labsdb100{6,7} [operations/puppet] - 10https://gerrit.wikimedia.org/r/137691 (owner: 10Alexandros Kosiaris) [21:43:43] (03PS1) 10Dzahn: mailman - monitor archive page (of wikimedia-l) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146943 [21:44:35] (03CR) 10Cmjohnson: [C: 032] Repurposing osm-db100{1,2} as labsdb100{6,7} [operations/puppet] - 10https://gerrit.wikimedia.org/r/137691 (owner: 10Alexandros Kosiaris) [21:50:45] !log Updated job runners to 186b9b33 [21:50:50] Logged the message, Master [21:51:20] greg-g: should i add you as gerrit reviewer to all changes that add new monitoring after incidents ? [21:51:55] greg-g: ..not to actually review but because you want to know when they are done? [21:52:25] mutante: uh, sure, why not [21:54:15] ok! [21:55:00] greg-g: f.e. for lists/sodium, we did already have a check.. just .. 
it did not actually trigger [21:55:06] explanations on gerrit [21:55:25] :) [21:55:47] (03PS2) 10BBlack: Add text-lb.wikimedia.org and switch CNAMEs to it [operations/dns] - 10https://gerrit.wikimedia.org/r/140391 (owner: 10Faidon Liambotis) [21:56:24] (03CR) 10Dzahn: [C: 032] mailman - monitor archive page (of wikimedia-l) [operations/puppet] - 10https://gerrit.wikimedia.org/r/146943 (owner: 10Dzahn) [21:57:49] (03CR) 10Dzahn: "when this gets merged, please adjust I77c2f56eedd86271 or remind me to" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145616 (owner: 10JanZerebecki) [21:59:00] (03PS3) 10BBlack: Add text-lb.wikimedia.org and switch CNAMEs to it [operations/dns] - 10https://gerrit.wikimedia.org/r/140391 (owner: 10Faidon Liambotis) [22:00:25] (03CR) 10BBlack: [C: 032] Add text-lb.wikimedia.org and switch CNAMEs to it [operations/dns] - 10https://gerrit.wikimedia.org/r/140391 (owner: 10Faidon Liambotis) [22:01:22] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [22:01:54] Notice: Finished catalog run in 44.86 seconds [22:02:21] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [22:02:32] icinga-wm: yea, right [22:04:07] (03CR) 10Dzahn: "works at https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=sodium&service=mailman+archives" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146943 (owner: 10Dzahn) [22:04:15] greg-g, found a few minor issues in prod, when can i push them? [22:04:50] yurikJerusalem: what's up? 
[22:05:09] greg-g, incorrect banners showing to those whom it shouldn't [22:05:30] yurikJerusalem: go for it, nothing for 55 minutes [22:05:32] greg-g, https://gerrit.wikimedia.org/r/#/c/146948/ [22:05:35] ok [22:08:03] (03PS4) 10BBlack: Kill $project-lb.wikimedia.org IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/140136 (owner: 10Faidon Liambotis) [22:08:17] mwalker: per the bug, not yet re petition [22:12:08] (03PS1) 10Gergő Tisza: Remove MediaViewer survey-related settings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146951 [22:13:10] (03PS2) 10BBlack: beta::natfix removal step 1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146090 [22:13:17] (03CR) 10BBlack: [C: 032 V: 032] beta::natfix removal step 1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/146090 (owner: 10BBlack) [22:13:56] (03PS1) 10Andrew Bogott: Increase timeout for run of deployment_server_sync_all. [operations/puppet] - 10https://gerrit.wikimedia.org/r/146952 [22:15:48] (03CR) 10Andrew Bogott: [C: 032] Increase timeout for run of deployment_server_sync_all. [operations/puppet] - 10https://gerrit.wikimedia.org/r/146952 (owner: 10Andrew Bogott) [22:17:00] PROBLEM - puppet last run on search1022 is CRITICAL: CRITICAL: Puppet has 1 failures [22:18:26] !log yurik Synchronized php-1.24wmf12/extensions/ZeroBanner: (no message) (duration: 04m 31s) [22:18:31] Logged the message, Master [22:24:08] (03CR) 10Dzahn: [C: 032] "tested with puppet-compiler:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/146087 (owner: 10Dzahn) [22:27:12] !log restarted jobrunner service on all job runners [22:27:16] Logged the message, Master [22:27:17] ^ AaronSchulz [22:27:18] YuviPanda: how about making the change on dynamicproxy? [22:27:27] hmm? [22:27:49] YuviPanda: i'd do this: https://gerrit.wikimedia.org/r/#/c/146466/ [22:28:09] ah, sure I can apply puppet + watch, mutante [22:28:37] mutante: want me to get to the hosts now? 
[22:28:41] YuviPanda: that would be cool, it's what we also removed on nginx cluster [22:28:51] the ciphers were copied from there [22:28:55] YuviPanda: sure :) [22:29:14] mutante: ok, I'm on 'em both [22:29:18] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:29:50] !log yurik Synchronized php-1.24wmf13/extensions/ZeroBanner: (no message) (duration: 03m 55s) [22:29:56] Logged the message, Master [22:30:37] (03PS3) 10Dzahn: dynamicproxy - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146466 [22:30:49] YuviPanda: arr, rebasing, something else got changed earlier [22:30:53] just a sec [22:30:57] mutante: ok [22:32:58] RECOVERY - puppet last run on search1022 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:34:40] !log temporarily fixed puppet on tin by restarting salt-master and salt-minion. A proper fix would involve upgrading to a salt version that fixes https://github.com/saltstack/salt/issues/6306 [22:34:44] Logged the message, Master [22:35:05] (03CR) 10Dzahn: [C: 032] dynamicproxy - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146466 (owner: 10Dzahn) [22:36:21] YuviPanda: now [22:36:37] mutante: doing on general webproxy now [22:36:51] graceful should be enough [22:36:58] (unless puppet does it anyways) [22:37:38] mutante: done. can you check? [22:38:23] YuviPanda: my labs project still reachable behind it [22:38:24] mutante: use https://utrs.wmflabs.org/ for testing [22:38:27] cool [22:38:48] mutante: on toollabs now [22:40:44] mutante: done for toollabs as well [22:40:44] woot [22:41:14] YuviPanda: thanks!:) [22:41:21] mutante: yw! :D [22:41:35] mutante: btw, eventually I want to just use the nginx::ssl thing in the nginx module, so these won't have to be done twice [22:41:38] YuviPanda: https://www.ssllabs.com/ssltest/analyze.html?d=utrs.wmflabs.org [22:41:41] straight A [22:41:53] :D YEAAAA! 
:D [22:42:56] that change was noop, that cipher was already disabled, but all is good [22:43:09] right [22:43:33] YuviPanda: i was wondering why we still edit ./templates/nginx/ for prod [22:43:40] and not module/nginx/templates/ [22:43:42] dr0ptp4kt, MaxSem, that patch didn't go through [22:43:47] nginx module isn't used in prod, I think? [22:44:04] yea, i was just thinking it'd be nice if it was [22:44:06] just one module [22:44:11] dr0ptp4kt, MaxSem https://gerrit.wikimedia.org/r/#/c/146948/ [22:44:14] mutante: yup [22:44:14] yurikR, recreate the submodule on tin? [22:44:16] waiting for it to merge [22:44:20] mutante: I moved dymnamicproxy to it a while ago [22:45:10] mutante: would be nice to get 100 on all of 'em tho https://www.ssllabs.com/ssltest/analyze.html?d=tools.wmflabs.org [22:45:25] ergh, jerkins is just stuck [22:45:33] you can merge directly [22:46:56] MaxSem, back in a sec [22:47:17] yurikR, btw you must round up in 14 minutes [22:47:42] YuviPanda: of course, yea ;) i don't think i have seen any 100/100/100 yet [22:57:27] PROBLEM - Recursive DNS on 91.198.174.6 is CRITICAL: Domain www.wikipedia.org was not found by the server [22:58:02] bblack: ^ouch ? [22:59:05] hmmm [22:59:35] oh, somebody reinstalling it? [22:59:38] that's nescio [23:00:00] Yeah I have someone in #wikipedia saying they're having issues resolving some wikipediae [23:00:04] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140716T2300) [23:00:06] I don't know about reinstalling. that is nescio, but it should be getting correct resolution [23:00:06] https://dpaste.de/Vp8g [23:00:32] bblack: i said reinstalling because of https://rt.wikimedia.org/Ticket/Display.html?id=7876 [23:01:14] it's pdns, restart it? [23:01:27] RECOVERY - Recursive DNS on 91.198.174.6 is OK: DNS OK: 0.104 seconds response time. 
www.wikipedia.org returns 91.198.174.192 [23:01:33] cool [23:02:16] did someone restart it? [23:02:24] I was still trying to find the problem [23:02:32] i did not [23:02:44] The reporter says things are working. [23:02:48] (because there was probably nothing wrong with nescio, and whatever was going on there might be happening to a lot of other caches) [23:03:11] Is...anyone SWATting? [23:03:28] Oh, no patches. [23:03:31] * marktraceur walks away [23:04:35] oh ok, I think I've come up with a plausible scenario now [23:05:07] (all, nzerob_ reported the issue, nzerob_, bblack is investigating) [23:05:29] yurikR: any update? [23:05:39] to recap the relevant bits of the DNS change today: essentially a bunch of our CNAMEs changed to point at a new name. e.g. www -> mediawiki-lb.wikimedia.org. became www -> text-lb.wikimedia.org. [23:06:18] in the exact same changeset, text-lb.wikimedia.org was added as a new hostname [23:06:45] greg-g, running a few min late [23:06:50] I think what I overlooked there is perhaps that caches could have fetched conflicting results *during* the process of updating nameservers [23:07:02] (although I would expect that to be exceedingly rare) [23:07:03] yurikR2: ? still going on your deploy from 55 mins ago? [23:07:24] greg-g, yes, between fighing with jenkins :( [23:07:40] e.g. the patch goes to ns0 ahead of ns1 by 5 seconds. 
during those 5 seconds, someone queries www.wikipedia.org from ns0, gets a pointer to text-lb.wikimedia.org., then queries ns1 for that and it doesn't exist yet [23:08:10] Sounds plausible [23:08:11] if I had thought of that, I should have further split that patch and deployed them separated by a TTL or two [23:08:35] (we had already gone through some revisions splitting this change to avoid issues like this, but I think we missed this case) [23:09:34] in any case, this is a negative-TTL issue, so even if that were to occur, it should have resolved itself for a properly-behaving cache within 10 minutes (our negative-TTL value) [23:10:52] and so it did? (resolve itself even faster, just 3 minutes or something) [23:11:14] well, it did on nescio. I'm still kinda waiting to see if more users chime in or not at this point. [23:11:51] gotcha, i just saw icinga-wm, had not even noticed user reports yet [23:11:51] bblack: https://meta.wikimedia.org/w/index.php?diff=9220354&oldid=9219953&rcid=5435971 seems to exist [23:11:57] (from -tech) [23:12:45] oh that sounds like something else, based on the timestamps [23:12:53] his reports are an hour apart [23:13:00] Huh [23:13:02] (which is our positive TTL) [23:13:02] !log yurik Synchronized php-1.24wmf12/extensions/: update to JsonConfig, ZeroBanner, ZeroPortal (duration: 04m 14s) [23:13:06] Just magically coincided? [23:13:06] Logged the message, Master [23:13:09] Weird. [23:13:19] hmm, he reported it an hour before ? [23:13:22] I don't see anything anywhere else [23:13:23] no, I pushed two separate DNS changes today [23:13:45] the negative-TTL stuff above was about the most-recent one, it could be that the earlier one caused a 1hr problem for some people as well [23:16:01] even the timestamp for that is off, though. maybe the meta timestamp on those edits is wrong? 
[23:16:12] or gitblit is wrong [23:16:23] supposedly the first change merged at 21:06 [23:17:57] !log yurik Synchronized php-1.24wmf13/extensions/: update to JsonConfig, ZeroBanner, ZeroPortal (duration: 04m 03s) [23:18:02] Logged the message, Master [23:18:10] greg-g, done for now, but the testing still going on [23:18:22] :) k [23:18:51] (for reference the two DNS commits are: https://gerrit.wikimedia.org/r/#/c/140149/ -> https://gerrit.wikimedia.org/r/#/c/140391/ ) [23:19:20] (03CR) 10Dzahn: [C: 032] bugzilla - remove DHE ciphers [operations/puppet] - 10https://gerrit.wikimedia.org/r/146461 (owner: 10Dzahn) [23:28:53] bblack, i spoke with varnish ppl today, they clarified on the Vary behavior and showed a cool way to run tests against varnish. [23:29:17] which vary behavior are you talking about? [23:31:45] bblack, just double checking if different sets of Vary headers would work on the same page depending on the value of one of the headers [23:32:17] you mean if you sent different Varys back from the server at different times? [23:33:44] I would expect the behavior to roughly correlate with objectid=hash(url, other stuff, content of varied-headers matching this object's Vary header), and each objectid being distinct in the cache [23:34:18] although then I'm not sure how matching is done once you have the same URL in the cache with two different sets of contributing varied headers [23:35:11] (are you planning on distinct varys between X-CS=ON and the rest?) [23:44:59] i'm having multiple requests return "server not found" [23:45:01] at random [23:47:10] bblack, no, this is mostly about X-CS=ON, and not having any other headers (protocol & forwarded-by), vs X=CS=something and setting vary on both protocol & forwarded [23:47:37] in my mind it should work, but i wanted to double check with varnish devs [23:47:58] thing is, hash is not dependent on the vary header [23:48:06] rschen7754: can you be more specific? [23:48:37] bblack: how so? 
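(Editor's note on the Vary exchange between 23:28 and 23:48.) The point that the hash does not depend on the Vary header can be illustrated with a simplified model: the cache key comes from the URL (and host), and each stored variant under that key remembers which request headers its own Vary named and what values they had. This is only a sketch of that idea, not Varnish code; the class `VaryCache` and the header name `X-Forwarded-Proto` are hypothetical stand-ins for the "protocol" header mentioned in the chat.

```python
class VaryCache:
    """Toy model: one hash key per URL, several variants stored under it."""

    def __init__(self):
        # key -> list of (snapshot of varied request headers, body)
        self._store = {}

    def insert(self, key, req_headers, vary, body):
        # Remember only the request headers that this response's Vary names.
        snapshot = {name: req_headers.get(name) for name in vary}
        self._store.setdefault(key, []).append((snapshot, body))

    def lookup(self, key, req_headers):
        # A variant matches when every header it varied on has the same
        # value in the incoming request; other headers are ignored.
        for snapshot, body in self._store.get(key, []):
            if all(req_headers.get(n) == v for n, v in snapshot.items()):
                return body
        return None


cache = VaryCache()
# Response for X-CS=ON varied only on X-CS.
cache.insert("/page", {"X-CS": "ON", "X-Forwarded-Proto": "https"},
             ["X-CS"], "zero-on")
# Response for another X-CS value varied on X-CS and the protocol header.
cache.insert("/page", {"X-CS": "123", "X-Forwarded-Proto": "http"},
             ["X-CS", "X-Forwarded-Proto"], "carrier-http")
```

In this model the two variants coexist under the same key, and a request with X-CS=ON hits the first variant regardless of protocol, which is the behavior yurik wanted to confirm with the Varnish developers.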
[23:48:53] what requests are you sending? [23:49:08] (03PS5) 10Ricordisamoa: minor changes to InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129464 [23:49:19] i'm hitting refresh on tabs i have open, and clicking on diffs [23:49:30] sometimes it works, and sometimes it refuses to for several minutes [23:50:15] can you paste some URLs on the ones that sometimes-refuse? [23:50:45] yurikR2, are you done deploying? [23:50:52] (and can you characterize the pattern of "refuses for several minutes?" how many times has this pattern repeated, how long does the period of refusal last, how long does it work fine after that, etc?) [23:51:11] https://en.wikipedia.org/wiki/Special:Watchlist [23:51:27] MaxSem, yep [23:51:34] MaxSem, thx for spoting, all's good [23:51:41] https://en.wikipedia.org/w/index.php?diff=617244358&oldid=615753929 [23:52:00] greg-g, I'm going to push a no-op cleanup of PrivateSettings [23:52:38] for about five minutes it works, and then for the next five it doesn't [23:52:54] rschen7754: do you know DNS stuff / can do basic ping/nslookup commands? [23:52:59] yes [23:53:19] any chance you've got nslookup results on one of the failing domains' URLs during the bad period? [23:53:27] (or can grab them on the next iteration?) [23:53:34] ping: cannot resolve en.wikipedia.org: Unknown host [23:54:36] hmmmm [23:55:06] rschen7754: you don't happen to be on a linux/mac host that has "dig" do you? [23:55:15] i'm on mac [23:55:29] !log maxsem Synchronized private: Clean up old mobile cruft (duration: 00m 05s) [23:55:34] Logged the message, Master [23:55:54] can you try "dig en.wikipedia.org" ? [23:56:19] now it seems to work [23:57:20] well you said it was intermittent [23:57:37] can you capture the dig output while it's working, though, and then try to capture it again the next time it's not? [23:57:43] ok [23:58:52] I did deploy two DNS changes today, so it's almost certainly related. 
However, the only real issue I could find in those changes should have self-resolved inside of 10 minutes for the few it might have affected. [23:59:36] (03PS6) 10Ricordisamoa: minor changes to InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129464 [23:59:37] at this point I'm trying to figure out if we're dealing with some strange case of certain remote caches at ISPs misbehaving in the wake of the change, or if there's some other real issue in that pair of changes that hasn't been figured out yet.
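(Editor's note on the scenario bblack describes from 23:05 to 23:09.) The suspected race can be sketched as a toy model: one nameserver already serves the new zone while the other still serves the old one, a resolver follows the new CNAME to the lagging server, gets a "does not exist" answer, and caches that negative answer for the 10-minute negative TTL mentioned in the log. The hostnames under example.org, the 192.0.2.x addresses, and the function `resolve` are all hypothetical, and real resolvers also cache the CNAME itself, which this sketch ignores.

```python
NEGATIVE_TTL = 600  # seconds; the 10-minute negative TTL quoted in the log

# Zone contents before and after the change (placeholder names and IPs).
OLD_ZONE = {
    "www.example.org": ("CNAME", "old-lb.example.org"),
    "old-lb.example.org": ("A", "192.0.2.1"),
}
NEW_ZONE = {
    "www.example.org": ("CNAME", "text-lb.example.org"),
    "old-lb.example.org": ("A", "192.0.2.1"),
    "text-lb.example.org": ("A", "192.0.2.2"),
}


def resolve(neg_cache, name, now, ns_for_cname, ns_for_target):
    """Resolve `name`, asking one server for the CNAME and another for its target.

    Returns an address string, or None when the answer is "not found".
    `neg_cache` maps a hostname to the time its negative entry expires.
    """
    rtype, value = ns_for_cname[name]
    if rtype == "A":
        return value
    expiry = neg_cache.get(value)
    if expiry is not None and now < expiry:
        return None  # still serving the cached negative answer
    record = ns_for_target.get(value)
    if record is None:
        neg_cache[value] = now + NEGATIVE_TTL
        return None
    return record[1]


neg_cache = {}
# t=0: first server is updated, second is not, so the new target is missing.
during_race = resolve(neg_cache, "www.example.org", 0, NEW_ZONE, OLD_ZONE)
# t=5: both servers are updated, but the negative answer is still cached.
after_sync = resolve(neg_cache, "www.example.org", 5, NEW_ZONE, NEW_ZONE)
# t=600: the negative entry has expired and resolution succeeds.
after_ttl = resolve(neg_cache, "www.example.org", 600, NEW_ZONE, NEW_ZONE)
```

Under these assumptions the failure outlives the few seconds of server skew but ends once the negative TTL expires, which matches the expectation that a well-behaved cache would recover within 10 minutes; it would not explain failures that recur for longer than that.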