[00:00:04] <jouncebot>	 aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150505T0000). Please do the needful.
[00:01:03] <grrrit-wm>	 (03PS1) 10Negative24: phabricator: Remove obsolete configs [puppet] - 10https://gerrit.wikimedia.org/r/208848 
[00:08:41] <grrrit-wm>	 (03PS1) 10Yuvipanda: zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 
[00:08:44] <yuvipanda>	 ottomata: ^ :)
[00:10:37] <grrrit-wm>	 (03CR) 10Yuvipanda: "Note that I'm experimenting with mesos and other frameworks that use Zookeeper internally and hence am interested in making this generic :" [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda)
[00:14:51] * aude waves
[00:15:24] <aude>	 running scripts and then shall deploy
[00:17:52] <hoo>	 aude: Good luck... I'm heading to bed
[00:20:09] <aude>	 hoo: good night
[00:20:14] <aude>	 scripts are done
[00:20:30] <hoo>	 ok :)
[00:22:01] <grrrit-wm>	 (03CR) 10Aude: [C: 032] Enable use of subscription tracking on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208675 (owner: 10Aude)
[00:22:10] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable use of subscription tracking on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208675 (owner: 10Aude)
[00:24:08] <logmsgbot>	 !log aude Synchronized wmf-config/Wikibase.php: Enable Wikibase subscription tracking (duration: 00m 12s)
[00:24:18] <morebots>	 Logged the message, Master
[00:24:33] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Make checkers submit hosts as well [puppet] - 10https://gerrit.wikimedia.org/r/208853 
[00:24:46] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Make checkers submit hosts as well [puppet] - 10https://gerrit.wikimedia.org/r/208853 (owner: 10Yuvipanda)
[00:25:20] * aude is done, assuming no problems
[00:27:30] <icinga-wm>	 PROBLEM - Apache HTTP on mw1197 is CRITICAL - Socket timeout after 10 seconds
[00:29:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1197 is CRITICAL - Socket timeout after 10 seconds
[00:30:03] <Melos>	 :D
[00:31:09] <icinga-wm>	 PROBLEM - HHVM queue size on mw1197 is CRITICAL 75.00% of data above the critical threshold [80.0]
[00:31:30] <icinga-wm>	 PROBLEM - HHVM busy threads on mw1197 is CRITICAL 77.78% of data above the critical threshold [115.2]
[00:31:50] <yuvipanda>	 uh oh
[00:31:52] * yuvipanda restarts hhvm
[00:32:18] <yuvipanda>	 !log restarted hhvm on mw1197
[00:32:26] <morebots>	 Logged the message, Master
[00:33:49] <icinga-wm>	 RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 66635 bytes in 1.304 second response time
[00:33:50] <icinga-wm>	 RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.145 second response time
[00:38:13] <wikibugs>	 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1259534 (10hashar) I was really just wondering about pre existing usage of ZeroKeeper. @joe promptly addressed it at T95656#1220342 :-]  Welcome etcd!
[00:39:10] <icinga-wm>	 RECOVERY - HHVM queue size on mw1197 is OK Less than 30.00% above the threshold [10.0]
[00:39:40] <icinga-wm>	 RECOVERY - HHVM busy threads on mw1197 is OK Less than 30.00% above the threshold [76.8]
[00:44:49] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Make checker class inherit toollabs base class [puppet] - 10https://gerrit.wikimedia.org/r/208862 
[00:45:10] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Make checker class inherit toollabs base class [puppet] - 10https://gerrit.wikimedia.org/r/208862 (owner: 10Yuvipanda)
[01:02:02] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Require uwsgi packages for checker [puppet] - 10https://gerrit.wikimedia.org/r/208867 
[01:02:23] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Require uwsgi packages for checker [puppet] - 10https://gerrit.wikimedia.org/r/208867 (owner: 10Yuvipanda)
[01:11:14] <grrrit-wm>	 (03PS1) 10Springle: repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208869 
[01:12:23] <grrrit-wm>	 (03PS2) 10Springle: repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208869 
[01:13:11] <grrrit-wm>	 (03CR) 10Springle: [C: 032] repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208869 (owner: 10Springle)
[01:19:06] <grrrit-wm>	 (03Merged) 10jenkins-bot: repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208869 (owner: 10Springle)
[01:20:13] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: repool db1070, warm up (duration: 00m 19s)
[01:20:22] <morebots>	 Logged the message, Master
[01:21:45] <Negative24>	 twentyafterfour: just so its on your radar, https://gerrit.wikimedia.org/r/#/c/208848/. we may have to implement the alternative for security.allow-outbound-http before wednesday's deployment if that option is a must
[01:31:25] <wikibugs>	 6operations, 6Labs, 10Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1259659 (10coren) I don't think that's likely to be possible in the general case; we might be able - at some cost - to gather a list of files that were written around the right timeframe but, unless we know what w...
[01:36:09] * Fiona looks up wmgUseBits.
[01:37:38] <grrrit-wm>	 (03PS1) 10Springle: depool db1021, move s5 api to db1049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208876 
[01:38:49] <grrrit-wm>	 (03CR) 10Springle: [C: 032] depool db1021, move s5 api to db1049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208876 (owner: 10Springle)
[01:38:54] <grrrit-wm>	 (03Merged) 10jenkins-bot: depool db1021, move s5 api to db1049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208876 (owner: 10Springle)
[01:41:34] <logmsgbot>	 !log springle Synchronized wmf-config/db-eqiad.php: depool db1021, move s5 api to db1049 (duration: 00m 15s)
[01:41:41] <morebots>	 Logged the message, Master
[01:44:29] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:47:29] <yuvipanda>	 ori: btw, <3 require_package
[01:47:35] <ori>	 :)
[01:47:53] <ori>	 thanks, it was very painful to get right, as bd808 can attest
[01:48:10] <yuvipanda>	 heh
[01:48:24] <yuvipanda>	 I like completely ignoring the hard problems and then suddenly being able to benefit from them being solved :D
[01:49:11] <ori>	 it wasn't an intellectually deep problem, just a super-annoying one
[01:49:19] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0]
[01:49:32] <Krenair>	 ori, know anything about bits being broken in beta?
[01:49:36] <yuvipanda>	 ori: super annoying ones are the worst
[01:49:51] <ori>	 Krenair: no, but I saw there was some phab task
[01:50:02] <ori>	 Let me take a look
[01:50:39] <icinga-wm>	 PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail
[01:52:19] <grrrit-wm>	 (03PS1) 10Ori.livneh: Default wmgUseBits to `false` on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208878 
[01:52:29] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] Default wmgUseBits to `false` on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208878 (owner: 10Ori.livneh)
[01:52:34] <grrrit-wm>	 (03Merged) 10jenkins-bot: Default wmgUseBits to `false` on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208878 (owner: 10Ori.livneh)
[01:53:30] <ori>	 Fiona: the config var is temporary
[01:53:37] <ori>	 <Fiona> I've heard that one before.
[01:53:53] <ori>	 I should be able to get rid of it tomorrow
[01:54:14] <ori>	 Krenair: better?
[01:54:35] <Fiona>	 I just read https://phabricator.wikimedia.org/T95448
[01:55:00] <Fiona>	 I had to update my clone of operations-mediawiki-config. It was a whole to-do.
[01:55:25] <ori>	 Max still made us it put it there: http://en.wikipedia.org/w/load.php
[01:55:30] <ori>	 but no bits are involved.
[01:55:35] <Fiona>	 I remember when Domas created bits.
[01:55:42] <Fiona>	 https:// *
[01:55:53] <Krenair>	 http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page still seems to load stuff from bits
[01:55:54] <Fiona>	 Missing a trailing newline!!!
[01:56:05] <ori>	 Krenair: refresh or log in
[01:56:12] <Krenair>	 I did refresh
[01:56:21] <ori>	 well, cache-bust
[01:56:24] <ori>	 ?q=123109283
[01:56:26] <ori>	 or something
[01:56:44] <James_F>	 Krenair: Now works fine for me in incognito and logged-in, BTW.
[01:56:45] <Krenair>	 hmph. that worked
[01:56:54] <Krenair>	 maybe beta was just being slow updating
[01:56:55] <Fiona>	 ori: I saw https://news.ycombinator.com/item?id=9484757 today and thought of you.
[01:57:28] <ori>	 Fiona: cute
[01:57:32] <ori>	 what was the extent of the breakage, James_F?
[01:58:12] <legoktm>	 > Everybody knows PHP is a trickly-typed language. Read the docs people or PHP will take advantage of your gullible ass. 
[01:58:14] <James_F>	 ori: Beta beta was 400-ing.
[02:01:43] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 
[02:02:29] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (owner: 10Yuvipanda)
[02:08:39] <icinga-wm>	 RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:26:48] <logmsgbot>	 !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 08m 20s)
[02:27:00] <morebots>	 Logged the message, Master
[02:31:48] <logmsgbot>	 !log LocalisationUpdate completed (1.26wmf3) at 2015-05-05 02:30:45+00:00
[02:31:56] <morebots>	 Logged the message, Master
[02:38:35] <wikibugs>	 6operations, 10Wikimedia-Site-requests: refreshLinks.php --dfn-only cron jobs do not seem to be running - https://phabricator.wikimedia.org/T97926#1259688 (10PleaseStand) Today, the script did run on dewiki (in s5); see the query results linked from T98110.  I think the differing file permissions did prevent t...
[02:40:09] <bblack>	 I may be way off base here, but: re T97926 ^
[02:40:41] <bblack>	 you know, we did wipe out a bunch of refreshLinks jobs from the jobrunner queue directly on Sunday
[02:41:05] <bblack>	 for enwiki, commonswiki, dewiki, as part of trying to fix the outage issue that day
[02:41:19] <bblack>	 Krenair: ^
[02:41:59] <Krenair>	 that job doesn't run on enwiki AFAIK
[02:42:09] <Krenair>	 well
[02:42:15] <bblack>	 dewiki?
[02:42:31] <Krenair>	 I think it's that particular --dfn-only thing that doesn't run on enwiki, or something
[02:42:45] <Krenair>	 it apparently got run on dewiki
[02:43:00] <Krenair>	 according to PleaseStand's comment
[02:43:12] <bblack>	 right, it would have run at the next opportunity, and technical did run May 3, but got the axe in the job queue during debugging
[02:43:42] <bblack>	 just seems like not-a-coincidence that we killed off refreshlinks jobs on May 3, and someone files a ticket on May 3 about their refreshlinks jobs not having run
[02:43:57] <Krenair>	 Anyway this bug is only concerned about s2/s3 (enwiki is s1, commonswiki is s4 and dewiki is s5)
[02:44:04] <bblack>	 hmmmm ok
[02:44:10] <Krenair>	 The problem was going back to June last year
[02:54:18] <logmsgbot>	 !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 07m 06s)
[02:54:31] <morebots>	 Logged the message, Master
[02:58:58] <logmsgbot>	 !log LocalisationUpdate completed (1.26wmf4) at 2015-05-05 02:57:54+00:00
[02:59:05] <morebots>	 Logged the message, Master
[02:59:15] <Krenair>	 hmm
[02:59:25] <Krenair>	 why do we still have a pmtpa reference in wikitech.php?
[02:59:33] <Krenair>	 $wgOpenStackManagerProxyGateways = array('pmtpa' => '208.80.153.214', 'eqiad' => '208.80.155.156');
[03:01:22] <Krenair>	 krenair@silver:~$ mwscript eval.php labswiki
[03:01:27] <Krenair>	 Notice: Undefined index: SERVER_NAME in /srv/mediawiki/wmf-config/CommonSettings.php on line 206
[03:01:27] <Krenair>	 mkdir: cannot create directory �/sys/fs/cgroup/memory/mediawiki/job/16043�: Permission denied
[03:01:27] <Krenair>	 limit.sh: failed to create the cgroup.
[03:01:29] <Krenair>	 sigh
[03:02:36] <bblack>	 cgroups, really?
[03:02:58] <bblack>	 that's going to be fun when it comes time to move whatever that is to jessie...
[03:04:27] <bblack>	 (because everything in jessie runs underneath systems, which puts everything in its own set of cgroups.  I don't think said things (as in your shell, that script, that mkdir) can then escape that cgroup easily to create a separate root-level cgroup in any sane way)
[03:04:33] <bblack>	 s/systems/systemd/
[03:06:17] <Krenair>	 It's https://phabricator.wikimedia.org/T92712 again (still?)
[03:06:31] <Krenair>	 but the SERVER_NAME thing as well now
[04:23:26] <grrrit-wm>	 (03PS2) 10Yuvipanda: tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748) 
[04:24:06] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748) (owner: 10Yuvipanda)
[04:38:08] <grrrit-wm>	 (03PS3) 10Yuvipanda: tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748) 
[04:39:16] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748) (owner: 10Yuvipanda)
[04:43:52] <logmsgbot>	 !log tstarling Synchronized php-1.26wmf3/extensions/SecurePoll/cli/wm-scripts/bv2015/voterList.php: (no message) (duration: 00m 19s)
[04:43:58] <morebots>	 Logged the message, Master
[05:07:33] <logmsgbot>	 !log tstarling Synchronized php-1.26wmf3/extensions/SecurePoll/cli/wm-scripts/bv2015/voterList.php: (no message) (duration: 00m 16s)
[05:07:39] <morebots>	 Logged the message, Master
[05:51:04] <logmsgbot>	 !log LocalisationUpdate ResourceLoader cache refresh completed at Tue May  5 05:50:01 UTC 2015 (duration 50m 0s)
[05:51:10] <morebots>	 Logged the message, Master
[06:29:59] <icinga-wm>	 PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures
[06:29:59] <icinga-wm>	 PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures
[06:30:20] <icinga-wm>	 PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures
[06:30:50] <icinga-wm>	 PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures
[06:31:00] <icinga-wm>	 PROBLEM - puppet last run on elastic1030 is CRITICAL Puppet has 1 failures
[06:31:00] <icinga-wm>	 PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:31:10] <icinga-wm>	 PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 2 failures
[06:31:20] <icinga-wm>	 PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures
[06:32:29] <icinga-wm>	 PROBLEM - puppet last run on mw1144 is CRITICAL Puppet has 2 failures
[06:35:10] <icinga-wm>	 PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 3 failures
[06:46:10] <icinga-wm>	 RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:19] <icinga-wm>	 RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:46:40] <icinga-wm>	 RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:40] <icinga-wm>	 RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[06:47:09] <icinga-wm>	 RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:47:09] <icinga-wm>	 RECOVERY - puppet last run on mw1144 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:21] <icinga-wm>	 RECOVERY - puppet last run on elastic1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:21] <icinga-wm>	 RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:47:30] <icinga-wm>	 RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:40] <icinga-wm>	 RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[07:39:13] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: diamond: collectors require python-diamond (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/208840 (owner: 10Hashar)
[07:44:51] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[07:49:50] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0]
[07:59:08] <godog>	 !log test reboot fluorine with new disk
[07:59:16] <morebots>	 Logged the message, Master
[08:04:18] <matanya>	 _joe_: morning, are you aware of the fact hhvm 3.7.0 was released today ?
[08:04:36] <_joe_>	 matanya: yeah but it's not a LTS release
[08:04:40] <_joe_>	 we just track those
[08:05:16] <_joe_>	 and well, 3.3 => 3.6 has been quite complicated to do.
[08:05:29] <_joe_>	 and we can't really use FB's packages either
[08:05:41] <matanya>	 ah, ok. thanks for that. another question, if you don't mind. will the video scalers support vp9 in the near future ?
[08:06:41] <_joe_>	 matanya: no idea, sorry, I'm working on something completely different right now 
[08:07:06] <matanya>	 i'll see if there is a ticket for that
[08:07:23] <_joe_>	 it's surely something we may want to do, but I don't see anyone with the bandwidth to work on that, nor in ops or in any other post-reorg team
[08:07:47] <matanya>	 found it: https://phabricator.wikimedia.org/T55863
[08:07:54] <_joe_>	 but I may be wrong, my views are quite fuzzy right now - dust will settle eventually
[08:08:50] <matanya>	 yeah, i see. will wait for this too. thanks much _joe_ 
[08:11:41] <_joe_>	 matanya: as a community member, you should probably speak with someone in product to ask for resources dedicated to the videoscalers/multimedia in general 
[08:12:39] <matanya>	 _joe_: sad to say, but from my POV, multimedia and admin tools are the most neglected areas at WMF eng department. 
[08:13:27] <_joe_>	 matanya: well, I may agree, but tbh I think it's a good idea to focus our dev efforts on specific goals and try to nail those down
[08:13:51] <_joe_>	 instead of continuosly work on all the 1000 things we do. Our resources are quite constrained
[08:16:40] <matanya>	 yes, fair point
[08:46:15] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Revert "Depool ulsfo, network troubles" [dns] - 10https://gerrit.wikimedia.org/r/208918 
[08:46:25] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: Revert "Depool ulsfo, network troubles" [dns] - 10https://gerrit.wikimedia.org/r/208918 
[08:46:31] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Revert "Depool ulsfo, network troubles" [dns] - 10https://gerrit.wikimedia.org/r/208918 (owner: 10Faidon Liambotis)
[08:47:01] <paravoid>	 !log repooling ulsfo
[08:47:09] <morebots>	 Logged the message, Master
[09:04:50] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[09:08:10] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0]
[09:09:44] <grrrit-wm>	 (03Abandoned) 10Hashar: diamond: collectors require python-diamond [puppet] - 10https://gerrit.wikimedia.org/r/208840 (owner: 10Hashar)
[09:09:49] <grrrit-wm>	 (03CR) 10Hashar: diamond: collectors require python-diamond (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/208840 (owner: 10Hashar)
[09:10:59] <icinga-wm>	 PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail
[09:14:10] <icinga-wm>	 RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:20:36] <valhallasw`nuage>	 hashar: every puppet dependency issue can be solved by running puppet often enough =p
[09:22:55] <valhallasw`nuage>	 (which is part of what makes them hard to debug...)
[09:23:40] <icinga-wm>	 PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[09:25:20] <icinga-wm>	 RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[09:30:36] <valhallasw`nuage>	 hashar: maybe we should switch to MOAR? https://puppetlabs.com/blog/introducing-manifest-ordered-resources
[09:30:42] <godog>	 hashar: I agree it is simpler to treat puppet as converging over time
[09:31:01] <valhallasw`nuage>	 basically, there's a puppet setting that keeps the ordering in the manifest file
[09:31:08] <_joe_>	 godog: simpler and lamer
[09:31:09] <_joe_>	 :P
[09:32:04] <godog>	 machines don't judge!
[09:32:37] <godog>	 uncertain gain vs certain loss
[09:32:52] <valhallasw`nuage>	 it's like LaTeX! Just run it a gazillion times, and at some point your references will be right
[09:33:44] <godog>	 yep, and if it diverges you'll fairly quickly
[09:34:39] <godog>	 you'll know, even
[09:37:56] <_joe_>	 valhallasw`nuage: I don't think I never ever needed to recompile a latex doc more than three times before it was ok 
[09:38:01] <_joe_>	 :P
[09:38:13] <_joe_>	 with puppet, OTOH...
[09:40:44] <valhallasw`nuage>	 _joe_: when I insert a new bibtex reference, 3 is standard (compile, resolve bibtex references, compile with inserted bibliography, recompile to get references inserted), and 4 happens regularly (because something shifts just over a page edge due to the insertion of those references).
[09:41:01] <valhallasw`nuage>	 _joe_: and it's probably 5 passes when starting from just the .tex and .bib, but I try not to do that :D
[09:41:21] <valhallasw`nuage>	 but yeah, puppet is more... random
[09:41:22] <_joe_>	 valhallasw`nuage: I think it was 3 for bibtex/toc
[09:41:41] <_joe_>	 valhallasw`nuage: I don't seriously use latex since I left academia though, so... 2008
[09:42:04] <_joe_>	 I mean I used latex-beamer a couple of times for presentations before I chose life
[09:47:10] <valhallasw`nuage>	 I thought latex-beamer was cool once, because you could add pretty formulas to your slides. At some point I realized formulas on slides are a bad idea...
[09:47:23] <_joe_>	 well, it depends
[09:47:36] <_joe_>	 if you're presenting some theoretical physics paper, maybe :P
[09:48:51] <wikibugs>	 6operations, 5Continuous-Integration-Isolation: Disable diamond collector on contintcloud labs project - https://phabricator.wikimedia.org/T98121#1259982 (10hashar) 3NEW a:3hashar
[09:57:01] <grrrit-wm>	 (03PS1) 10Hashar: standard: ability to disable diamond [puppet] - 10https://gerrit.wikimedia.org/r/208924 (https://phabricator.wikimedia.org/T98121) 
[09:58:11] <grrrit-wm>	 (03CR) 10Hashar: "The class parameter default to true so that should be a complete noop unless I fail to understand puppet." [puppet] - 10https://gerrit.wikimedia.org/r/208924 (https://phabricator.wikimedia.org/T98121) (owner: 10Hashar)
[10:00:33] <valhallasw`nuage>	 hashar: famous last words
[10:00:46] <hashar>	 valhallasw`nuage: if only puppetlabs.com worked :D
[10:01:27] <hashar>	 valhallasw`nuage: a  friend demoed me propellor, a  configuration management system using haskell
[10:01:48] <valhallasw`nuage>	 and it tells you what's wrong with it before you even run ghc? :p
[10:01:58] <hashar>	 valhallasw`nuage: you can compile your "manifest" locally  and that does the validation / ensure all cases are handled  because... haskell!
[10:02:04] <hashar>	 https://propellor.branchable.com/
[10:02:15] <hashar>	 the author has a bunch of blog posts
[10:02:36] <valhallasw`nuage>	 hashar: I suppose, but it can't make sure the manifest actually works, because it e.g. doesn't know what the effect of a change will be
[10:02:50] <hashar>	 yuo
[10:02:51] <hashar>	 yup
[10:02:59] <hashar>	 but the dependencies / ordering is dealt with earlier
[10:03:20] <_joe_>	 propellor: propel yourself into pain and irrelevance with haskell!!!1!
[10:03:46] <valhallasw`nuage>	 or we can make puppet smart enough to track what it installs, and to understand why failures are now suddenly solved
[10:03:50] <valhallasw`nuage>	 hmm.
[10:04:19] <_joe_>	 valhallasw`nuage: what is the problem you want to solve?
[10:04:20] <icinga-wm>	 PROBLEM - puppet last run on fluorine is CRITICAL: Connection refused by host
[10:04:33] <_joe_>	 godog: this you ^^?
[10:04:35] <valhallasw`nuage>	 _joe_: manifests that only work on the third run and no-one understanding why :-p
[10:04:45] <_joe_>	 well, which ones?
[10:04:58] <godog>	 _joe_: yeah downtime finished, rescheduling
[10:05:15] <_joe_>	 valhallasw`nuage: usually that's because they're poorly written or not properly refactored when new features are added
[10:05:46] <valhallasw`nuage>	 the diamond one hashar tried to fix, for instance
[10:05:48] <_joe_>	 valhallasw`nuage: or, one of the numerous bugs in puppet's dependency chanins
[10:06:11] <valhallasw`nuage>	 sure, it's typically because the manifest is wrong, but it's really hard to see /why/ it's wrong because it's hard to guess depenencies
[10:06:29] <valhallasw`nuage>	 and it's hard to test because the issue 'solves itself' on subsequent puppet runs
[10:06:35] <_joe_>	 uhm I'm not usually in that position :)
[10:06:56] <_joe_>	 but I dunno, I'd have to look at the manifest, and I don't have time right now
[10:09:05] <wikibugs>	 7Puppet, 6operations, 10Beta-Cluster: Trebuchet on deployment-bastion: wrong group owner - https://phabricator.wikimedia.org/T97775#1260027 (10mobrovac) >>! In T97775#1252075, @chasemp wrote: > sure, I mean all of those should be owned by trebuchet and deployment since deployment is the group for deployers....
[10:09:08] <hashar>	 This kernel does not support a non-PAE CPU.
[10:09:09] <hashar>	 \o/
[10:12:01] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Disable manual puppetmaster start/restart [puppet] - 10https://gerrit.wikimedia.org/r/208926 
[10:12:46] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "Obviously a good idea" [puppet] - 10https://gerrit.wikimedia.org/r/208926 (owner: 10Alexandros Kosiaris)
[10:13:27] <grrrit-wm>	 (03PS4) 10Giuseppe Lavagetto: hiera: Add a proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/207128 (https://phabricator.wikimedia.org/T93776) 
[10:14:02] <grrrit-wm>	 (03CR) 10Hashar: "Shouldn't it autostart on labs instances having their self puppetmaster?" [puppet] - 10https://gerrit.wikimedia.org/r/208926 (owner: 10Alexandros Kosiaris)
[10:14:07] <_joe_>	 akosiaris: I am going to merge the proxy backend, so that we can later do a full catalog differ run on the change that starts using it.
[10:15:32] <MaxSem>	 hashar, uh... win?
[10:15:50] <icinga-wm>	 RECOVERY - puppet last run on fluorine is OK Puppet is currently enabled, last run 22 minutes ago with 0 failures
[10:17:08] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "@hashar, labs instances do not use passenger (well they can but that requires manual work so we can safely assume almost none does)" [puppet] - 10https://gerrit.wikimedia.org/r/208926 (owner: 10Alexandros Kosiaris)
[10:19:55] <akosiaris>	 _joe_: ok
[10:32:22] <wikibugs>	 6operations, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Investigation if Fluorine needs bigger disks or we retain too much data - https://phabricator.wikimedia.org/T92417#1260071 (10fgiunchedi) ```    8        0 2930266584 sda    8        1    7811072 sda1    8        2   78125056 sda2    8        3 18671...
[10:33:58] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: graphoid: create admin group [puppet] - 10https://gerrit.wikimedia.org/r/208929 
[10:40:28] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] graphoid: create admin group [puppet] - 10https://gerrit.wikimedia.org/r/208929 (owner: 10Alexandros Kosiaris)
[10:40:52] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] Disable manual puppetmaster start/restart [puppet] - 10https://gerrit.wikimedia.org/r/208926 (owner: 10Alexandros Kosiaris)
[11:00:27] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Assign weights to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/208933 
[11:09:36] <wikibugs>	 6operations: Scale up puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1260135 (10akosiaris) 3NEW a:3akosiaris
[11:14:22] <wikibugs>	 6operations: Investigate the compatibility of our puppet tree with ruby1.9 and create a plan to upgrade. - https://phabricator.wikimedia.org/T98129#1260150 (10akosiaris) 3NEW a:3akosiaris
[11:16:12] <wikibugs>	 6operations: Scale up puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1260161 (10akosiaris)
[11:16:12] <wikibugs>	 7Puppet, 6operations: puppet masters are maxed out - https://phabricator.wikimedia.org/T97989#1260160 (10akosiaris)
[11:16:45] <wikibugs>	 6operations: Scale up puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1260135 (10akosiaris)
[11:16:45] <wikibugs>	 6operations: Investigate the compatibility of our puppet tree with ruby1.9 and create a plan to upgrade. - https://phabricator.wikimedia.org/T98129#1260174 (10akosiaris)
[11:17:13] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: Assign weights to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/208933 (https://phabricator.wikimedia.org/T98128) 
[11:19:35] <hashar>	 E: Unable to locate package quickstack
[11:19:39] <hashar>	 how annoying :]
[11:22:25] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: WIP: Proper labs_storage class (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren)
[11:23:46] <matanya>	 Krenair: can you arrage otrs access for me to the crash reports queue please ?
[11:24:03] <matanya>	 i am an idiot. i have it.
[11:25:56] <Steinsplitter>	 matanya, you are not an idiot :> pls be kind to yourself :P  :-)
[11:31:40] <wikibugs>	 6operations: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1260188 (10faidon)
[11:33:18] <wikibugs>	 6operations: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1260150 (10faidon) jessie comes with 2.1 (which should be even faster than 1.9) so I adjusted the description accordingly. That said, I'd expect most of the compatibi...
[11:36:21] <wikibugs>	 6operations, 7Monitoring: Upgrade to newer version of gdash - https://phabricator.wikimedia.org/T98134#1260208 (10faidon) 3NEW
[11:39:37] <hashar>	 ah puppet apply does not accept multiples --execute :/
[11:39:44] <hashar>	 but bash can helps!   puppet apply <(echo -e "notify { 'foo': }\nnotify { 'bar': }" )
[11:54:49] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: ganglia_new::web. Increase the default memory limit [puppet] - 10https://gerrit.wikimedia.org/r/208937 (https://phabricator.wikimedia.org/T97637) 
[11:58:04] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] ganglia_new::web. Increase the default memory limit [puppet] - 10https://gerrit.wikimedia.org/r/208937 (https://phabricator.wikimedia.org/T97637) (owner: 10Alexandros Kosiaris)
[12:00:14] <akosiaris>	 godog: https://gerrit.wikimedia.org/r/#/c/208933/2/manifests/role/puppetmaster.pp,cm
[12:00:23] <akosiaris>	 I think this should give us some breathing room
[12:00:49] <icinga-wm>	 PROBLEM - puppet last run on cp3036 is CRITICAL puppet fail
[12:05:35] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032] Assign weights to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/208933 (https://phabricator.wikimedia.org/T98128) (owner: 10Alexandros Kosiaris)
[12:05:42] <godog>	 akosiaris: sweet! LGTM
[12:18:43] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) (owner: 10Gage)
[12:18:49] <icinga-wm>	 RECOVERY - puppet last run on cp3036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:35:20] <matanya>	 hoo: server side upload for me please ?
[12:35:39] <hoo>	 Sure thing
[12:35:44] <hoo>	 What do you need uploaded?
[12:35:51] <matanya>	 a movie
[12:36:03] <matanya>	 do you have access to the video project ?
[12:36:55] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Notify apache on ganglia_new::web php.ini updates [puppet] - 10https://gerrit.wikimedia.org/r/208947 
[12:36:55] <matanya>	 hoo: the file is in encoding01.wmflabs.org:/home/matanya/Sintel.webm
[12:37:26] <matanya>	 the description is in the same as https://commons.wikimedia.org/w/index.php?title=File:Sintel_movie_720x306.ogv&action=edit
[12:37:38] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Notify apache on ganglia_new::web php.ini updates [puppet] - 10https://gerrit.wikimedia.org/r/208947 (owner: 10Alexandros Kosiaris)
[12:37:46] <matanya>	 the file name should be Sintel movie 4K.webm
[12:38:49] <hoo>	 Ok, will do
[12:39:16] <matanya>	 thanks
[12:42:11] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: Notify apache on ganglia_new::web php.ini updates [puppet] - 10https://gerrit.wikimedia.org/r/208947 
[12:46:05] <matanya>	 hoo: please ping me when done. 
[12:46:31] <hoo>	 I'm having slight trouble with rsync, give me a sec
[12:46:41] <hoo>	 I hate having to do the proxycommand stuff inline...
[12:47:59] <hoo>	 I manage to get onto the labs bastion, but not onto encoding02.eqiad.wmflabs:22
[12:48:32] <matanya>	 hoo: encoding01
[12:48:58] <hoo>	 That explains a lot...
[12:49:10] <hoo>	 rsync: change_dir "/home/matanya" failed: Permission denied (13)
[12:49:33] <matanya>	 do you want me to move it to somewhere shared ?
[12:49:38] <hoo>	 Can you move it into a dir I have +x on?
[12:49:48] <hoo>	 yes, please
[12:51:12] <matanya>	 moving to /data/project/wikimania2014
[12:51:18] <matanya>	 will take some minutes
[12:51:23] <hoo>	 Ok
[12:51:28] <matanya>	 or not. done.
[12:51:44] <hoo>	 Nice
[12:52:32] <hoo>	 Copying at 38MB/s :)
[12:54:12] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] Notify apache on ganglia_new::web php.ini updates [puppet] - 10https://gerrit.wikimedia.org/r/208947 (owner: 10Alexandros Kosiaris)
[12:56:05] <wikibugs>	 6operations, 10Traffic, 7discovery-system: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1260323 (10Joe) p:5Low>3High a:3Joe
[12:56:33] <hoo>	 matanya: Upload with your main account?
[12:56:43] <matanya>	 yes, please
[12:57:46] <hoo>	 Upload started
[12:58:15] <hoo>	 Uploads are very fast with the new swift servers in place nowadays :)
[12:59:09] <godog>	 yeah 12 * 3 new spindles helped
[12:59:52] <matanya>	 i wish i could upload from server side. the limits are funny
[13:00:25] <hoo>	 Even in that case you have a 4.3(?)GiB limit
[13:00:36] <hoo>	 I hit that once or twice while uploading things for people
[13:01:09] <hoo>	 In a more perfect world we would only do server side uploads in case of huge batches of filesw
[13:01:35] <hoo>	 {{done}}
[13:02:35] <godog>	 matanya: why can't you?
[13:02:56] <matanya>	 no rights godog 
[13:03:13] <matanya>	 thank you very much hoo 
[13:03:18] <hoo>	 You're welcome
[13:04:05] <matanya>	 ah, the quality! joyfull moments. 
[13:04:08] <TimStarling>	 !log updating voter list for the FDC election for T97924
[13:04:14] <morebots>	 Logged the message, Master
[13:05:45] <grrrit-wm>	 (03CR) 10JanZerebecki: "As detailed in the ticket beta is more complicated. Because it uses the same wiki id for the config as www.wikidata.org so divering those " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208654 (https://phabricator.wikimedia.org/T97993) (owner: 10JanZerebecki)
[13:09:20] <icinga-wm>	 PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[13:10:09] <icinga-wm>	 PROBLEM - puppet last run on mw2002 is CRITICAL puppet fail
[13:10:09] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL puppet fail
[13:10:09] <icinga-wm>	 PROBLEM - puppet last run on mw1149 is CRITICAL puppet fail
[13:10:10] <icinga-wm>	 PROBLEM - puppet last run on cp4019 is CRITICAL puppet fail
[13:10:10] <icinga-wm>	 PROBLEM - puppet last run on multatuli is CRITICAL puppet fail
[13:10:19] <icinga-wm>	 PROBLEM - puppet last run on wtp1012 is CRITICAL puppet fail
[13:10:20] <icinga-wm>	 PROBLEM - puppet last run on lithium is CRITICAL puppet fail
[13:10:30] <icinga-wm>	 PROBLEM - puppet last run on snapshot1001 is CRITICAL puppet fail
[13:10:40] <icinga-wm>	 PROBLEM - puppet last run on mw1118 is CRITICAL Puppet has 50 failures
[13:10:40] <icinga-wm>	 PROBLEM - puppet last run on mw1044 is CRITICAL puppet fail
[13:10:40] <icinga-wm>	 PROBLEM - puppet last run on silver is CRITICAL puppet fail
[13:10:50] <icinga-wm>	 PROBLEM - puppet last run on pc1002 is CRITICAL puppet fail
[13:10:59] <icinga-wm>	 PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 8 failures
[13:11:00] <icinga-wm>	 PROBLEM - puppet last run on labstore1001 is CRITICAL puppet fail
[13:11:06] <bblack>	 well that's exciting
[13:11:10] <icinga-wm>	 PROBLEM - puppet last run on mw2166 is CRITICAL puppet fail
[13:11:10] <icinga-wm>	 PROBLEM - puppet last run on labcontrol2001 is CRITICAL puppet fail
[13:11:10] <icinga-wm>	 PROBLEM - puppet last run on mw1211 is CRITICAL Puppet has 1 failures
[13:11:10] <icinga-wm>	 PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 3 failures
[13:11:10] <icinga-wm>	 PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 8 failures
[13:11:10] <icinga-wm>	 PROBLEM - puppet last run on db1021 is CRITICAL Puppet has 9 failures
[13:11:10] <icinga-wm>	 PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 8 failures
[13:11:19] <icinga-wm>	 PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 9 failures
[13:11:20] <icinga-wm>	 PROBLEM - puppet last run on wtp2019 is CRITICAL puppet fail
[13:11:20] <icinga-wm>	 PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 10 failures
[13:11:20] <icinga-wm>	 PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 10 failures
[13:11:20] <icinga-wm>	 PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 8 failures
[13:11:21] <godog>	 7kick icinga-wm
[13:11:30] <icinga-wm>	 PROBLEM - puppet last run on mw1177 is CRITICAL Puppet has 2 failures
[13:11:30] <icinga-wm>	 PROBLEM - puppet last run on virt1001 is CRITICAL puppet fail
[13:11:30] <icinga-wm>	 PROBLEM - puppet last run on mw1237 is CRITICAL puppet fail
[13:11:30] <icinga-wm>	 PROBLEM - puppet last run on mw1054 is CRITICAL Puppet has 39 failures
[13:11:30] <icinga-wm>	 PROBLEM - puppet last run on mw1129 is CRITICAL puppet fail
[13:11:39] <icinga-wm>	 PROBLEM - puppet last run on mw1126 is CRITICAL Puppet has 2 failures
[13:11:40] <icinga-wm>	 PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 9 failures
[13:11:40] <icinga-wm>	 PROBLEM - puppet last run on mw2136 is CRITICAL Puppet has 1 failures
[13:11:40] <icinga-wm>	 PROBLEM - puppet last run on wtp2015 is CRITICAL Puppet has 5 failures
[13:11:40] <icinga-wm>	 PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 10 failures
[13:11:40] <icinga-wm>	 PROBLEM - puppet last run on mw2090 is CRITICAL Puppet has 8 failures
[13:11:40] <icinga-wm>	 PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 7 failures
[13:11:49] <icinga-wm>	 PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 4 failures
[13:11:49] <godog>	 akosiaris: ^ ?
[13:11:59] <icinga-wm>	 PROBLEM - puppet last run on mw1039 is CRITICAL Puppet has 54 failures
[13:12:00] <icinga-wm>	 PROBLEM - puppet last run on mw1011 is CRITICAL Puppet has 38 failures
[13:12:00] <icinga-wm>	 PROBLEM - puppet last run on dataset1001 is CRITICAL Puppet has 16 failures
[13:12:09] <icinga-wm>	 PROBLEM - puppet last run on mw1213 is CRITICAL Puppet has 1 failures
[13:12:10] <bblack>	 ESC[1;31mError: Could not send report: Connection reset by peer - SSL_connectESC[0m
[13:12:10] <icinga-wm>	 PROBLEM - puppet last run on labnet1001 is CRITICAL Puppet has 1 failures
[13:12:19] <icinga-wm>	 PROBLEM - puppet last run on mw1175 is CRITICAL Puppet has 8 failures
[13:12:20] <icinga-wm>	 PROBLEM - puppet last run on db2042 is CRITICAL Puppet has 6 failures
[13:12:20] <icinga-wm>	 PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 10 failures
[13:12:20] <icinga-wm>	 PROBLEM - puppet last run on mw2030 is CRITICAL Puppet has 6 failures
[13:12:20] <icinga-wm>	 PROBLEM - puppet last run on mc1012 is CRITICAL Puppet has 7 failures
[13:12:29] <icinga-wm>	 PROBLEM - puppet last run on mw2146 is CRITICAL Puppet has 10 failures
[13:12:30] <icinga-wm>	 PROBLEM - puppet last run on mw2114 is CRITICAL Puppet has 9 failures
[13:12:30] <icinga-wm>	 PROBLEM - puppet last run on mw2059 is CRITICAL Puppet has 10 failures
[13:12:30] <icinga-wm>	 PROBLEM - puppet last run on mw2092 is CRITICAL Puppet has 9 failures
[13:12:30] <icinga-wm>	 PROBLEM - puppet last run on mw2056 is CRITICAL Puppet has 4 failures
[13:12:30] <icinga-wm>	 PROBLEM - puppet last run on mw1172 is CRITICAL Puppet has 5 failures
[13:12:48] <akosiaris>	 godog: that would be me
[13:13:08] <akosiaris>	 !log restarted apache2 on palladium
[13:13:15] <morebots>	 Logged the message, Master
[13:13:20] <icinga-wm>	 PROBLEM - puppet last run on mw2011 is CRITICAL Puppet has 5 failures
[13:14:11] <matanya>	 godog: is there a reason the file is not transcoded ? 
[13:14:24] <akosiaris>	 seemed like a reload would not pickup the balancer change
[13:16:03] <godog>	 matanya: I know very little about the multimedia pipeline on the mw side sadly
[13:16:23] <matanya>	 that is a question for the mutlimedia guys ?
[13:17:08] <godog>	 I think so, but you'd have to be more specific
[13:18:35] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] ipsec-global: fix bug in non-verbose mode, exit if not root [puppet] - 10https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) (owner: 10Gage)
[13:22:10] <icinga-wm>	 PROBLEM - MySQL Idle Transactions on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 948 seconds
[13:22:19] <akosiaris>	 ohoh
[13:23:44] <bblack>	 db1040 doesn't sound like puppetmaster issues heh
[13:24:15] <akosiaris>	 commonswiki
[13:24:30] <akosiaris>	 1000+ secs of Sleep connections
[13:24:32] <akosiaris>	 not many though
[13:24:42] <akosiaris>	 7 
[13:25:09] <akosiaris>	 and they 're gone now
[13:26:15] <springle>	 !log killed db1040 blocking txns T97641 again
[13:26:28] <wikibugs>	 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, 3Wikidata-Sprint-2015-04-21: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1260398 (10JanZerebecki) a:5hoo>3daniel What pattern can one search for to find old serialization?
[13:26:28] <godog>	 video scalers are not happy either
[13:26:40] <icinga-wm>	 RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[13:26:40] <icinga-wm>	 RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[13:26:49] <icinga-wm>	 RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures
[13:26:54] <godog>	 matanya: how much data was that upload?
[13:27:00] <icinga-wm>	 RECOVERY - puppet last run on lithium is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[13:27:00] <icinga-wm>	 RECOVERY - puppet last run on dataset1001 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[13:27:09] <icinga-wm>	 RECOVERY - puppet last run on mw1213 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[13:27:10] <icinga-wm>	 RECOVERY - MySQL Idle Transactions on db1040 is OK longest blocking idle transaction sleeps for 0 seconds
[13:27:10] <icinga-wm>	 RECOVERY - puppet last run on labnet1001 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[13:27:10] <icinga-wm>	 RECOVERY - puppet last run on mw1118 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[13:27:19] <icinga-wm>	 RECOVERY - puppet last run on mw1175 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[13:27:20] <icinga-wm>	 RECOVERY - puppet last run on db2042 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[13:27:20] <icinga-wm>	 RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[13:27:20] <icinga-wm>	 RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[13:27:20] <icinga-wm>	 RECOVERY - puppet last run on mc1012 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures
[13:27:29] <icinga-wm>	 RECOVERY - puppet last run on pc1002 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[13:27:30] <icinga-wm>	 RECOVERY - puppet last run on mw2146 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures
[13:27:30] <icinga-wm>	 RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:27:30] <icinga-wm>	 RECOVERY - puppet last run on mw2059 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[13:27:30] <icinga-wm>	 RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[13:27:30] <icinga-wm>	 RECOVERY - puppet last run on mw1172 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[13:27:30] <icinga-wm>	 RECOVERY - puppet last run on mw2056 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[13:27:30] <icinga-wm>	 RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:27:31] <icinga-wm>	 RECOVERY - puppet last run on labstore1001 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[13:27:40] <icinga-wm>	 RECOVERY - puppet last run on mw1211 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:27:40] <icinga-wm>	 RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:27:40] <icinga-wm>	 RECOVERY - puppet last run on db1021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:27:50] <icinga-wm>	 RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:27:50] <icinga-wm>	 RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:27:50] <icinga-wm>	 RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[13:27:51] <icinga-wm>	 RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[13:27:51] <icinga-wm>	 RECOVERY - puppet last run on wtp2019 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[13:27:51] <icinga-wm>	 RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:27:51] <icinga-wm>	 RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures
[13:28:00] <icinga-wm>	 RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:00] <matanya>	 godog: 3.26 GB
[13:28:01] <icinga-wm>	 RECOVERY - puppet last run on virt1001 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[13:28:01] <icinga-wm>	 RECOVERY - puppet last run on mw1177 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:09] <icinga-wm>	 RECOVERY - puppet last run on mw1237 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[13:28:09] <icinga-wm>	 RECOVERY - puppet last run on mw1054 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:10] <icinga-wm>	 RECOVERY - puppet last run on mw1129 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures
[13:28:10] <akosiaris>	 godog: springle killing those txns and matanya's video might be related
[13:28:10] <icinga-wm>	 RECOVERY - puppet last run on mw1126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:20] <icinga-wm>	 RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:20] <icinga-wm>	 RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[13:28:20] <icinga-wm>	 RECOVERY - puppet last run on mw2136 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures
[13:28:20] <icinga-wm>	 RECOVERY - puppet last run on mw2090 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[13:28:21] <icinga-wm>	 RECOVERY - puppet last run on mw1149 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[13:28:21] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:21] <icinga-wm>	 RECOVERY - puppet last run on mw2011 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[13:28:21] <icinga-wm>	 RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures
[13:28:28] <godog>	 akosiaris: very likely, http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Video+scalers+eqiad&h=&tab=m&vn=&hide-hf=false&m=network_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[13:28:30] <icinga-wm>	 RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:30] <icinga-wm>	 RECOVERY - puppet last run on wtp1012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:30] <icinga-wm>	 RECOVERY - puppet last run on mw1039 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:40] <icinga-wm>	 RECOVERY - puppet last run on mw1011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:48] <godog>	 matanya: 3.26gb total? ack
[13:28:49] <icinga-wm>	 RECOVERY - puppet last run on snapshot1001 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures
[13:28:50] <icinga-wm>	 RECOVERY - puppet last run on mw1044 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:28:50] <icinga-wm>	 RECOVERY - puppet last run on silver is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:29:08] <matanya>	 godog: should i be sorry ?
[13:29:20] <icinga-wm>	 RECOVERY - puppet last run on mw2166 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures
[13:29:59] <icinga-wm>	 RECOVERY - puppet last run on mw2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:30:06] <marktraceur>	 For the record, "the multimedia guys" are now three JavaScript coders, so sending people to us about transcoding problems might work about 20% of the time. :)
[13:30:08] <godog>	 matanya: no I don't think so, it'll take a while for the video scalers to chug through that, so that might be the answer to your transcode question
[13:30:39] <matanya>	 thanks
[13:30:56] <godog>	 marktraceur: haha how come 20%?
[13:31:55] <marktraceur>	 godog: Shrug, just a wild guess
[13:32:17] <akosiaris>	 somehow I am already seeing transcoid on the horizon
[13:33:18] <godog>	 marktraceur: hehe will keep that in mind
[13:33:22] <matanya>	 we should wikipedoid, a bot writting articles
[13:33:30] <matanya>	 *should have
[13:33:40] <matanya>	 that would be a nice service
[13:34:50] <marktraceur>	 matanya: We outsourced that job to Bollywood agents. They appear to continue doing it by hand.
[13:35:01] <matanya>	 haha
[13:37:14] <grrrit-wm>	 (03CR) 10Ottomata: "OO, nice! Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda)
[13:37:20] <wikibugs>	 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, and 2 others: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1260419 (10Tobi_WMDE_SW) a:5daniel>3hoo
[13:38:00] <icinga-wm>	 PROBLEM - MySQL InnoDB on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 818 seconds
[13:39:11] <springle>	 great
[13:39:44] <matanya>	 springle: is that my video ?
[13:39:49] <icinga-wm>	 RECOVERY - MySQL InnoDB on db1040 is OK longest blocking idle transaction sleeps for 0 seconds
[13:40:02] <springle>	 matanya: no idea
[13:40:49] <hoo>	 I doubt it
[13:41:46] <hoo>	 There are several connections from terbium opened, maybe someone is doing weird stuff?
[13:42:45] <springle>	 the connections I've seen come from videoscalers T97641. Run that query, then sleep and hold locks for minutes
[13:43:47] <springle>	 It only really matters when multiple connections back up waiting on each other. Except that anything holding resources like this ends up mattering on a master.
[13:44:15] <godog>	 while waiting for avconv to finish perhaps? 
[13:44:16] <godog>	 www-data 18252  147  1.4 1041756 240920 ?      RNl  13:04  53:07  |                       \_ /usr/bin/avconv
[13:44:19] <godog>	 etc
[13:44:29] <springle>	 perhaps so. bad design
[13:45:16] <andrewbogott>	 I am pretty sure that the new labvirt1007 has all the same puppet roles as labvirt1001-1006 and yet it isn’t showing up in ganglia reports.  Where should I start?
[13:46:31] <hoo>	 springle: MediaWiki opens a transaction per default
[13:46:56] <hoo>	 usually you need to flush it before doing such blocking things
[13:49:10] <grrrit-wm>	 (03PS3) 10Filippo Giunchedi: graphite: split alerts role [puppet] - 10https://gerrit.wikimedia.org/r/208083 (https://phabricator.wikimedia.org/T97754) 
[13:49:41] <godog>	 ottomata: thoughts on https://gerrit.wikimedia.org/r/#/c/207805/ ?
[13:49:50] <bblack>	 andrewbogott: if it's a brand-new fresh host, try restarting ganglia-monitor service? sometimes it's hosed on first start
[13:50:03] <andrewbogott>	 bblack: ok, trying
[13:50:22] <grrrit-wm>	 (03CR) 10Ottomata: [C: 031] varnishkafka: use statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/207805 (https://phabricator.wikimedia.org/T95687) (owner: 10Filippo Giunchedi)
[13:50:26] <ottomata>	 godog: if you are ready, go for it!
[13:51:08] <godog>	 ottomata: yep I am! no other action needed? e.g. restart?
[13:51:22] <ottomata>	 naw, its a cron
[13:51:40] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: varnishkafka: use statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/207805 (https://phabricator.wikimedia.org/T95687) 
[13:51:40] <ottomata>	 i guess eventually remove the local statsds?
[13:51:46] <hoo>	 ottomata: Got a second?
[13:51:50] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnishkafka: use statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/207805 (https://phabricator.wikimedia.org/T95687) (owner: 10Filippo Giunchedi)
[13:51:50] <ottomata>	 sure
[13:52:09] <godog>	 ottomata: yep, got https://gerrit.wikimedia.org/r/208635 out but jenkins says no
[13:52:13] <hoo>	 ottomata: Can you have a look at oxygen... according to Ganglia it has some weird swap setting
[13:52:27] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) 
[13:52:44] <hoo>	 I wonder what swpaon -s says
[13:52:46] <grrrit-wm>	 (03PS3) 10Andrew Bogott: Add a couple of settings to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/205979 
[13:52:48] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Rename virt1012 to labvirt1008. [puppet] - 10https://gerrit.wikimedia.org/r/208954 
[13:53:18] <ottomata>	 ok, hoo, oxygen no swap
[13:53:23] <ottomata>	 i recently reinstalled it
[13:53:57] <andrewbogott>	 bblack: that doesn’t seem to have done it, although maybe I just need to wait longer
[13:54:26] <hoo>	 ottomata: Weird... ganglia is going nuts on it
[13:54:33] <ottomata>	 link?
[13:54:36] <hoo>	 http://ganglia.wikimedia.org/latest/graph.php?h=oxygen.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1430833995&g=mem_report&z=medium&c=Miscellaneous%20eqiad
[13:54:56] <ottomata>	 ha, hm weird
[13:55:00] <ottomata>	 cool, look at all that swap!
[13:55:14] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Rename virt1012 to labvirt1008 [dns] - 10https://gerrit.wikimedia.org/r/208955 
[13:55:53] <hoo>	 Has a lot of swap death potential :D
[14:01:16] <grrrit-wm>	 (03PS6) 10Alexandros Kosiaris: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh)
[14:01:57] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh)
[14:02:41] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Rename virt1012 to labvirt1008. [puppet] - 10https://gerrit.wikimedia.org/r/208954 
[14:04:00] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Rename virt1012 to labvirt1008. [puppet] - 10https://gerrit.wikimedia.org/r/208954 (owner: 10Andrew Bogott)
[14:04:16] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Rename virt1012 to labvirt1008 [dns] - 10https://gerrit.wikimedia.org/r/208955 (owner: 10Andrew Bogott)
[14:07:21] <godog>	 !log shut fluorine to replace sdb
[14:07:26] <morebots>	 Logged the message, Master
[14:09:09] <grrrit-wm>	 (03PS7) 10Alexandros Kosiaris: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh)
[14:12:49] <icinga-wm>	 PROBLEM - Host virt1012 is DOWN: PING CRITICAL - Packet loss = 100%
[14:12:59] <andrewbogott>	 ^that’s me, renaming.
[14:13:08] <andrewbogott>	 Hm, somehow I scheduled downtime for all services but not for the host itself
[14:18:39] <icinga-wm>	 RECOVERY - Host virt1012 is UPING OK - Packet loss = 0%, RTA = 0.74 ms
[14:20:23] <andrewbogott>	 Coren: icinca is upset about labstore1002.  Should I ack, or mark as downtime, or…?
[14:21:03] <Coren>	 andrewbogott: Hm.  I was considering what to do about it now.  I'm going to power it back up now, simply.
[14:21:11] <andrewbogott>	 ok then :)
[14:22:09] <ottomata>	 cmjohnson1: you sound maybe busy, but do you think you can get to this today? https://phabricator.wikimedia.org/T98081
[14:22:35] <cmjohnson1>	 ottomata: I will look at it shortly...i promise :-)
[14:22:50] <ottomata>	 thank you!
[14:23:28] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to Rhenium for dkg - https://phabricator.wikimedia.org/T98148#1260568 (10dkg) 3NEW
[14:26:43] <cmjohnson1>	 ottomata: is an1037 off?
[14:26:49] <cmjohnson1>	 can I turn on?
[14:27:29] <ottomata>	 yes
[14:31:10] <icinga-wm>	 PROBLEM - Host analytics1037 is DOWN: PING CRITICAL - Packet loss = 100%
[14:42:35] <icinga-wm>	 RECOVERY - Host analytics1037 is UPING OK - Packet loss = 0%, RTA = 5.72 ms
[14:42:45] <icinga-wm>	 RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset -0.00213599205 secs
[14:42:47] <grrrit-wm>	 (03PS3) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) 
[14:48:49] <matanya>	 akosiaris: i'm getting 503 when uploading files
[14:48:56] <matanya>	 oh, he is not here
[14:49:50] <paravoid>	 Coren: labstore1002 is still red in icinga.
[14:50:00] <paravoid>	 godog: graphite2001 too
[14:50:12] <Coren>	 paravoid: Yeah, it's not coming up.
[14:50:29] <Coren>	 paravoid: I think the hw issue may have been deeper than just a badly seated card after all.
[14:51:15] <icinga-wm>	 PROBLEM - Host labvirt1008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:53:46] <anomie>	 manybubbles, ^d, thcipriani, marktraceur: Who wants to SWAT this morning?
[14:53:46] <manybubbles>	 too busy today - sorry:(
[14:53:46] <yurik>	 akosiaris, hi, can you help with the trebuchet usage?
[14:53:46] <Coren>	 I'm trying to poke in the BIOS to see if I can get the EFI alerts to figure out what's up
[14:53:46] <anomie>	 matt_f_night, Dereckson: Ping for SWAT in about 7 minutes
[14:53:46] <icinga-wm>	 RECOVERY - Host labstore1002 is UPING OK - Packet loss = 0%, RTA = 2.90 ms
[14:53:46] <matt_flaschen>	 Present
[14:53:56] <icinga-wm>	 RECOVERY - Host labvirt1008 is UPING OK - Packet loss = 0%, RTA = 0.53 ms
[14:54:20] <thcipriani>	 anomie: can do swat, a little worried about my internet at the moment, but I think it's just a little sluggish.
[14:54:25] <anomie>	 thcipriani: ok
[14:54:38] <^d>	 busy as well
[14:55:35] <thcipriani>	 matt_flaschen: can you go ahead and merge your extension changes and bump the submodule on core for wmf{3,4}
[14:55:44] <matt_flaschen>	 thcipriani, yes, already in progress.
[14:55:51] <thcipriani>	 matt_flaschen: thanks!
[14:56:30] <Coren>	 Allright, I managed to have it boot at the third hardreset but I now call it officially suspect.
[14:57:45] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[14:57:47] <Coren>	 ...  aaaand it gets tons of IO errors.
[14:57:56] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[14:59:22] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: statsite: decommission class [puppet] - 10https://gerrit.wikimedia.org/r/208635 (https://phabricator.wikimedia.org/T95687) 
[15:00:04] <jouncebot>	 manybubbles, anomie, ^d, thcipriani, marktraceur, matt_flaschen, Dereckson: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150505T1500). Please do the needful.
[15:00:30] <icinga-wm>	 ACKNOWLEDGEMENT - Host labstore1002 is DOWN: PING CRITICAL - Packet loss = 100% Coren Powered down - Hardware issues
[15:00:32] <godog>	 paravoid: mhh I'm seeing only the 500 alert? which should go away once https://gerrit.wikimedia.org/r/#/c/208083/ is merged
[15:00:51] <paravoid>	 godog: and a bunch of UNKNOWNs
[15:02:46] <icinga-wm>	 PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[15:02:48] <godog>	 paravoid: true, the jobq ones are https://gerrit.wikimedia.org/r/#/c/207785/ just added you if you want to take a shot
[15:06:28] <thcipriani>	 matt_flaschen: seems to keep failing on EchoEmailFormatterTest::testEmailFormatter
[15:06:53] <matt_flaschen>	 thcipriani, yeah, I'm trying to figure out why.  I didn't change anything related to that.
[15:07:45] <icinga-wm>	 RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[15:08:21] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "IMO we can leave diamond enabled and disable the statsd reporter/handler instead, optionally report to disk" [puppet] - 10https://gerrit.wikimedia.org/r/208924 (https://phabricator.wikimedia.org/T98121) (owner: 10Hashar)
[15:09:27] <Dereckson>	 Hi.
[15:09:55] <thcipriani>	 Dereckson: howdy
[15:10:07] <thcipriani>	 ready to get your config change out the door?
[15:10:20] <Dereckson>	 I'm fine, thanks, and yes I'm ready.
[15:11:05] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208634 (https://phabricator.wikimedia.org/T97995) (owner: 10Dereckson)
[15:11:14] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add medialib.naturalis.nl to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208634 (https://phabricator.wikimedia.org/T97995) (owner: 10Dereckson)
[15:13:55] <thcipriani>	 hmm: ^d anything happening with gerrit being sluggish today? This git fetch took 1m 18s
[15:14:58] <_joe_>	 thcipriani: it's JAVA(TM)
[15:15:16] <_joe_>	 pick your daily garbage collection fuckup/nullpointer exception
[15:15:42] <_joe_>	 I'm sure the gerrit logs have very informative 300 lines stack traces
[15:15:58] <_joe_>	 for things totally unrelated to the actual problems, too
[15:16:55] <logmsgbot>	 !log thcipriani Synchronized wmf-config/InitialiseSettings.php: Add medialib.naturalis.nl to wgCopyUploadsDomains [[gerrit:208634]] (duration: 00m 26s)
[15:17:03] <wikibugs>	 6operations, 5wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1260743 (10fgiunchedi) [] match swift capacity to eqiad (+3 machines ATM) and mirror thumbs too
[15:17:12] <Dereckson>	 Testing.
[15:17:18] <thcipriani>	 thanks
[15:17:41] <Dereckson>	 Works.
[15:17:53] <thcipriani>	 _joe_: fair. OK, rephrase: anything _more_ wrong with gerrit than normal today :)
[15:18:00] <thcipriani>	 Dereckson: cool. thanks.
[15:18:15] <Dereckson>	 Thanks for the deploy.
[15:18:45] <_joe_>	 thcipriani: yeah my point is that the final answer will be that :P
[15:19:01] <Dereckson>	 _joe_: Your java punchline makes me smile, I should feel ashamed
[15:19:07] <matanya>	 outage ?
[15:19:13] <matanya>	 (Cannot access the database: Can't connect to MySQL server on '10.64.16.22' (4) (10.64.16.22))
[15:19:59] <_joe_>	 matanya: that's a labs address
[15:20:03] <_joe_>	 where are you seeing that?
[15:20:08] <matanya>	 on he.wiki
[15:20:16] <matanya>	 that is even worse then
[15:20:29] <_joe_>	 matanya: some gadget you installed?
[15:20:31] <_joe_>	 maybe?
[15:20:38] <_joe_>	 try in an incognito window?
[15:20:38] <matanya>	 maybe, but not recently
[15:20:45] <matanya>	 doing
[15:20:51] <_joe_>	 or tell me the steps to reproduce
[15:20:53] <wikibugs>	 6operations, 7Graphite, 5Patch-For-Review: deprecate mwprof from puppet and gerrit - https://phabricator.wikimedia.org/T97509#1260745 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi change merged, bye bye mwprof
[15:21:13] <matanya>	 seems to be working in incognito
[15:21:27] <matanya>	 probably some gadget
[15:21:29] <_joe_>	 matanya: then it's some gadget that works from labs :)
[15:22:11] <matanya>	 thanks
[15:23:48] <matt_flaschen>	 thcipriani, okay, I'm going to withdraw our SWAT.
[15:24:15] <_joe_>	 Dereckson: why should you feel ashamed? :)
[15:24:37] <thcipriani>	 matt_flaschen: OK, sounds good.
[15:25:04] <_joe_>	 Dereckson: I have 5+ years of kicking jvm apps out of garbage collecions under my belt. I earned the right to mock java, the jvm, and all the ecosystem
[15:25:07] <_joe_>	 :)
[15:25:45] <icinga-wm>	 PROBLEM - Host analytics1037 is DOWN: PING CRITICAL - Packet loss = 100%
[15:25:55] <_joe_>	 ottomata: is that you ^^
[15:27:13] <grrrit-wm>	 (03PS1) 10BryanDavis: logstash: remove extra $::ganglia_aggregator [puppet] - 10https://gerrit.wikimedia.org/r/208973 
[15:29:07] <ottomata>	 _joe_: that is cmjohnson1 and I
[15:29:09] <ottomata>	 thanks
[15:32:11] <greg-g>	 thcipriani: are you done with swat then?
[15:32:31] <thcipriani>	 the window is still open, but all code that was scheduled is deployed.
[15:32:36] <greg-g>	 coolio
[15:33:15] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0]
[15:33:35] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[15:35:25] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] logstash: remove extra $::ganglia_aggregator [puppet] - 10https://gerrit.wikimedia.org/r/208973 (owner: 10BryanDavis)
[15:40:48] <icinga-wm>	 RECOVERY - Host analytics1037 is UPING OK - Packet loss = 0%, RTA = 2.02 ms
[15:41:01] <cmjohnson1>	 ottomata: looks good!
[15:41:32] <ottomata>	 ok thanks so much, sorry for the false alarm, dunno what was going on there
[15:41:44] <ottomata>	 cool, can confirm it is back in teh cluster
[15:41:45] <cmjohnson1>	 oh.no false alarm...the disk was missing..idk how
[15:42:24] <cmjohnson1>	 but it's back now
[15:44:48] <wikibugs>	 6operations, 10ops-eqiad: /dev/sdm not loading on analytics1037 - https://phabricator.wikimedia.org/T98081#1260802 (10Cmjohnson) 5Open>3Resolved VD13 was missing and the disk was in foreign cfg.   I cleared the foreign config and re-created the virtual disk.  While the server was down I also did some firmw...
[15:48:35] <wikibugs>	 6operations, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Investigation if Fluorine needs bigger disks or we retain too much data - https://phabricator.wikimedia.org/T92417#1260810 (10fgiunchedi) `sdb` swapped, root and swap arrays rebuilt already, data arrays rebuilding  ``` md5 : active raid1 sdb4[1] sda4...
[15:52:44] <andrewbogott>	 bblack: Ahah!  Ganglia /is/ showing metrics for labvirt1007/1008 but it’s displaying them under their old names, virt1011/1012.
[15:54:31] <grrrit-wm>	 (03PS5) 10Giuseppe Lavagetto: hiera: Add a proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/207128 (https://phabricator.wikimedia.org/T93776) 
[15:54:35] <grrrit-wm>	 (03Draft1) 10Hashar: (WIP) vmbuilder with puppet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/208939 
[15:56:19] <yurik>	 akosiaris, ping
[15:58:15] <yurik>	 ottomata, do you know much about using trebuchet? I need to update graphoid service in prod, but would like someone to hold my hand for the first time ))
[15:59:19] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: Add a proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/207128 (https://phabricator.wikimedia.org/T93776) (owner: 10Giuseppe Lavagetto)
[16:01:40] <_joe_>	 ok, since nothing is failing horribly (and it shouldn't, really) I'm going off now.
[16:10:37] <wikibugs>	 6operations, 5Patch-For-Review: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1260862 (10RobH)
[16:10:37] <godog>	 andrewbogott: thoughts on https://gerrit.wikimedia.org/r/#/c/205553/ ?
[16:10:53] <wikibugs>	 6operations, 5Patch-For-Review: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1260863 (10RobH) a:5RobH>3Springle
[16:11:03] <andrewbogott>	 godog: I’m in a meeting but I will look.  I sorta thought we fixed that already...
[16:12:34] <godog>	 andrewbogott: ack, no we didn't, hence the code review
[16:15:03] <wikibugs>	 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie  and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1260868 (10Ottomata) 3NEW a:3Ottomata
[16:15:24] <wikibugs>	 6operations, 10Analytics-Cluster, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1260877 (10Ottomata)
[16:15:25] <wikibugs>	 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie  and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1260876 (10Ottomata)
[16:15:42] <wikibugs>	 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie  and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1260868 (10Ottomata)
[16:15:43] <wikibugs>	 6operations, 10Analytics-Cluster, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1115779 (10Ottomata)
[16:15:51] <wikibugs>	 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie  and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1260868 (10Ottomata)
[16:15:52] <wikibugs>	 6operations, 10Analytics-Cluster, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1115779 (10Ottomata)
[16:18:23] <wikibugs>	 6operations, 6Project-Creators, 7Documentation: create #vm-requests (a production vm cluster request project similar to #hardware-requests) - https://phabricator.wikimedia.org/T97330#1260882 (10RobH) 5Open>3Resolved I neglected the project creator link, my bad.  (I put this task in specifically for that!...
[16:18:25] <wikibugs>	 6operations, 7Documentation: Create documentation on the requesting/allocation of virtual machines in the misc cluster - https://phabricator.wikimedia.org/T97072#1260884 (10RobH)
[16:19:35] <grrrit-wm>	 (03PS1) 10BryanDavis: Update statsd events [tools/scap] - 10https://gerrit.wikimedia.org/r/208987 (https://phabricator.wikimedia.org/T64667) 
[16:22:41] <grrrit-wm>	 (03PS1) 10Ori.livneh: wmgUseBits: default => false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 
[16:22:47] <ori>	 bblack: ^
[16:23:35] <grrrit-wm>	 (03CR) 10BryanDavis: Update statsd events (032 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/208987 (https://phabricator.wikimedia.org/T64667) (owner: 10BryanDavis)
[16:24:07] <James_F>	 ori: Gosh.
[16:25:02] <paravoid>	 wait, we're done with bits already?
[16:25:06] <paravoid>	 that was fast
[16:25:36] <James_F>	 paravoid: Right now it's off for it,de,nl,ru,es-wiki and a few small ones.
[16:25:40] <paravoid>	 I know
[16:25:52] <James_F>	 paravoid: Off for enwiki too might be a prudent penultimate step, but…
[16:26:11] <yurik_>	 greg-g, i have moved up the graph ext depl to the slot 9:30-11am. Hope its ok (it is in betalabs as requested :))
[16:26:47] <wikibugs>	 6operations, 7Documentation: Create documentation on the requesting/allocation of virtual machines in the misc cluster - https://phabricator.wikimedia.org/T97072#1260903 (10RobH) a:5RobH>3akosiaris Ok, #vm-requests now exists, identical to hardware requests.  I've updated https://wikitech.wikimedia.org/wik...
[16:27:04] <yurik_>	 greg-g,  it was previously scheduled at 1pm, but its a bit late for this tz
[16:27:44] <greg-g>	 "beta cluster"
[16:28:51] <grrrit-wm>	 (03PS4) 10Gage: ipsec-global: fix bug in non-verbose mode, exit if not root [puppet] - 10https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) 
[16:28:58] <greg-g>	 yurik_: kk
[16:29:40] <legoktm>	 where is the "<link rel="dns-prefetch" href="//meta.wikimedia.org" />" coming from on https://www.mediawiki.org/wiki/MediaWiki ?
[16:29:42] <grrrit-wm>	 (03CR) 10Gage: [C: 032] ipsec-global: fix bug in non-verbose mode, exit if not root [puppet] - 10https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) (owner: 10Gage)
[16:30:24] <legoktm>	 ori: I'm guessing you might know ^?
[16:31:10] <ori>	 legoktm: CentralNotice IIRC
[16:31:31] <legoktm>	 yup, thanks :)
[16:31:41] <akosiaris>	 yurik_: yes I am now around
[16:31:48] <akosiaris>	 yurik_: re trebuchet
[16:32:02] <yurik_>	 akosiaris, awesoem, i have the next 1.5 hrs of depl time, would love your assistance
[16:32:21] <grrrit-wm>	 (03PS10) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) 
[16:32:29] <yurik_>	 akosiaris, can you do hangout ?
[16:32:45] <akosiaris>	 yup
[16:33:18] <Coren>	 James_F: If you want Neil to get his access to stat1003 can you poke him and point him at the ticket?  There are actions from him needed.
[16:34:54] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1260936 (10coren) @trevorparscal: With your approval language, this will be good to go.
[16:35:42] <James_F>	 Coren: neilpquinn is pinged.
[16:37:16] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Editing-Department: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1260945 (10Jdforrester-WMF)
[16:37:32] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Give Neil Quinn access to stats1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97746#1260947 (10Jdforrester-WMF)
[16:39:15] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] Update statsd events [tools/scap] - 10https://gerrit.wikimedia.org/r/208987 (https://phabricator.wikimedia.org/T64667) (owner: 10BryanDavis)
[16:44:39] <wikibugs>	 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1260966 (10fgiunchedi) for anything that's countable as lines that'll help getting us off udp2log so I think it'll work. For anything more complex than that (e.g. timings) I think we'll have to roll some...
[16:45:42] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Editing-Department: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1260968 (10coren) p:5Triage>3Normal
[16:46:06] <icinga-wm>	 RECOVERY - Parsoid on wtp2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.115 second response time
[16:46:16] <icinga-wm>	 RECOVERY - puppet last run on wtp2003 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures
[16:46:25] <icinga-wm>	 RECOVERY - Parsoid on wtp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.117 second response time
[16:47:15] <icinga-wm>	 RECOVERY - puppet last run on wtp2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:47:27] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1260970 (10coren) @dr0ptp4kt: Care to place your imprimatur on this?
[16:48:01] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 031] "I can't claim to understand what this does, but it's fine w/me :)" [puppet] - 10https://gerrit.wikimedia.org/r/205553 (https://phabricator.wikimedia.org/T92712) (owner: 10Filippo Giunchedi)
[16:48:08] <grrrit-wm>	 (03PS1) 10Dzahn: admin: add shell user for guillom [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) 
[16:48:14] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1260974 (10dr0ptp4kt) Approved.
[16:48:25] <dr0ptp4kt>	 @Coren ^
[16:48:44] <mutante>	 @Coren ^^
[16:49:34] <wikibugs>	 6operations, 10Traffic, 7discovery-system: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1260975 (10Joe) So, given we chose to go ahead with etcd, we will use [[ https://github.com/kelseyhightower/confd | confd  ]] for writing a single...
[16:49:42] <Coren>	 Yup.  I saw.  :-)
[16:49:46] <grrrit-wm>	 (03CR) 10Dzahn: [C: 04-1] "needs the UID, which we would match with the LDAP/labs/wikitech user, but there "guillom" doesn't exist yet, @guillom can you create that " [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn)
[16:50:28] <yurik_>	 !log deployed latest graphoid 0.1.3 service
[16:50:37] <morebots>	 Logged the message, Master
[16:50:56] <Coren>	 mutante: I would have created the account on Wikitech myself, simply.  :-)
[16:51:23] <guillom>	 mutante: I have an account on Wikitech/Labs. Could it be a capitalization issue?
[16:51:29] <mutante>	 Coren: eh, i just noticed the issue seems to be entirely different.. the ldap tools on terbium dont work?
[16:51:46] <Coren>	 mutante: I don't know, that's not where I check from normally.  :-)
[16:51:54] <guillom>	 heh
[16:52:08] <grrrit-wm>	 (03CR) 10MaxSem: [C: 031] Enable fallback graphoid service for non-js client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207631 (owner: 10Yurik)
[16:52:28] <Coren>	 mutante: His username is 'gpaumier'
[16:52:30] <mutante>	 it's kind of the official place
[16:52:35] <Coren>	 (uid 2047)
[16:52:38] <mutante>	 there's role for ldap tools and admins
[16:52:38] <grrrit-wm>	 (03CR) 10Yurik: [C: 032] Enable fallback graphoid service for non-js client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207631 (owner: 10Yurik)
[16:52:46] <guillom>	 Oh, ugh. Sorry about the mixup.
[16:52:51] <mutante>	 aah! that explains 
[16:53:03] <Coren>	 mutante: No, it works fine there too.  Username mismatch.  :-)
[16:53:09] <mutante>	 yes, indeed
[16:53:09] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Assign graphoid-admin to the SCA cluster [puppet] - 10https://gerrit.wikimedia.org/r/208998 
[16:53:32] <mutante>	 so, ehm.. usually i would recommend the same name for production but shrug?
[16:53:45] <guillom>	 mutante: I'm fine with either
[16:53:54] <wikibugs>	 6operations, 10Traffic, 7discovery-system: Figure out an etcd deploy strategy that includes multi DC failure scenarios. - https://phabricator.wikimedia.org/T98165#1260982 (10Joe) 3NEW
[16:54:06] <guillom>	 I I just didn't realize that the labs username was different from the wikitech username
[16:54:07] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1260988 (10TrevorParscal) Approved.
[16:54:11] <guillom>	 Too many usernames!
[16:54:30] <mutante>	 yea, it's "shell name" vs. "wiki name"
[16:54:48] <guillom>	 Sorry for the misunderstanding
[16:54:53] <mutante>	 and then in puppet for production shell there is the resource name and another "name: " and "realname"
[16:55:05] <mutante>	 :) no worries. amending
[16:55:10] <guillom>	 I'm fine with "gpaumier"
[16:55:15] <mutante>	 ok
[16:55:19] <guillom>	 Thanks :)
[16:55:37] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1260992 (10coren) @niedzielski: Please post a SSH key you will use, and review and sign L3
[16:55:42] <grrrit-wm>	 (03PS2) 10Dzahn: admin: add shell user for guillom [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) 
[16:57:02] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1260995 (10Dzahn) wikitech user is gpaumier. uploaded a patch that creates a "gpaumier" user in prod.
[16:58:34] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable fallback graphoid service for non-js client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207631 (owner: 10Yurik)
[16:58:38] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add andyrussg to udp2log-users group to allow him to verify kafkatee generated fundraising log files on erbium - https://phabricator.wikimedia.org/T97860#1260999 (10coren) @k4-713: Can I get your approval language for this, please?
[16:59:24] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add andyrussg to udp2log-users group to allow him to verify kafkatee generated fundraising log files on erbium - https://phabricator.wikimedia.org/T97860#1261001 (10K4-713) Approved!
[17:00:49] <logmsgbot>	 !log yurik Synchronized wmf-config/CommonSettings.php: Enable graphoid noscript fallback for graph ext (duration: 00m 20s)
[17:00:55] <morebots>	 Logged the message, Master
[17:01:07] <grrrit-wm>	 (03PS1) 10Dzahn: admin: add gpaumier to ana-priv-data and bastion [puppet] - 10https://gerrit.wikimedia.org/r/209001 (https://phabricator.wikimedia.org/T98077) 
[17:02:34] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add andyrussg to udp2log-users group to allow him to verify kafkatee generated fundraising log files on erbium - https://phabricator.wikimedia.org/T97860#1261007 (10coren) a:3coren
[17:02:56] <grrrit-wm>	 (03PS1) 10coren: Add andyrussg to udp2log-users [puppet] - 10https://gerrit.wikimedia.org/r/209002 (https://phabricator.wikimedia.org/T97860) 
[17:03:19] <grrrit-wm>	 (03CR) 10John F. Lewis: [C: 031] admin: add shell user for guillom [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn)
[17:03:48] <grrrit-wm>	 (03CR) 10John F. Lewis: [C: 031] admin: add gpaumier to ana-priv-data and bastion [puppet] - 10https://gerrit.wikimedia.org/r/209001 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn)
[17:04:14] <grrrit-wm>	 (03PS3) 10Dzahn: admin: add shell user for guillom [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) 
[17:04:38] <AndyRussG|taxi>	 Coren: merci bien!
[17:04:42] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1261013 (10Niedzielski)     ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCvAn78DE2Wj3QZFQSYe19eAqcGWZXQZA/TPuDjFtSdBU9yqsdcWUzfpN8ZN+dpvvQyBLbKf2MxYD2ghoo0WdUdcRoxB/7XyP5xsHLW4BRYtEf0XlPP9uC...
[17:05:54] <wikibugs>	 6operations, 10ops-eqiad: Failed disk db1004 - https://phabricator.wikimedia.org/T97814#1261019 (10Cmjohnson) 5Open>3Resolved disk 10 is online.
[17:06:04] <grrrit-wm>	 (03CR) 10coren: [C: 032] "Simple group addition." [puppet] - 10https://gerrit.wikimedia.org/r/209002 (https://phabricator.wikimedia.org/T97860) (owner: 10coren)
[17:07:56] <grrrit-wm>	 (03PS4) 10Dzahn: admin: add shell user for guillom [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) 
[17:08:16] <Steinsplitter>	 guillom :-)
[17:10:17] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "doing this part because it just sets up the user without groups. leaving the other one as the actual access requests" [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn)
[17:11:09] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "it'll make sure cgroup-bin is installed and all cgroups are where mw's limit.sh expects them, will merge tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/205553 (https://phabricator.wikimedia.org/T92712) (owner: 10Filippo Giunchedi)
[17:12:08] <grrrit-wm>	 (03PS2) 10Dzahn: admin: add gpaumier to ana-priv-data and bastion [puppet] - 10https://gerrit.wikimedia.org/r/209001 (https://phabricator.wikimedia.org/T98077) 
[17:13:06] <gwicke>	 akosiaris: could you check https://wikitech.wikimedia.org/wiki/Graphoid for correctness?
[17:13:20] <grrrit-wm>	 (03CR) 10Dzahn: "@Coren this one for you then to confirm" [puppet] - 10https://gerrit.wikimedia.org/r/209001 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn)
[17:13:59] <wikibugs>	 6operations, 10hardware-requests: Eqiad: 1 hardware access request for puppetmaster service scale out - https://phabricator.wikimedia.org/T98166#1261036 (10akosiaris) 3NEW
[17:14:44] <akosiaris>	 gwicke: yeah, on it
[17:15:00] <gwicke>	 akosiaris: thanks!
[17:15:45] <icinga-wm>	 PROBLEM - puppet last run on mw2031 is CRITICAL puppet fail
[17:16:27] <grrrit-wm>	 (03CR) 10Dzahn: [C: 031] "some might argue whether this should be in operations/software or here if it's not used by puppet, but i think it would be fine" [puppet] - 10https://gerrit.wikimedia.org/r/208395 (owner: 10Alex Monk)
[17:17:10] <grrrit-wm>	 (03CR) 10Dzahn: "+chasemp" [puppet] - 10https://gerrit.wikimedia.org/r/208395 (owner: 10Alex Monk)
[17:17:39] <wikibugs>	 6operations, 5Patch-For-Review: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1261053 (10akosiaris)
[17:17:58] <wikibugs>	 6operations, 10hardware-requests: Eqiad: 1 hardware access request for puppetmaster service scale out - https://phabricator.wikimedia.org/T98166#1261057 (10akosiaris)
[17:18:00] <wikibugs>	 6operations, 5Patch-For-Review: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1260135 (10akosiaris)
[17:18:15] <icinga-wm>	 PROBLEM - puppet last run on analytics1016 is CRITICAL Puppet has 1 failures
[17:19:18] <robh>	 akosiaris: half of hte time the requestor of the hardware knows the vlan
[17:19:27] <robh>	 and if they dont specify, i end up asking anyhow
[17:19:31] <robh>	 so im gonna keep asking ;D
[17:19:43] <robh>	 so you dont want a much better system for this eh?  just same as palladium...
[17:20:01] <akosiaris>	 well, mem/disk would be a waste
[17:20:15] <akosiaris>	 and we do put into good use our older hardware
[17:20:28] <akosiaris>	 but if you got a box with more CPU and same disk/mem specs
[17:20:30] <akosiaris>	 I 'd be happy
[17:20:52] <akosiaris>	 btw, how am I supposed to know the vlan in this case ?
[17:21:02] <akosiaris>	 you can pick up any box from any row
[17:21:09] <akosiaris>	 s/pick up/pick/
[17:21:17] <akosiaris>	 and VLANs are according to rows :-)
[17:21:47] <robh>	 public, private, labs, analytics, sandbox
[17:22:08] <wikibugs>	 6operations, 10Analytics-Cluster, 5Patch-For-Review: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1261065 (10coren)
[17:22:10] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add andyrussg to udp2log-users group to allow him to verify kafkatee generated fundraising log files on erbium - https://phabricator.wikimedia.org/T97860#1261063 (10coren) 5Open>3Resolved This should be applied now, or will be very shortly (next pupp...
[17:22:12] <robh>	 and if its analytics or labs, then its row dependent on where i can put it since htey dont exist in all rows
[17:22:32] <robh>	 but, if they dont know, they put 'i dont know' and answer further questions to help me decide
[17:22:49] <robh>	 I thought about putting all the vlan options in there, but they change and I don't want to replicate that work.
[17:23:14] <Krenair>	 where is the authoritative list of vlans?
[17:23:27] <robh>	 good question, no idea.
[17:23:35] <robh>	 so the switches can have them, or the dns template files
[17:23:45] <robh>	 but, just having it in one doesnt mean its in the other
[17:23:50] <robh>	 so you have to look at both and just figure it out
[17:24:05] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[17:24:15] <JohnLewis>	 akosiaris: I just laughed at "bad question. This is a technicality that should be added by Ops, not request the user to provide it." :)
[17:24:17] <robh>	 or, if you happen to have another machine, like a sister/mirror, folks put things like 'same vlan as server x'
[17:24:25] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[17:24:41] <robh>	 im glad you guys actually read my email about all these things i sent out months ago =[
[17:24:50] * robh isnt sure why he bothers to update docs
[17:25:35] <JohnLewis>	 robh: because then you can say 'go look at the docs' while you do more useful work
[17:25:39] <akosiaris>	 robh: cause it's good ?
[17:25:56] <akosiaris>	 I actually did read them btw, I just missed the VLAN part
[17:25:57] <godog>	 also because every other time people don't ask you it is harder to notice
[17:27:11] <wikibugs>	 6operations, 10hardware-requests: Eqiad: 1 hardware access request for puppetmaster service scale out - https://phabricator.wikimedia.org/T98166#1261085 (10RobH) a:3RobH
[17:27:51] <robh>	 i dont get why its a bad question when half the folks who request it know the answer.  if they dont, they say so, and we move on.
[17:28:10] <robh>	 so im keeping it since i have to process the requests.
[17:28:38] <robh>	 akosiaris: you would say 'whatever vlan the puppetmaster needs to be in'
[17:28:41] <robh>	 and you're set.
[17:29:05] <robh>	 im not asking for a checklist of vlan names that i expect folks to have memorized, or i'd have listed them all.
[17:29:13] <akosiaris>	 hehe... which is any btw 
[17:29:18] <robh>	 no
[17:29:21] <robh>	 if i put it in labs
[17:29:22] <akosiaris>	 because it can be in any private vlan
[17:29:23] <robh>	 your fucked
[17:29:24] <robh>	 so its not any.
[17:29:29] <akosiaris>	 ahahaha
[17:29:40] <robh>	 and half the requests in the past 6 months have been labs or analytics, which are special vlans
[17:29:41] <akosiaris>	 come on, you wouldn't do that now, would you ?
[17:29:44] <robh>	 so yes, i could remove it
[17:29:48] <robh>	 and then we can have the back and forth
[17:30:03] <robh>	 akosiaris: the oppostie has happened twice though
[17:30:11] <akosiaris>	 sigh
[17:30:11] <robh>	 folks want a machine, i put in default private, and it needs to be somehthing else
[17:30:23] <robh>	 so rather than halfing a 24/48 hour back and forth on every single ticket
[17:30:34] <robh>	 i ask in the initial form, if they dont know, they know im going to ask questions to determine it.
[17:30:34] <akosiaris>	 ok I get your point
[17:30:49] <akosiaris>	 what I am saying is that asking the VLAN explicitly is a technicallity
[17:30:52] <wikibugs>	 6operations, 10ops-eqiad, 10Incident-20141130-eqiad-C4: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1261097 (10faidon) OK, I also `request virtual-chassis mode mixed` the switch and rebooted it to handle it being added into a mixed VC.  The steps for tomorrow would be: 1) @Cmjohns...
[17:31:08] <akosiaris>	 that will confuse some ppl
[17:31:15] <robh>	 akosiaris: then rewrite a form that covers all the info i need and reduces the back and forth accordingly
[17:31:18] <akosiaris>	 but we could rephrase it 
[17:31:30] <robh>	 because what i have now is what ive gotten to with no one bothering to give feevback when i ask
[17:31:46] <akosiaris>	 cause the best feedback is when you don't ask :P
[17:31:53] <grrrit-wm>	 (03PS1) 10Yurik: Enabled graph ext on all wikis except wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209009 
[17:32:06] <akosiaris>	 anyway, I can rewrite that form anyway, I have to review the one for vm-requests
[17:32:21] <robh>	 well, just please check with me, since i have to deal with it
[17:32:21] <akosiaris>	 so I 'll do both. I anyway have to answer the same question for VMs
[17:32:26] <robh>	 you can handle the vm requests however you like
[17:32:26] <akosiaris>	 ok
[17:32:41] <akosiaris>	 but not today. 20:30 here, I am signing off 
[17:33:46] <icinga-wm>	 RECOVERY - puppet last run on mw2031 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[17:33:54] <robh>	 akosiaris: sorry if i was snappy, im in a short mood today and didnt sleep
[17:34:00] <robh>	 not your fault, im just being too snappish.
[17:34:02] <Coren>	 I dunno what it is with admin/data/data.yaml but unless I turn syntax highlighting off it makes vim really, really mad
[17:34:35] <robh>	 my brain decided to wake me up every couple of hours for no reason.
[17:35:03] <mutante>	 Coren: it's something with the default script that highlights yaml 
[17:35:53] <grrrit-wm>	 (03CR) 10MaxSem: [C: 031] Enabled graph ext on all wikis except wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209009 (owner: 10Yurik)
[17:35:55] <robh>	 i agree we can rephrase the vlan question into something that is more clear and still provides the information required
[17:36:00] <mutante>	 Coren: http://stackoverflow.com/questions/20663169/vim-really-slow-with-long-yaml
[17:36:13] <mutante>	 Coren: https://github.com/stephpy/vim-yaml
[17:42:14] <grrrit-wm>	 (03PS2) 10Yuvipanda: zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 
[17:42:38] <logmsgbot>	 !log yurik Synchronized php-1.26wmf4/extensions/Graph: Cherrypicked Graph ext 209004 (duration: 00m 20s)
[17:42:46] <morebots>	 Logged the message, Master
[17:42:52] <grrrit-wm>	 (03CR) 10Yuvipanda: "Set the default only for labs, so it'll fail in prod if hiera isn't set for some reason (which is correct behaviour, I think)" [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda)
[17:43:36] <logmsgbot>	 !log yurik Synchronized php-1.26wmf3/extensions/Graph: Cherrypicked Graph ext 209004 (duration: 00m 16s)
[17:43:40] <morebots>	 Logged the message, Master
[17:44:17] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[17:45:09] <grrrit-wm>	 (03CR) 10Yurik: [C: 032] Enabled graph ext on all wikis except wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209009 (owner: 10Yurik)
[17:47:23] <grrrit-wm>	 (03PS1) 10coren: stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) 
[17:47:35] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0]
[17:48:03] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) (owner: 10coren)
[17:48:49] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enabled graph ext on all wikis except wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209009 (owner: 10Yurik)
[17:49:54] <wikibugs>	 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10RobH) 3NEW a:3RobH
[17:50:19] <wikibugs>	 6operations, 10hardware-requests: Eqiad: 1 hardware access request for puppetmaster service scale out - https://phabricator.wikimedia.org/T98166#1261200 (10RobH)
[17:50:23] <wikibugs>	 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261199 (10RobH)
[17:50:26] <logmsgbot>	 !log yurik Synchronized wmf-config/InitialiseSettings.php: Enable graph extension on all wikis except wikidata (duration: 00m 19s)
[17:50:34] <morebots>	 Logged the message, Master
[17:50:40] <wikibugs>	 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10RobH)
[17:50:42] <wikibugs>	 6operations, 5Patch-For-Review: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1261216 (10RobH)
[17:50:43] <wikibugs>	 6operations, 10hardware-requests: Eqiad: 1 hardware access request for puppetmaster service scale out - https://phabricator.wikimedia.org/T98166#1261211 (10RobH) 5Open>3Resolved allocating server rhodium: Dell PowerEdge R610, dual Intel Xeon X5647, 16 GB Memory  resolving this request.  Setup of system is...
[17:50:55] <yuvipanda>	 hi ottomata 
[17:51:01] <yuvipanda>	 I made it work by default on labs :D
[17:51:46] <ottomata>	 oh?
[17:51:58] <ottomata>	 ah new ptach
[17:52:11] <ottomata>	 hm ok that's fine
[17:52:44] <grrrit-wm>	 (03PS2) 10coren: stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) 
[17:52:46] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda)
[17:52:52] <ottomata>	 yuvipanda:  +2 didn't merge
[17:53:20] <wikibugs>	 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261260 (10RobH)
[17:53:22] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1261262 (10coren) All set for the 3-day period set to end May 8.
[17:54:39] <yuvipanda>	 ottomata: haha :D don’t do that :P you should +1 and not merge
[17:54:45] <yuvipanda>	 +2 == merge is how it works on all other repos...
[17:55:04] <ottomata>	 i thought +2 is looks good go ahead and merge. and submit is real merge
[17:55:37] <grrrit-wm>	 (03CR) 10Ottomata: [C: 031] zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda)
[17:55:50] <yuvipanda>	 ottomata: nah, if you +2 it you see it through, I think...
[17:55:54] <yuvipanda>	 it’s nto codified anywhere tho
[17:56:01] <yurik_>	 greg-g, its alive :)
[17:56:04] <ottomata>	 whaatevaahhh ok +1ed :_)
[17:56:05] <ottomata>	 :)
[17:56:34] <yuvipanda>	 ottomata: :D I’ll merge it later today?
[17:56:43] <yuvipanda>	 ottomata: what’s failure mode if ZK fails, btw? analytics cluster grinds to a halt?
[17:56:46] <yuvipanda>	 just checking!
[17:56:58] <ottomata>	 kafka explodes, thats all
[17:57:45] <ottomata>	 yuvipanda: , i think if zk dies, or if brokers get misconfigured with bad zk host data, they wont' be able to elect any leaders, and will stop accepting produce requests
[17:57:58] <yurik_>	 MaxSem, akosiaris thank you for your help!  I will write the announcement about the new capability soon
[17:58:21] <ottomata>	 i don't think there will be any consistency problems though, just data will stop flowing in from varnishkafkas
[17:58:43] <yuvipanda>	 ottomata: alright. I am sure several people will kill me if that happens, so I’ll puppet-compiler this and do it carefully :)
[17:59:22] <grrrit-wm>	 (03CR) 10Dzahn: [C: 031] stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) (owner: 10coren)
[17:59:24] <grrrit-wm>	 (03PS1) 10RobH: setting rhodium install parameters [puppet] - 10https://gerrit.wikimedia.org/r/209014 
[17:59:52] <ottomata>	 well, yuvipanda good news is, daemons aren't subscribed (better double check with zk)
[18:00:09] <grrrit-wm>	 (03CR) 10RobH: [C: 032] setting rhodium install parameters [puppet] - 10https://gerrit.wikimedia.org/r/209014 (owner: 10RobH)
[18:00:10] <yuvipanda>	 aaah
[18:00:10] <ottomata>	 so, if you make a bad config change, they shouldn't pick it up unless you manually restarted daemons
[18:00:10] <jouncebot>	 twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150505T1800).
[18:00:15] <yuvipanda>	 ottomata: ah, cool.
[18:00:27] <yuvipanda>	 ottomata: I’m going to go to a visa interview now tho
[18:00:32] <ottomata>	 k
[18:00:33] <twentyafterfour>	 ok deployment time
[18:00:40] <yuvipanda>	 ottomata: I’ll brb when the world ends
[18:00:44] <yuvipanda>	 or when the interview finishes
[18:00:46] <yuvipanda>	 whichever is first
[18:00:53] <ottomata>	 hmm, zk is subscribed
[18:00:55] <ottomata>	 kafkas aren't
[18:01:26] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1261409 (10coren) @ottomata: All clear from you, since this is a stats server?
[18:01:49] <ori>	 yuvipanda: good luck!
[18:02:47] <wikibugs>	 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261451 (10RobH)
[18:03:07] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1261454 (10coren) >>! In T97866#1253700, @bearND wrote: > While you're at it, please also add him as a member of  https://wikitech.wikimedia.org/wiki/Nova_Resourc...
[18:03:30] <wikibugs>	 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10RobH)
[18:04:38] <grrrit-wm>	 (03PS2) 10Ori.livneh: wmgUseBits: false for all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 
[18:04:40] <grrrit-wm>	 (03PS1) 10Ori.livneh: wmgUseBits: false for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209016 
[18:04:53] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] wmgUseBits: false for all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 (owner: 10Ori.livneh)
[18:06:05] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1261518 (10Ottomata) All clear from me, but it is not clear what services are being asked for.    See: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#A...
[18:06:48] <grrrit-wm>	 (03PS1) 1020after4: Group1 wikis to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209018 
[18:08:01] <wikibugs>	 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1261544 (10Ottomata) Ja, agreed.  One nice thing about this approach, is the statsd_sender thing reads whatever from stdin, so as long as whatever we come up with can do the same pipe thing, e.g.    varn...
[18:08:02] <Coren>	 Need moar tea
[18:09:12] <grrrit-wm>	 (03PS1) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 
[18:09:35] <twentyafterfour>	 wow for once I have something good to say about gerrit:   it does a good job of matching pretty-printed json against minimized json:  https://gerrit.wikimedia.org/r/#/c/209018/1/wikiversions.json,cm  (smart diff algorithm!)
[18:10:08] <grrrit-wm>	 (03CR) 1020after4: [C: 032] Group1 wikis to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209018 (owner: 1020after4)
[18:10:16] <grrrit-wm>	 (03Merged) 10jenkins-bot: Group1 wikis to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209018 (owner: 1020after4)
[18:12:03] <logmsgbot>	 !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: Group1 wikis to 1.26wmf4
[18:12:44] <grrrit-wm>	 (03CR) 10Krinkle: [C: 031] "Should this be part of a package shared with prod and other uses of MediaWiki?" [puppet] - 10https://gerrit.wikimedia.org/r/205553 (https://phabricator.wikimedia.org/T92712) (owner: 10Filippo Giunchedi)
[18:17:23] <wikibugs>	 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, and 2 others: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1261805 (10daniel) @JanZerebecki: Redirects are serialized like this:   {"entity":"Q23","redirect":"Q42"}  Old style...
[18:17:29] <Krinkle>	 ori: Can you rename wgAssetsHost to $wmgAssetsHost? I spent a minute looking for it in mediawiki-core after realising it was added to wmf-config only.
[18:17:36] <Krinkle>	 (or something like that)
[18:18:20] <grrrit-wm>	 (03PS1) 10Ottomata: [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 
[18:18:59] <bblack>	 ori: here now
[18:19:00] <yurik_>	 csteipp, all's good, new ver deployed, thx to akosiaris 
[18:19:03] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 (owner: 10Ottomata)
[18:19:06] <bblack>	 you decided to split it after all?
[18:19:48] <bblack>	 Krinkle: I think the AssetsHost thing is temporary anyways and will be gone later in the weekl
[18:19:55] <grrrit-wm>	 (03CR) 10Jdlrobson: [C: 031] "How can I get this deployed? Should I schedule it for SWAT ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson)
[18:19:59] <csteipp>	 yurik_: Cool. For tracking, do add a link to the patchset to the bug.
[18:20:04] <bblack>	 not sure, though.  I think wmgUseBits is temporary at least
[18:20:17] <Krinkle>	 well, that variable is named just fine :)
[18:21:18] <icinga-wm>	 PROBLEM - puppet last run on mw1137 is CRITICAL Puppet has 1 failures
[18:22:01] <wikibugs>	 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1261827 (10coren) 3NEW
[18:22:42] <twentyafterfour>	 ori: seems your pretty printing patch didn't work
[18:23:02] <grrrit-wm>	 (03CR) 10Alex Monk: "Or ask Greg for a specific deployment window, but yeah." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson)
[18:23:34] <wikibugs>	 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, and 2 others: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1261851 (10daniel) Btw, if someone can tell me where to find a full history dump of wikidata, I'd be happy to check t...
[18:23:40] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1261852 (10Slaporte)  >>! In T85141#1256595, @JohnLewis wrote: > Still pending an approval form @slaporte (or anyone else from legal who deals with data release).  You can proceed wit...
[18:23:58] <grrrit-wm>	 (03PS2) 10Ottomata: [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 
[18:24:38] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 (owner: 10Ottomata)
[18:25:28] <icinga-wm>	 PROBLEM - puppet last run on rhodium is CRITICAL Puppet has 15 failures
[18:25:57] <ori>	 twentyafterfour: what didn't work?
[18:26:14] <twentyafterfour>	 it didn't pretty print
[18:26:23] <ori>	 it wasn't expected to
[18:26:26] <twentyafterfour>	 oh
[18:26:30] <ori>	 it was expected to just not break :)
[18:26:36] <ori>	 and to pretty-print again once we update tin
[18:26:36] <twentyafterfour>	 ok then I guess it worked ;)
[18:26:47] <icinga-wm>	 PROBLEM - DPKG on rhodium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[18:26:55] <twentyafterfour>	 Sorry I didn't follow the discussion closely enough
[18:27:08] <grrrit-wm>	 (03PS3) 10Ori.livneh: wmgUseBits: false for all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 
[18:27:14] <ori>	 twentyafterfour: np, thanks for the reviews!
[18:27:31] <ori>	 bblack: yeah, why not.
[18:27:36] <ori>	 ready for all-but-enwiki?
[18:27:53] <twentyafterfour>	 ori: gladly, any time. though my local testing didn't catch the bug ;)
[18:28:24] <bd808>	 twentyafterfour: I guess you could post-process the json file with `python -m json.tool` to make it pretty before pushing up to gerrit for now
[18:28:36] <ori>	 bd808: i thought about that, yeah
[18:28:45] <twentyafterfour>	 bd808: gerrit's diff screen actually handles it ok
[18:28:52] <bd808>	 cool
[18:29:00] <twentyafterfour>	 it formats the diff so that I can read it well enough
[18:29:12] <grrrit-wm>	 (03PS2) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 
[18:29:40] <twentyafterfour>	 it even identified the changes between the pretty and non-pretty version, and highlighted just the version number digit that changed for each wiki
[18:29:54] <twentyafterfour>	 the first time I've ever been impressed with gerrit
[18:30:27] <twentyafterfour>	 gerrit praise == 1; gerrit gripes OVERFLOW
[18:31:28] <bblack>	 ori: yes
[18:31:32] <bblack>	 engage!
[18:32:05] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1261910 (10Dzahn) @qgil Does he have to sign L2?
[18:32:13] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] wmgUseBits: false for all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 (owner: 10Ori.livneh)
[18:32:19] <grrrit-wm>	 (03Merged) 10jenkins-bot: wmgUseBits: false for all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 (owner: 10Ori.livneh)
[18:33:44] <logmsgbot>	 !log ori Synchronized wmf-config/InitialiseSettings.php: I2ee277293: wmgUseBits: false for all but enwiki (duration: 00m 13s)
[18:33:51] <morebots>	 Logged the message, Master
[18:35:01] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1261923 (10csteipp) Using sanitizme.pl seems like the right way to redact this. If that was run, then should be ok for security bugs and deleted comments.
[18:36:21] <ori>	 <Krinkle> ori: Can you rename wgAssetsHost to $wmgAssetsHost? I spent a minute looking for it in mediawiki-core after realising it was added to wmf-config only. <-- yes
[18:36:22] <bblack>	 ori: can already see inbound bits traffic drop for esams: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Bits+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report
[18:37:09] <aude>	 twentyafterfour: there is an issue with wikidata
[18:37:24] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1261931 (10Dzahn) fwiw, i did the LDAP group part  [terbium:~] $ ldaplist -l group nda | grep chill  member: uid=multichill,ou=people,dc=wikimedia,dc=org  but the NDA volunteer process is  (...
[18:37:25] <bblack>	 and plummeting again now of course: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report
[18:37:36] <aude>	 can we  put wikidata back on 1.26wmf3 while i investigate?
[18:37:38] <bblack>	 I suspect eqiad won't see such dramatic swings until enwiki goes
[18:37:55] * aude tries to reproduce
[18:38:08] <icinga-wm>	 RECOVERY - puppet last run on mw1137 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[18:38:09] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1261933 (10Dzahn) @multichill would you mind signing L2 anyways? It has been approved by legal after you signed your original paper NDA afaict.
[18:38:50] <grrrit-wm>	 (03PS2) 10Ori.livneh: wmgUseBits: false for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209016 
[18:38:52] <grrrit-wm>	 (03PS1) 10Ori.livneh: Rename $wgAssetsHost to $wmgAssetsHost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209026 
[18:38:57] <aude>	 maybe can quickly fix instead
[18:39:07] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] Rename $wgAssetsHost to $wmgAssetsHost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209026 (owner: 10Ori.livneh)
[18:39:27] <ori>	 bblack: i feel good about going ahead with enwiki if you're up for it
[18:40:01] <bblack>	 yup may as well.  the primary caching is physically separate in practice anyways
[18:40:17] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] wmgUseBits: false for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209016 (owner: 10Ori.livneh)
[18:40:43] <grrrit-wm>	 (03PS3) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 
[18:41:07] <grrrit-wm>	 (03PS3) 10Ottomata: [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 
[18:41:44] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 (owner: 10Ottomata)
[18:42:17] <grrrit-wm>	 (03Merged) 10jenkins-bot: Rename $wgAssetsHost to $wmgAssetsHost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209026 (owner: 10Ori.livneh)
[18:42:18] <bblack>	 ori: my plan on the cached-bits-refs issues is basically wait for the bits traffic graphs to plane out a bit (probably within a day or two), then look at planning and/or executing some varnish bans with date cutoffs to get rid of the tail end.
[18:42:19] <grrrit-wm>	 (03Merged) 10jenkins-bot: wmgUseBits: false for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209016 (owner: 10Ori.livneh)
[18:43:41] <logmsgbot>	 !log ori Synchronized wmf-config: Ia98fc4c5d: wmgUseBits: false for enwiki (duration: 00m 17s)
[18:43:50] <morebots>	 Logged the message, Master
[18:44:27] <bblack>	 (and then we get into the "wtf is left" investigation at lower priority)
[18:44:39] <grrrit-wm>	 (03PS4) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 
[18:44:57] <icinga-wm>	 PROBLEM - Host rhodium is DOWN: PING CRITICAL - Packet loss = 100%
[18:45:18] <wikibugs>	 6operations, 5Patch-For-Review: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1261955 (10chasemp) serious question, have we ever considered going masterless?  With Etcd as a secret store I think it should be totally doable and allows faster rollouts our the infrast...
[18:45:45] <robh>	 grrrrr
[18:45:57] <robh>	 wtf, whoever decommissioned rhodium didnt do it right
[18:46:05] <robh>	 it wasnt wiped, an dit wasnt pulled from icinga.. wtf
[18:46:06] <bblack>	 <- not it!
[18:46:13] <robh>	 ocg system.
[18:46:22] <robh>	 i dont feel like digging to blame, but annoying =P
[18:47:02] <grrrit-wm>	 (03PS1) 10Ori.livneh: Update $wgULSFontRepositoryBasePath for post-bits world [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209027 
[18:47:18] <icinga-wm>	 PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[18:47:35] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] Update $wgULSFontRepositoryBasePath for post-bits world [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209027 (owner: 10Ori.livneh)
[18:47:40] <grrrit-wm>	 (03Merged) 10jenkins-bot: Update $wgULSFontRepositoryBasePath for post-bits world [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209027 (owner: 10Ori.livneh)
[18:47:47] <icinga-wm>	 PROBLEM - puppet last run on analytics1037 is CRITICAL Puppet last ran 1 day ago
[18:48:24] <bblack>	 ori: not /w/static/ on 209027?
[18:48:50] <ori>	 bblack: no, just static
[18:48:59] <ori>	 both work but /w/static/ is for back-compat
[18:49:16] <ori>	 example :https://en.wikipedia.org/static/current/extensions/UniversalLanguageSelector/resources/css/ext.uls.buttons.css
[18:49:18] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[18:49:28] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0]
[18:49:41] <logmsgbot>	 !log ori Synchronized wmf-config/CommonSettings.php: I5978a3910: Update $wgULSFontRepositoryBasePath for post-bits world (duration: 00m 18s)
[18:49:50] <bblack>	 ori: the hash stuff assumings /w/static/
[18:49:52] <morebots>	 Logged the message, Master
[18:49:55] <bblack>	 *assumes
[18:50:19] <ori>	 bblack: d'oh, you're right
[18:50:23] <bblack>	 I'll fix it
[18:50:27] <ori>	 no no
[18:50:58] <icinga-wm>	 RECOVERY - puppet last run on analytics1037 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures
[18:51:01] <grrrit-wm>	 (03PS1) 10BBlack: better static-assets regex for now [puppet] - 10https://gerrit.wikimedia.org/r/209028 
[18:51:08] <bblack>	 no no?
[18:51:33] <ori>	 your patch is fine, disregard. but maybe we can add a trailing slash since everything is migrated now?
[18:51:54] <bblack>	 I think that has to wait for all kinds of objects to fall of the cache first that ref the old paths
[18:52:05] <bblack>	 until I go muck with forcing them out, I think the cutoff is like 60d?
[18:52:11] <ori>	 30
[18:52:21] <ori>	 but ok, makes sense
[18:52:29] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 031] better static-assets regex for now [puppet] - 10https://gerrit.wikimedia.org/r/209028 (owner: 10BBlack)
[18:52:50] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] better static-assets regex for now [puppet] - 10https://gerrit.wikimedia.org/r/209028 (owner: 10BBlack)
[18:53:09] <grrrit-wm>	 (03PS5) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 
[18:56:04] <grrrit-wm>	 (03PS1) 10RobH: rhodium had wrong fqdn [puppet] - 10https://gerrit.wikimedia.org/r/209030 
[18:56:31] <bblack>	 ori: actually I had to go look again, but: 30 is our def TTL, but we don't seem to cap it (probably should)
[18:56:34] <grrrit-wm>	 (03CR) 10RobH: [C: 032] rhodium had wrong fqdn [puppet] - 10https://gerrit.wikimedia.org/r/209030 (owner: 10RobH)
[18:56:53] <bblack>	 I think the only way it could go higher would be with a backend's header saying to do so
[18:59:18] * bblack kick-starts salt, again
[19:11:54] <icinga-wm>	 PROBLEM - Varnishkafka log producer on cp3030 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[19:12:24] <bblack>	 ^ I'm already looking at cp3030
[19:13:04] <twentyafterfour>	 aude: sorry I didn't see your message, was offline for a few minutes due to power problems. Is there still an issue?
[19:14:24] <aude>	 twentyafterfour: we have a patch coming
[19:14:37] <aude>	 i'll want to deploy it asap and can do myself
[19:15:53] <twentyafterfour>	 aude: ok let me know if I can help
[19:16:41] <twentyafterfour>	 sorry I watched the fatalmonitor for a while but I didn't notice any issues
[19:16:49] <aude>	 twentyafterfour: it's an exception
[19:17:06] <aude>	 schema change not applied yet and code expecting some field to be there
[19:17:23] * aude waits for jenkins
[19:20:45] <yuvipanda>	 ottomata: gonna merge and babysuit patch now
[19:20:48] <yuvipanda>	 err
[19:20:49] <yuvipanda>	 babysit
[19:21:47] <twentyafterfour>	 aude: I've been trying to come up with an elegant solution to schema changes...for years... and I still can't come up with anything better than the migrations systems most of the big frameworks are using these days.  We should probably adopt something like that as well.
[19:21:54] <icinga-wm>	 RECOVERY - Varnishkafka log producer on cp3030 is OK: PROCS OK: 1 process with command name varnishkafka
[19:22:09] <aude>	 twentyafterfour: it is somewhat complicated since springle has to do most of them
[19:22:16] <aude>	 if it's modifying a large existing table
[19:22:34] <aude>	 would be nice if more people could handle them
[19:23:15] <twentyafterfour>	 yeah. There doesn't seem to be a really good solution for large scale deployment of sql schema changes
[19:23:38] <yuvipanda>	 !log disabled puppet on zookeeper hosts
[19:23:46] <twentyafterfour>	 a good dba just can't be automated it seems ;)
[19:23:47] <morebots>	 Logged the message, Master
[19:23:47] <grrrit-wm>	 (03PS3) 10Yuvipanda: zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 
[19:24:04] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda)
[19:24:31] <aude>	 adding tables is easy :)
[19:24:39] <aude>	 but not modifying
[19:24:49] <ottomata>	 yuvipanda:  ok
[19:25:17] <yuvipanda>	 ottomata: disabled puppet on the zookeeper hosts, and have one of the kafka servers open so I can see what puppet drags in
[19:25:47] <grrrit-wm>	 (03CR) 10Yuvipanda: [V: 032] zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda)
[19:26:47] <grrrit-wm>	 (03PS1) 10coren: Labs: Add jamvm explicitly on all flavours [puppet] - 10https://gerrit.wikimedia.org/r/209038 (https://phabricator.wikimedia.org/T98195) 
[19:27:26] <yuvipanda>	 ottomata: yup, all nop :D
[19:27:28] <yuvipanda>	 wheee
[19:27:38] <yuvipanda>	 ottomata: thanks for the review :)
[19:27:49] <mutante>	 ALTER table `dba` ADD column backup
[19:28:08] <ottomata>	 great, thanks yuvipanda :)
[19:28:43] <grrrit-wm>	 (03PS6) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 
[19:29:02] <yuvipanda>	 ottomata: do we have a HDFS role in prod that I can use?
[19:29:12] <ottomata>	 in pupppet you mean?
[19:29:15] <ottomata>	 to use in labs?
[19:29:15] <yuvipanda>	 ottomata: yeah
[19:29:17] <yuvipanda>	 yeah
[19:29:18] <ottomata>	 yes
[19:29:31] * aude shall deploy now
[19:29:43] <wikibugs>	 6operations, 7Shinken: Shinken hostname column is not large enough - https://phabricator.wikimedia.org/T1362#1262124 (10hashar)
[19:29:45] <yuvipanda>	 ottomata: how hard will it be, to, say, have a 5 node cluster?
[19:30:02] <ottomata>	 yuvipanda: not hard, but i haven't done work to make it work with hiera yet
[19:30:10] <yuvipanda>	 ottomata: ah, hmm.
[19:30:12] <ottomata>	 looking for instructinos
[19:30:14] <icinga-wm>	 PROBLEM - puppet last run on cp4019 is CRITICAL puppet fail
[19:30:26] <yuvipanda>	 ottomata: the zookeeper patch was essentially making it work with hiera :)
[19:30:43] <icinga-wm>	 PROBLEM - puppet last run on cp4001 is CRITICAL puppet fail
[19:30:44] <icinga-wm>	 PROBLEM - puppet last run on cp3041 is CRITICAL puppet fail
[19:30:51] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1262133 (10Multichill) >>! In T87097#1261933, @Dzahn wrote: > @multichill would you mind signing L2 anyways? It has been approved by legal after you signed your original paper NDA afaict.  Y...
[19:30:51] <ottomata>	 well, the hadoop role kinda works the same way
[19:30:53] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[19:30:53] <icinga-wm>	 PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail
[19:30:53] <icinga-wm>	 PROBLEM - puppet last run on cp3006 is CRITICAL puppet fail
[19:30:54] <icinga-wm>	 PROBLEM - puppet last run on cp3004 is CRITICAL puppet fail
[19:30:54] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[19:30:58] <ottomata>	 expecting global vars from labsconsole interface
[19:31:04] <icinga-wm>	 PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail
[19:31:12] <twentyafterfour>	 mutante: good one.
[19:31:18] <ottomata>	 joal: can you find that really  nice wiki page from qchris on how to set up hadoop in labs?
[19:31:24] <icinga-wm>	 PROBLEM - puppet last run on cp4005 is CRITICAL puppet fail
[19:31:40] <joal>	 ottomata: For sure, give me aminute
[19:31:40] <ottomata>	 ah, found it!
[19:31:42] <ottomata>	 https://wikitech.wikimedia.org/wiki/User:QChris/TestClusterSetup
[19:31:43] <ottomata>	 nm
[19:32:04] <icinga-wm>	 PROBLEM - puppet last run on cp4018 is CRITICAL puppet fail
[19:32:04] <icinga-wm>	 PROBLEM - puppet last run on cp3003 is CRITICAL puppet fail
[19:32:33] <icinga-wm>	 PROBLEM - puppet last run on cp3005 is CRITICAL puppet fail
[19:32:54] <ori>	 bblack: ^ ?
[19:33:23] <icinga-wm>	 PROBLEM - puppet last run on cp3009 is CRITICAL puppet fail
[19:33:46] <bblack>	 ESC[1;31mError: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item zookeeper_hosts in any Hiera data file and no default supplied at /etc/puppet/manifests/role/analytics/kafka.pp:109 on node cp3041.esams.wmnet
[19:33:50] <bblack>	 who broke it? :P
[19:33:53] <icinga-wm>	 PROBLEM - puppet last run on cp3035 is CRITICAL puppet fail
[19:33:59] <yuvipanda>	 bblack: ugh
[19:34:01] <yuvipanda>	 bblack: that’s me.
[19:34:03] <bblack>	 ok
[19:34:05] <icinga-wm>	 PROBLEM - puppet last run on cp3031 is CRITICAL puppet fail
[19:34:08] <yuvipanda>	 bblack: but, I have it in eqiad.yaml
[19:34:15] <yuvipanda>	 so that should work...
[19:34:15] <bblack>	 but these hosts are not in eqiad
[19:34:18] <yuvipanda>	 aaarggh
[19:34:19] <yuvipanda>	 I see
[19:34:21] <bblack>	 lol
[19:34:23] <yuvipanda>	 I didn’t know it was cross dc
[19:34:32] <yuvipanda>	 let me move that then
[19:34:36] <yuvipanda>	 I tested on a eqiad kafka host
[19:34:38] * aude waits for jenkins
[19:34:41] <yuvipanda>	 but I guess that’s the broker
[19:35:14] <icinga-wm>	 PROBLEM - puppet last run on cp3034 is CRITICAL puppet fail
[19:35:34] <icinga-wm>	 PROBLEM - puppet last run on cp3049 is CRITICAL puppet fail
[19:35:47] <bblack>	 !log rebooting cp3030 ...
[19:35:56] <grrrit-wm>	 (03PS1) 10Yuvipanda: kafka: Move hiera data for zookeepr hosts to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/209041 
[19:35:58] <morebots>	 Logged the message, Master
[19:36:19] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] kafka: Move hiera data for zookeepr hosts to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/209041 (owner: 10Yuvipanda)
[19:36:24] <icinga-wm>	 PROBLEM - puppet last run on cp4020 is CRITICAL puppet fail
[19:36:34] <icinga-wm>	 PROBLEM - puppet last run on cp3019 is CRITICAL puppet fail
[19:37:23] <icinga-wm>	 PROBLEM - puppet last run on cp4002 is CRITICAL puppet fail
[19:37:35] <icinga-wm>	 PROBLEM - puppet last run on cp4009 is CRITICAL puppet fail
[19:37:42] <yuvipanda>	 uhm
[19:37:44] <icinga-wm>	 PROBLEM - puppet last run on cp3039 is CRITICAL puppet fail
[19:37:59] <grrrit-wm>	 (03PS1) 10Mattflaschen: Flow should use VE by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209042 (https://phabricator.wikimedia.org/T98168) 
[19:38:03] <icinga-wm>	 PROBLEM - puppet last run on cp4012 is CRITICAL puppet fail
[19:38:04] <yuvipanda>	 bblack: can’t I get to these hosts from iron?
[19:38:13] <icinga-wm>	 PROBLEM - puppet last run on cp3018 is CRITICAL puppet fail
[19:38:13] <icinga-wm>	 PROBLEM - puppet last run on cp3021 is CRITICAL puppet fail
[19:38:14] <icinga-wm>	 PROBLEM - puppet last run on cp3047 is CRITICAL puppet fail
[19:38:14] <icinga-wm>	 PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100%
[19:39:12] <bblack>	 lol
[19:39:20] <bblack>	 and yes
[19:39:46] <bblack>	 try hooft for esams maybe? that's what I use, but it shouldn't be necessary
[19:39:54] <icinga-wm>	 PROBLEM - puppet last run on cp4010 is CRITICAL puppet fail
[19:39:54] <icinga-wm>	 PROBLEM - puppet last run on cp3048 is CRITICAL puppet fail
[19:39:54] <icinga-wm>	 PROBLEM - puppet last run on cp3017 is CRITICAL puppet fail
[19:40:05] <icinga-wm>	 PROBLEM - puppet last run on cp3038 is CRITICAL puppet fail
[19:40:14] <icinga-wm>	 PROBLEM - puppet last run on cp3007 is CRITICAL puppet fail
[19:40:34] <icinga-wm>	 PROBLEM - puppet last run on cp4016 is CRITICAL puppet fail
[19:40:41] <ori>	 cp3030 is down?
[19:41:04] <icinga-wm>	 PROBLEM - puppet last run on cp4011 is CRITICAL puppet fail
[19:41:16] <mutante>	 no, he rebooted it
[19:41:37] <ori>	 mutante: thanks, i missed that log line
[19:41:56] <logmsgbot>	 !log aude Synchronized php-1.26wmf4/extensions/Wikidata: Fix usage tracking issue on Wikidata (duration: 00m 40s)
[19:42:04] <morebots>	 Logged the message, Master
[19:42:13] <icinga-wm>	 RECOVERY - Host cp3030 is UPING OK - Packet loss = 0%, RTA = 88.96 ms
[19:42:53] <grrrit-wm>	 (03PS1) 10Thcipriani: Deployment group for trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/209045 (https://phabricator.wikimedia.org/T97775) 
[19:43:52] <bblack>	 heh "UPING OK" -. bad de-dupe regex on icinga alerts?
[19:44:48] <logmsgbot>	 !log aude Synchronized php-1.26wmf4/extensions/Wikidata: Fix usage tracking issue on Wikidata - with submodule update (duration: 00m 33s)
[19:44:53] <morebots>	 Logged the message, Master
[19:45:20] * aude is done
[19:47:14] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0]
[19:47:14] <icinga-wm>	 RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures
[19:47:14] <icinga-wm>	 RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[19:47:23] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[19:47:24] <icinga-wm>	 RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[19:48:14] <icinga-wm>	 RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[19:48:19] <wikibugs>	 7Puppet, 6operations, 10Beta-Cluster, 5Patch-For-Review: Trebuchet on deployment-bastion: wrong group owner - https://phabricator.wikimedia.org/T97775#1262210 (10thcipriani) Pushed my patch up and attached to this bug.  As I was reviewing this patch, I actually think it may be a better idea to have the dep...
[19:48:44] <icinga-wm>	 RECOVERY - puppet last run on cp4001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:48:44] <icinga-wm>	 RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:48:51] <wikibugs>	 7Puppet, 6operations, 10Beta-Cluster, 5Patch-For-Review: Trebuchet on deployment-bastion: wrong group owner - https://phabricator.wikimedia.org/T97775#1262216 (10thcipriani)
[19:48:54] <icinga-wm>	 RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:48:54] <icinga-wm>	 RECOVERY - puppet last run on cp3005 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[19:49:26] <grrrit-wm>	 (03PS1) 10BBlack: purge intel-microcode, will remove after [puppet] - 10https://gerrit.wikimedia.org/r/209048 
[19:49:33] <icinga-wm>	 RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:49:44] <icinga-wm>	 RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[19:50:04] <icinga-wm>	 RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:50:04] <icinga-wm>	 RECOVERY - puppet last run on cp3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:51:14] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] purge intel-microcode, will remove after [puppet] - 10https://gerrit.wikimedia.org/r/209048 (owner: 10BBlack)
[19:51:34] <icinga-wm>	 RECOVERY - puppet last run on cp3034 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[19:51:53] <icinga-wm>	 RECOVERY - puppet last run on cp3035 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:52:04] <icinga-wm>	 RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:52:10] <twentyafterfour>	 hmm my deploy earlier didn't get logged
[19:52:25] <twentyafterfour>	 where were you, morebots ?
[19:52:59] <greg-g>	 weiird
[19:53:34] <icinga-wm>	 RECOVERY - puppet last run on cp3049 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[19:53:42] <twentyafterfour>	 !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: Group1 wikis to 1.26wmf4 (actual time 18:12 UTC)
[19:53:49] <morebots>	 Logged the message, Master
[19:53:57] <twentyafterfour>	 there we go
[19:54:04] <icinga-wm>	 RECOVERY - puppet last run on cp3039 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[19:54:24] <icinga-wm>	 RECOVERY - puppet last run on cp4020 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[19:54:24] <icinga-wm>	 RECOVERY - puppet last run on cp4012 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures
[19:54:34] <icinga-wm>	 RECOVERY - puppet last run on cp3019 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[19:55:23] <icinga-wm>	 RECOVERY - puppet last run on cp4002 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures
[19:55:43] <icinga-wm>	 RECOVERY - puppet last run on cp4009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:57:23] <icinga-wm>	 RECOVERY - puppet last run on cp3021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:57:23] <icinga-wm>	 RECOVERY - puppet last run on cp3018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:57:23] <icinga-wm>	 RECOVERY - puppet last run on cp3047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:57:25] <icinga-wm>	 RECOVERY - puppet last run on cp4011 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures
[19:57:53] <icinga-wm>	 RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:57:53] <icinga-wm>	 RECOVERY - puppet last run on cp3017 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[19:57:53] <icinga-wm>	 RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[19:58:04] <icinga-wm>	 RECOVERY - puppet last run on cp3038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:58:04] <icinga-wm>	 RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:58:25] <icinga-wm>	 RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:03:20] <wikibugs>	 10Ops-Access-Requests, 6operations: Add niedzielski release-mobile and deployment-prep project - https://phabricator.wikimedia.org/T98179#1262260 (10Krenair)
[20:03:24] <wikibugs>	 10Ops-Access-Requests, 6operations: Add niedzielski release-mobile and deployment-prep project - https://phabricator.wikimedia.org/T98179#1261519 (10Krenair) deployment-prep (labs) is very separate to release-mobile (prod)...
[20:03:39] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile and deployment-prep project - https://phabricator.wikimedia.org/T98179#1262264 (10Krenair)
[20:06:03] <icinga-wm>	 PROBLEM - High load average on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0]
[20:06:22] <grrrit-wm>	 (03PS1) 10Ori.livneh: cpufrequtils: ensure configure governor is in use [puppet] - 10https://gerrit.wikimedia.org/r/209049 
[20:06:24] <ori>	 ^ bblack
[20:08:50] <wikibugs>	 6operations, 10Traffic: Fix cpufrequtils issues on jessie - https://phabricator.wikimedia.org/T98203#1262298 (10BBlack) 3NEW a:3BBlack
[20:09:19] <wikibugs>	 6operations, 10Traffic: Reboot caches for kernel 3.19.3 globally - https://phabricator.wikimedia.org/T96854#1262309 (10BBlack)
[20:09:19] <wikibugs>	 6operations, 10Traffic: Fix cpufrequtils issues on jessie - https://phabricator.wikimedia.org/T98203#1262308 (10BBlack)
[20:10:19] <wikibugs>	 6operations, 7Shinken: Shinken hostname column is not large enough - https://phabricator.wikimedia.org/T1362#1262318 (10Dzahn) Try this:    # install the ([[ https://addons.mozilla.org/en-US/firefox/addon/stylish/ | Stylish ]], extension  [[ https://en.wikipedia.org/wiki/Stylish | about ]])   # install [[ http...
[20:10:33] <grrrit-wm>	 (03PS2) 10Krinkle: Add logmsgbot instance for #wikimedia-releng that listens to gallium [puppet] - 10https://gerrit.wikimedia.org/r/197386 (owner: 10Legoktm)
[20:10:39] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add logmsgbot instance for #wikimedia-releng that listens to gallium [puppet] - 10https://gerrit.wikimedia.org/r/197386 (owner: 10Legoktm)
[20:10:51] <bblack>	 ori: it's more complicated than that at least for the caches, see ticket above
[20:10:58] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1262326 (10Multichill) >>! In T87097#1262136, @Krenair wrote: > I think L2 is linked when they view Phabricator, it's a restricted visibility object.  In the phabricator email, it has a link...
[20:11:14] <bblack>	 I'm pretty much stuck on that now until we get the non-trunk kernel installed + booted.  it's due to land tomorrow or thurs.
[20:11:46] <mutante>	 hasharMeeting: YuviPanda|food: https://phabricator.wikimedia.org/T1362#1262318
[20:12:20] <YuviPanda|food>	 mutante: :D thank you!
[20:12:24] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1262334 (10Krenair) >>! In T87097#1262326, @Multichill wrote: >>>! In T87097#1262136, @Krenair wrote: >> I think L2 is linked when they view Phabricator, it's a restricted visibility object....
[20:12:33] <icinga-wm>	 PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:12:51] <bblack>	 ori: probably after the kernel reboot, I'll just drop the hacky cpufrequtils package there and set it some simpler way.  you can do it with a bash one-liner, after all.
[20:13:16] <ori>	 why drop the package?
[20:13:43] <bblack>	 because its whole purpose in these machines lives is to set the governor, and it can't
[20:14:06] <bblack>	 unless I go hack/fix it
[20:14:13] <icinga-wm>	 RECOVERY - DPKG on labmon1001 is OK: All packages OK
[20:14:23] <grrrit-wm>	 (03PS7) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 
[20:14:45] <bblack>	 but why bother when: "for x in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance >$x; done" will suffice?
[20:16:21] <ori>	 bblack: isn't it installed by default?
[20:16:35] <ori>	 and if so, doesn't that mean that you risk having the service configured to set one policy, and puppet to set another?
[20:16:36] <bblack>	 oh, I don't know, on jessie.  it wasn't before on precise
[20:16:44] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1262348 (10Dzahn) >>! In T87097#1262326, @Multichill wrote:  > In the phabricator email, it has a link to https://phabricator.wikimedia.org/L2, here it's just "L2". Must be very secret, I'm...
[20:16:47] <bblack>	 I remember having to add it to get them all set
[20:17:03] <White_Master>	 wmf4 lauching today?
[20:17:17] <bblack>	 the service in any case would be default-configured to the right/matching policy, but not doing anything because it's functionally broken anyways
[20:18:00] <greg-g>	 White_Master: wmf4 went to non-Wikipedias today (Commons etc). Will hit Wikipedias tomorrow.
[20:18:56] <White_Master>	 greg-g, oh, thanks. :)
[20:20:44] <wikibugs>	 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1262372 (10RobH)
[20:21:05] <wikibugs>	 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10RobH)
[20:23:24] <ori>	 bblack: the tool will fail if the governor is unavailable, whereas writing it into /proc will fail silently
[20:23:28] <greg-g>	 White_Master: fyi: https://www.mediawiki.org/wiki/MediaWiki_1.26/Roadmap#Schedule_for_the_deployments
[20:24:04] <icinga-wm>	 RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[20:25:38] <White_Master>	 greg-g, yes, i look this page. I check 'cause i also update my wiki with those versions :P
[20:26:48] <bblack>	 ori: the tool needs upstream updates or us hacking its initscripts, etc.  either way...
[20:27:14] <bblack>	 postponing until I have a working kernel to even try the alernatives on
[20:27:17] <bblack>	 *alternatives
[20:30:22] <bblack>	 (and we could make the manual method not-silent by having puppet check them, too)
[20:30:41] <bblack>	 but even manually applying "performance" is broken until update->reboot
[20:30:57] <wikibugs>	 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1262387 (10Dzahn) @multichill  please try viewing that document again. after talking with chasemp i added you to the following group:   https://phabricator.wikimedia.org/project/profile/974/
[20:40:10] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Analytics: Access to stat1003 for jdouglas - https://phabricator.wikimedia.org/T98209#1262408 (10Krenair)
[20:51:45] <wikibugs>	 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1262429 (10RobH) a:5RobH>3akosiaris Alex,  I wasn't sure if this needed trusty or jessie, so I put trusty on initially.  I then committed the dhcp file change for jessie, so if trusty was wrong,...
[20:52:01] <wikibugs>	 6operations: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1262431 (10RobH)
[20:53:11] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1262436 (10chasemp) @johnlewis and @dzahn asked me to take look over the dumped DB from a sensitive information perspective.  A few thoughts:  * we should wipe the profile_setting tab...
[20:56:26] <wikibugs>	 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1262455 (10Cmjohnson) Hi Christopher,  This is Regarding the Case Number:4651331170  I have made arrangements to ship a replacement System board along with an onsite engineer.  Part...
[20:58:26] <wikibugs>	 6operations: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1262463 (10yuvipanda) So current puppetmasters are all precise, and on labs everytime we tried a trusty puppetmaster something or the other has blown up. Tread carefully, but it would indeed...
[21:00:04] <jouncebot>	 rmoen, kaldari: Respected human, time to deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150505T2100). Please do the needful.
[21:10:32] <wikibugs>	 6operations: Replace mysql with mariadb on virt1000 (et al) - https://phabricator.wikimedia.org/T84470#1262540 (10Andrew)
[21:18:39] <andrewbogott>	 twentyafterfour or greg-g, today’s train deploy seems to have broken search on wikitech, can you assist?
[21:18:50] <twentyafterfour>	 andrewbogott: ok
[21:19:10] <andrewbogott>	 twentyafterfour: https://dpaste.de/LNE0
[21:19:59] <andrewbogott>	 “Cannot use Hooks as Hooks” <3
[21:20:26] <twentyafterfour>	 weird ..
[21:20:32] <QPanda>	 ori: legoktm ^ more FormatJson fun?
[21:20:37] <QPanda>	 silver is running php and not hhvm
[21:20:57] <QPanda>	 Krenair: ^
[21:22:20] <grrrit-wm>	 (03PS1) 10John F. Lewis: Use Wiki.svg for wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209126 
[21:22:40] <twentyafterfour>	 andrewbogott: I'll poke it
[21:22:52] <andrewbogott>	 twentyafterfour: thank you!  Is the problem obvious?
[21:23:14] <twentyafterfour>	 I need to find that code...
[21:23:26] <twentyafterfour>	 not sure what changed, but it's not entirely obvious
[21:23:58] <andrewbogott>	 I don’t know who the search people are these days.  You?  manybubbles?
[21:24:08] <manybubbles>	 what is up?
[21:24:35] <andrewbogott>	 manybubbles: um… I probably paged you prematurely.  twentyafterfour is working on https://dpaste.de/LNE0 (happening on wikitech right now.)
[21:25:01] <manybubbles>	 andrewbogott: ah. looks like fun versioning issues
[21:25:23] <andrewbogott>	 Well, wikitech should only ever run version n-1
[21:26:28] <twentyafterfour>	 wait so it got updated today to wmf4 and that shouldn't have until tomorrow?
[21:27:06] <andrewbogott>	 twentyafterfour: I thought it lagged behind production by a point, is all.  I could be confused.
[21:27:15] <Krenair>	 It's a group 1 wiki
[21:27:26] <grrrit-wm>	 (03CR) 10Jalexander: [C: 031] "Verifying, WikimaniaWiki wants to update their logo. Ellie asked for my help to get it done quickly (before registration uploads and some " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209126 (owner: 10John F. Lewis)
[21:27:27] <andrewbogott>	 Krenair: doesn’t that mean… what I said?
[21:27:27] <twentyafterfour>	 ok so then it correctly got updated today with group 1
[21:27:32] <Krenair>	 Group 1 wikis receive updates on Tuesdays
[21:28:30] <twentyafterfour>	 andrewbogott: there are 3 groups, group 0 1 and 2 ... group 1 is updated on tuesday, group 2 on wednesday (while group zero gets bleeding edge on wednesday)
[21:28:41] <andrewbogott>	 hm, meanwhile, manybubbles do you know if/how search is monitored on production?  should have an alert for this…
[21:29:06] <andrewbogott>	 Oh, and group 0 doesn’t include production wikipedia, I take it?
[21:29:34] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists: move analytics-internal list to analytics-wmf - https://phabricator.wikimedia.org/T97618#1262675 (10kevinator) 5Open>3declined a:3kevinator In light of difficulty of getting this, I am canceling this task.  Our team can live with keeping the list as it is.
[21:29:37] <manybubbles>	 andrewbogott: we monitor things like slow queries but I forget how we monitor that a search works. we must but I forget it
[21:29:46] <bd808>	 No. group 0 is a very small set of wikis (test, test2, wm.o and wikidatatest)
[21:29:53] <andrewbogott>	 ok, I was confused then.
[21:30:20] <bd808>	 group1 is everything except the wikipedias
[21:30:58] <Krenair>	 andrewbogott, https://www.mediawiki.org/wiki/MediaWiki_1.26/Roadmap is very clear
[21:31:03] <andrewbogott>	 OK, so wikitech may be showing breakage that’s on deck for wikipedia.
[21:31:10] <Krenair>	 you should probably bookmark it
[21:32:04] <twentyafterfour>	 andrewbogott: yeah ... so we need to fix.
[21:32:30] <twentyafterfour>	 Krenair: no matter how clear it is, it's still slightly confusing because of the way they overlap
[21:35:05] <grrrit-wm>	 (03CR) 10coren: [C: 032] "Trivial package addition." [puppet] - 10https://gerrit.wikimedia.org/r/209038 (https://phabricator.wikimedia.org/T98195) (owner: 10coren)
[21:36:16] <twentyafterfour>	 so ... there is already something in the CirrusSearch namespace named Hooks?
[21:36:21] <twentyafterfour>	 somehow
[21:39:19] <twentyafterfour>	 looks like maybe I164ad2dbcf8008b551288cab4c90bcbd0df33024
[21:40:14] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3555 MB (9% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 382635 MB (26% inode=99%)
[21:41:26] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile and deployment-prep project - https://phabricator.wikimedia.org/T98179#1262691 (10Dzahn) yea, but it got already separated from yet another thing: T97866, so that's good
[21:45:13] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3107 MB (8% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 381630 MB (26% inode=99%)
[21:45:49] <grrrit-wm>	 (03PS1) 10Ori.livneh: Remove wmgUseBits setting, now that the migration is complete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209130 
[21:50:03] <icinga-wm>	 PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[21:50:14] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 2678 MB (7% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 380722 MB (26% inode=99%)
[21:52:34] <grrrit-wm>	 (03PS1) 10coren: Add tbayer to researchers [puppet] - 10https://gerrit.wikimedia.org/r/209131 (https://phabricator.wikimedia.org/T97916) 
[21:53:00] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile and deployment-prep project - https://phabricator.wikimedia.org/T98179#1262713 (10JohnLewis) I've added "niedzielski" to deployment-prep as a member.
[21:53:56] <twentyafterfour>	 Someone with more familiarity with cirrussearch want to take a look? I can't find any conflicting names in the cirrussearch code, so "import \Hooks" should be completely fine?
[21:54:15] <rmoen>	 greg-g: Might go over the window a little today.  Any objections? 
[21:54:29] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile - https://phabricator.wikimedia.org/T98179#1262714 (10coren)
[21:54:41] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile - https://phabricator.wikimedia.org/T98179#1261519 (10coren) I've retitled the task accordingly.
[21:55:09] <twentyafterfour>	 oh,  no I'm wrong. .
[21:55:11] <greg-g>	 rmoen: the only issue is if anyone else is waiting, but I don't think so (no need to specifically ping me about that)
[21:55:13] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 2263 MB (6% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 379674 MB (26% inode=99%)
[21:55:26] <rmoen>	 greg-g: ok
[21:55:35] <legoktm>	 twentyafterfour: what's up?
[21:55:43] <twentyafterfour>	 there is a global Hooks and a CirrusSearch namespaced \Hooks
[21:55:52] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile - https://phabricator.wikimedia.org/T98179#1262717 (10coren) p:5Triage>3Normal
[21:56:03] <twentyafterfour>	 legoktm: fatal error on wikitech
[21:56:08] <legoktm>	 traceback?
[21:56:10] <twentyafterfour>	 https://dpaste.de/LNE0#L2
[21:56:18] <twentyafterfour>	 (pasted by andrewbogott )
[21:56:23] <icinga-wm>	 RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60633 bytes in 0.766 second response time
[21:57:00] <wikibugs>	 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski releasers-mobile in production and deployment-prep in labs - https://phabricator.wikimedia.org/T98179#1262720 (10Krenair)
[21:57:10] <twentyafterfour>	 I think https://gerrit.wikimedia.org/r/#/c/207020/ is the culprit
[21:57:28] <legoktm>	 yeah that's not going to work
[21:57:51] <twentyafterfour>	 global Hooks class collides with CirrusSearch/includes/Hooks.php class Hooks
[21:58:04] <twentyafterfour>	 so that needs to alias \Hooks to GlobalHooks  or something
[22:00:14] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1866 MB (5% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 378562 MB (26% inode=99%)
[22:01:13] <twentyafterfour>	 andrewbogott: ok, that was it.
[22:02:00] <andrewbogott>	 twentyafterfour: cool.  Not too late for swat, is it?
[22:02:39] <twentyafterfour>	 I don't know, rmoen are you still swatting?
[22:03:01] <rmoen>	 twentyafterfour: Yes.  Need a few more minutes
[22:03:48] <twentyafterfour>	 rmoen: we've got another patch that needs to go out, you wanna deploy it? https://gerrit.wikimedia.org/r/#/c/209135/
[22:04:15] <JohnLewis>	 twentyafterfour: hm, swat?
[22:04:31] <twentyafterfour>	 ?
[22:04:41] <JohnLewis>	 thought that was in an hour? (unless I'm getting mixed up)
[22:04:46] <twentyafterfour>	 oh
[22:04:52] <twentyafterfour>	 I'm probably the one mixed up
[22:05:14] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1509 MB (4% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 377274 MB (26% inode=99%)
[22:05:20] <rmoen>	 twentyafterfour: Yeah I could do that.  Unless swat is in an hour?  
[22:05:22] <JohnLewis>	 if people want to swat an hour early, I don't mind :) I have a patch in it :p
[22:05:30] <rmoen>	 hah
[22:05:46] <andrewbogott>	 twentyafterfour: I have to run now — thanks for sorting things!
[22:05:56] <twentyafterfour>	 andrewbogott: no problem
[22:06:00] <twentyafterfour>	 thanks for catching it
[22:10:14] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1189 MB (3% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 375734 MB (25% inode=99%)
[22:15:14] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 853 MB (2% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 374168 MB (25% inode=99%)
[22:18:03] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Give Neil Quinn access to stats1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97746#1262789 (10Neil_P._Quinn_WMF) @coren, I've signed the document. My public key is:  ``` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDA6j1BKp5iz7VLQ...
[22:19:12] <spagewmf_>	 mutante: who can +2 the dev.wikimedia.org redirect you +1d, https://gerrit.wikimedia.org/r/#/c/199182/ ?
[22:19:57] <jgage>	 ^^ i can't ssh into lutetium
[22:20:14] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 384 MB (1% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 372795 MB (25% inode=99%)
[22:20:29] <grrrit-wm>	 (03CR) 10Jforrester: "Is this now good to go?" [puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: 10GWicke)
[22:20:52] <mutante>	 spagewmf_: any ops who is willing to deploy apache 
[22:20:58] <mutante>	 jgage: i think that's because it's fundraising
[22:21:13] <mutante>	 jgage: lemme try
[22:21:16] <jgage>	 arr i always forget whcih hosts are
[22:21:21] <jgage>	 wish they were in their own subdomain
[22:22:54] <mutante>	 jgage: yes, that's it. gotta bastion via tellurium
[22:23:45] <jgage>	 who is kbrownell?
[22:23:48] <mutante>	 !log apt-get clean on lutetium to free disk space
[22:23:56] <morebots>	 Logged the message, Master
[22:24:06] <mutante>	 jgage: ^ 1.3G free now
[22:24:07] <grrrit-wm>	 (03PS4) 10Spage: Redirect dev.wikimedia.org URLs [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) 
[22:24:16] <jgage>	 thanks mutante
[22:24:54] <mutante>	 jgage: dunno, but somebody who works on FR's civicrm apparently
[22:25:10] <jgage>	 no match on staff & contractors page
[22:25:13] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1227 MB (3% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 371355 MB (25% inode=99%)
[22:26:41] <spagewmf_>	 hello "any ops who is willing to deploy apache", can you +2 https://gerrit.wikimedia.org/r/#/c/199182/
[22:27:00] <mutante>	 jgage: that's a good point. it should really be mentioned somewhere. the user name _does_ show up in phabricator but only in comments it seems
[22:27:32] <mutante>	 jgage: found it. kbrownell works for Giant Rabbit  https://phabricator.wikimedia.org/T83469#914434
[22:27:46] <jgage>	 cool
[22:28:01] <mutante>	 frack-puppet:manifests/accounts_and_groups.pp 
[22:28:44] <mutante>	 https://www.giantrabbit.com/client-list
[22:28:49] <mutante>	 ^ lists Wikimedia
[22:29:53] <_joe_>	 mutante: maybe that's true?
[22:29:59] <grrrit-wm>	 (03PS5) 10GWicke: Add commons to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) 
[22:30:02] <grrrit-wm>	 (03CR) 10GWicke: "Odd, must have forgotten to push the full change. Fixed now." [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) (owner: 10GWicke)
[22:30:02] <grrrit-wm>	 (03CR) 10GWicke: "Lets do #208193 first." [puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: 10GWicke)
[22:30:02] <grrrit-wm>	 (03PS4) 10Yuvipanda: tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748) 
[22:30:03] <grrrit-wm>	 (03PS5) 10Yuvipanda: tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748) 
[22:30:04] <_joe_>	 they list civicvrm
[22:30:13] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1150 MB (3% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 369879 MB (25% inode=99%)
[22:30:25] <spagewmf_>	 mutante: is there a window wherein ops deploys apache changes? dev.wikimedia.org is not critical, I can wait for other apache changes
[22:30:50] <Deskana>	 Just like James_F, I totally didn't write any code at all.
[22:30:57] * James_F grins.
[22:31:08] <Deskana>	 Unlike James_F, I wrote this in the wrong channel.
[22:31:14] <James_F>	 Indeed.
[22:31:41] <mutante>	 spagewmf_: no, i'm afraid not. i suggested there should be a SWAT or something because of this
[22:35:14] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 987 MB (2% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 368644 MB (25% inode=99%)
[22:40:13] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 804 MB (2% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366939 MB (25% inode=99%)
[22:44:14] <icinga-wm>	 PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[22:45:13] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 699 MB (1% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366540 MB (25% inode=99%)
[22:46:16] <mutante>	 jgage: there's an almost 10G slow query log there :/
[22:46:36] <yuvipanda>	 Move it To NFS! (LabsSolution(tm))
[22:46:49] <mutante>	 jgage: is that why you asked for that user?
[22:47:15] <mutante>	 yuvipanda: ok. project "dispenser", instance "osm-tile-server-01"
[22:47:25] <icinga-wm>	 RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0]
[22:47:53] <icinga-wm>	 PROBLEM - puppet last run on mw1242 is CRITICAL puppet fail
[22:50:14] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 699 MB (1% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366540 MB (25% inode=99%)
[22:52:06] <logmsgbot>	 !log rmoen Synchronized php-1.26wmf4/extensions/MobileFrontend/: Update MobileFrontend (duration: 00m 39s)
[22:52:17] <morebots>	 Logged the message, Master
[22:52:20] <mutante>	 !log gzip lutetium-slow.log on lutetium to save disk space
[22:52:25] <morebots>	 Logged the message, Master
[22:52:44] <logmsgbot>	 !log rmoen Synchronized php-1.26wmf4/extensions/Gather/: Update Gather to master (duration: 00m 25s)
[22:52:49] <morebots>	 Logged the message, Master
[22:53:23] <logmsgbot>	 !log rmoen Synchronized php-1.26wmf3/extensions/MobileFrontend/: Update MobileFrontend (duration: 00m 31s)
[22:53:28] <morebots>	 Logged the message, Master
[22:54:10] <logmsgbot>	 !log rmoen Synchronized php-1.26wmf3/extensions/Gather/: Update Gather to master (duration: 00m 36s)
[22:54:15] <morebots>	 Logged the message, Master
[22:55:13] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366540 MB (25% inode=99%)
[22:55:30] <rmoen>	 Turns out i need to run scap since I cannot update the i18n with sync-l10nupdate-1 anymore ? 
[22:56:08] <bd808>	 yeah scap is the way to update the prod l10n caches
[22:56:32] <rmoen>	 ;/
[22:56:40] <bd808>	 the l10n part is the bulk of the time in a full scap so it doesn't really cost you much
[22:57:14] <bd808>	 we really should add an option to only update a given branch at some point
[22:57:16] <wikibugs>	 6operations, 10OpenStreetMap, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1262962 (10RobH) @yurik,  I don't see any kind of task history linked into this; has there been an operations team member wo...
[22:57:44] <bd808>	 I just scapped yesterday afternoon though so it shouldn't be too bad. Probably ~25 minutes
[22:57:48] <rmoen>	 ok
[22:58:04] <logmsgbot>	 !log rmoen Started scap: Updates for Gather and MobileFrontend
[22:58:11] <morebots>	 Logged the message, Master
[22:59:05] <grrrit-wm>	 (03PS1) 10Ori.livneh: update mod_expires config for static/ [puppet] - 10https://gerrit.wikimedia.org/r/209145 
[22:59:30] <grrrit-wm>	 (03PS2) 10Ori.livneh: update mod_expires config for static/ [puppet] - 10https://gerrit.wikimedia.org/r/209145 
[22:59:38] <rmoen>	 bd808: seems like section https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Alternative_to_scap needs updated .. since no more sync-l10nupdate-1 should this section be removed entirely  ? 
[22:59:46] <greg-g>	 rmoen: eek, just fyi, starting scap 2 minutes before SWAT isn't the best, I didn't htink you'd go over an hour
[23:00:05] <jouncebot>	 RoanKattouw, ^d, bd808, James_F, legoktm, JohnLewis, twentyafterfour: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150505T2300). Please do the needful.
[23:00:10] <rmoen>	 greg-g: I knows sorry. I had to though we have like 10 new messages
[23:00:11] <legoktm>	 o/
[23:00:14] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366540 MB (25% inode=99%)
[23:00:14] * greg-g nods
[23:00:19] <greg-g>	 rmoen: s'ok
[23:00:21] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] "tested on mw1041" [puppet] - 10https://gerrit.wikimedia.org/r/209145 (owner: 10Ori.livneh)
[23:00:27] <wikibugs>	 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1262969 (10yuvipanda) p:5Low>3Normal
[23:00:51] <rmoen>	 looks like scap is  going quick
[23:01:02] * JohnFLewis peers in
[23:01:04] <greg-g>	 I think I can I think I can
[23:01:15] <RoanKattouw>	 haha
[23:01:44] <wikibugs>	 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1207452 (10yuvipanda) So right now, we have five shelves of disks, and  ```/dev/mapper/store-now    40T   11T   30T  27% /srv/project```  So about 72% free. What's preventing us from moving to RA...
[23:01:48] <RoanKattouw>	 rmoen: Are you still scapping?
[23:01:49] <bd808>	 rmoen: [[How_to_deploy_code#Alternative_to_scap]] is no more. Thanks for pointing that out
[23:02:14] <bd808>	 RoanKattouw: only 3 minutes in
[23:02:15] <rmoen>	 https://www.youtube.com/watch?v=qNVU23knqZw
[23:03:57] <grrrit-wm>	 (03CR) 10Ori.livneh: "What on earth is "sed "s/0"$'\b'"INFINITY/INFINITY/g""?" [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt)
[23:04:58] <rmoen>	 RoanKattouw: sorry yes. .syncing apaches now
[23:05:03] <HaeB>	 \\\\
[23:05:13] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366539 MB (25% inode=99%)
[23:05:31] <mutante>	 syncing apaches is still a thing?
[23:05:39] <mutante>	 you mean mw though, right
[23:05:43] <ori>	 yes
[23:05:48] <mutante>	 ok
[23:05:54] <icinga-wm>	 RECOVERY - puppet last run on mw1242 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[23:07:42] <grrrit-wm>	 (03PS1) 10Dereckson: Enable NewUserMessage on bh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209146 (https://phabricator.wikimedia.org/T97920) 
[23:10:13] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366539 MB (25% inode=99%)
[23:11:30] <kaldari>	 How are wikis defined as private, fishbowl, closed, small, medium, large, etc. wikis (for the purposes of configuration)?
[23:11:36] <wikibugs>	 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1129802 (10RobH) Ordered, ETA 2015-05-18.
[23:11:48] <kaldari>	 mutante: ^
[23:11:59] <James_F>	 kaldari: .dblist entries?
[23:12:25] <kaldari>	 oh
[23:12:48] <James_F>	 kaldari: Some of them are magically computed, but most are just plain old .dblist rows.
[23:13:22] <mutante>	 what James said, files called .dblist in mediawiki-config
[23:13:29] <kaldari>	 James_F: I see now. Thanks!
[23:13:35] <James_F>	 E.g. don't look at group1.dblist unless you want to cry.
[23:13:35] <ori>	 James_F: ;)
[23:13:47] * James_F grins at ori, Disruptor of Worlds™
[23:14:25] <ori>	 James_F: why cry? an expression is surely better than an unspecified understanding that isn't encoded in software at all
[23:14:57] <RoanKattouw>	 ori: Whoa programmatic dblists, nice!
[23:15:07] <RoanKattouw>	 ori: So can we support comments in dblists now? :D
[23:15:15] * RoanKattouw has wanted that since 2009 or so
[23:15:15] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366539 MB (25% inode=99%)
[23:15:40] <legoktm>	 https://noc.wikimedia.org/conf/highlight.php?file=group1.dblist :(
[23:15:55] <James_F>	 legoktm: Forgot the symlink, probably.
[23:16:07] <ori>	 RoanKattouw: no, but it's trivial to add now, since all dblist file-loading is done in one place
[23:16:31] <ori>	 James_F: where should the symlink go?
[23:16:53] <James_F>	 ori: Reminding myself right now.
[23:17:03] <legoktm>	 there's a script to create them
[23:17:19] <James_F>	 Yeah.
[23:19:04] <bd808>	 ori: in docroot/noc/conf
[23:19:28] <grrrit-wm>	 (03CR) 10Tim Landscheidt: "@valhallasw: Moving into init.pp is okay; will do that." [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt)
[23:19:42] <James_F>	 ori: https://github.com/wikimedia/operations-mediawiki-config/blob/master/docroot/noc/createTxtFileSymlinks.sh
[23:19:49] <ori>	 yes i was just staring at that
[23:20:13] <icinga-wm>	 PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366539 MB (25% inode=99%)
[23:22:49] <grrrit-wm>	 (03CR) 10Spage: "This is OK to deploy but note the destination URL is likely to change later in April to a labs instance, and probably again in June to a n" [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage)
[23:23:00] <grrrit-wm>	 (03CR) 10Ori.livneh: "@Tim: That's amazing, hah. Could you add that link to a comment in that file?" [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt)
[23:23:07] <mutante>	 !log deleted 8G recurring_blocked.tsv from lutetium
[23:23:23] <morebots>	 Logged the message, Master
[23:23:42] <grrrit-wm>	 (03PS2) 10Tim Landscheidt: gridengine: Puppetize gridengine-mailer [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) 
[23:24:18] <grrrit-wm>	 (03CR) 10Ori.livneh: "(hosting a .wikimedia.org URL on labs is a non-starter, IMO, because of cross-origin security issues.)" [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage)
[23:24:51] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 031] gridengine: Puppetize gridengine-mailer [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt)
[23:26:21] <JohnFLewis>	 Still waiting for scap to finish? :/
[23:26:57] <grrrit-wm>	 (03PS3) 10coren: access: Remove Erik Moeller's Production Shell Access [puppet] - 10https://gerrit.wikimedia.org/r/208566 (owner: 10Matanya)
[23:27:44] <grrrit-wm>	 (03CR) 10coren: [C: 032] "Thanks for all the dedication and passion over the years, Erik. You're always welcome to get suckered into helping us again as a voluntee" [puppet] - 10https://gerrit.wikimedia.org/r/208566 (owner: 10Matanya)
[23:28:14] <rmoen>	 JohnFLewis: sync-common:  99% (ok: 464; fail: 0; left: 1)
[23:28:28] <greg-g>	 rmoen: how long has it been like that?
[23:28:39] <rmoen>	 a while
[23:28:49] * bd808 looks to see which is hung
[23:28:56] <JohnFLewis>	 greg-g: 30 minutes
[23:29:04] <greg-g>	 ....
[23:29:23] <bd808>	 grrr.. my old enemy snapshot1004.eqiad.wmnet
[23:29:33] <wikibugs>	 6operations, 5Patch-For-Review: Remove Erik Moeller's Production Shell Access - https://phabricator.wikimedia.org/T97864#1263085 (10coren) 5Open>3Resolved a:3coren After a chat with Erik, he has no intention to use his access as a volunteer in the short term and so agrees that it's wiser to turn it off f...
[23:29:40] <grrrit-wm>	 (03PS3) 10Yuvipanda: gridengine: Puppetize gridengine-mailer [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt)
[23:29:52] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] "Thanks Tim!" [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt)
[23:29:57] <bd808>	 rmoen: open a second ssh session to tin and kill the ssh process you own there that is connecting to snapshot1004.eqiad.wmnet
[23:30:05] <rmoen>	 ok
[23:30:05] <bd808>	 that will unstick the scap
[23:30:20] <bd808>	 !log snapshot1004.eqiad.wmnet hanging scap yet again
[23:30:26] <grrrit-wm>	 (03CR) 10Eevans: [C: 031] Add commons to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) (owner: 10GWicke)
[23:30:27] <morebots>	 Logged the message, Master
[23:31:14] <rmoen>	 bd808: ok scap-rebuild-cdbs now
[23:31:17] <rmoen>	 thanks
[23:31:31] <bd808>	 yw. I should have been paying closer attention
[23:32:06] <bd808>	 that snapshot1004.eqiad.wmnet box has been completely overloaded on a regular basis for more than a week
[23:33:22] <bd808>	 !log running sync-common on snapshot1004.eqiad.wmnet manually after it was aborted in scap by rmoen
[23:33:29] <morebots>	 Logged the message, Master
[23:39:16] <logmsgbot>	 !log rmoen Finished scap: Updates for Gather and MobileFrontend (duration: 41m 11s)
[23:39:23] <morebots>	 Logged the message, Master
[23:39:24] * rmoen claps
[23:39:32] <greg-g>	 rmoen: next time wait for SWAT if it's that close :)
[23:39:35] <JohnFLewis>	 \o/
[23:39:45] <rmoen>	 greg-g: ok.  my apologies 
[23:39:59] <greg-g>	 rmoen: or, have your team prep the patches before the deploy (just saw your email) and, well, test on Beta Cluster since it seems like that was possible with what you described
[23:40:41] <greg-g>	 wasn't*
[23:41:02] <rmoen>	 greg-g: yeah its all on me.  I should have prepared prior.  I'm sorta on light duty right now because of RSI issues ;/ so i take full blame on being unprepared today
[23:41:39] <greg-g>	 s'alright, but I'm still worried about your team deploying code to enwiki that hasn't been tested (in the same states/checkout points) on beta cluster
[23:41:57] <greg-g>	 s/worried/annoyed and disheartened
[23:42:33] <bd808>	 someday we will fix the pipeline so that's not even an option. someday
[23:42:53] <rmoen>	 I agree.  Checking out all the branches locally.  But we should definitely be testing on beta cluster.  I think we need to enable for test2? 
[23:42:56] <JohnFLewis>	 So who's the <s>un</s>lucky swatter today?
[23:43:11] <greg-g>	 yeah, so for now we're understaffed to do that and people work around it and take more risks than they should
[23:43:12] <RoanKattouw>	 Me
[23:43:31] <RoanKattouw>	 Oh look at that the scap is done
[23:43:35] <RoanKattouw>	 rmoen: Are you done now?
[23:43:37] <rmoen>	 yes
[23:43:47] <JohnFLewis>	 RoanKattouw: mind if I ask for you to deploy my patch first if possible?
[23:43:55] <RoanKattouw>	 Sure
[23:44:47] <hoo>	 springle: thanks for the new s5 R710 :D
[23:45:29] <grrrit-wm>	 (03CR) 10Catrope: [C: 032] Use Wiki.svg for wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209126 (owner: 10John F. Lewis)
[23:45:35] <grrrit-wm>	 (03Merged) 10jenkins-bot: Use Wiki.svg for wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209126 (owner: 10John F. Lewis)
[23:46:58] <hoo>	 The difference is well noticeable
[23:47:36] <jgage>	 !log switched hadoop active namenode from analytics1001 to analytics1002 for rack C4 switch replacement tomorrow morning (T93730)
[23:47:44] <morebots>	 Logged the message, Master
[23:49:15] <logmsgbot>	 !log catrope Synchronized wmf-config/InitialiseSettings.php: Use Wiki.svg for wikimania2015wiki logo (duration: 00m 19s)
[23:49:20] <morebots>	 Logged the message, Master
[23:49:23] <RoanKattouw>	 JohnFLewis: Done ---^^
[23:49:51] <JohnFLewis>	 RoanKattouw: and confirmed no difference. Thanks :)
[23:52:19] <springle>	 hoo: yw
[23:57:21] <bd808>	 !log aborted and restarted sync-common on snapshot1004.eqiad.wmnet manually after waiting 24 minutes with no progress
[23:57:29] <morebots>	 Logged the message, Master
[23:58:06] <bd808>	 anybody know why snapshot1004 gets so IO bound?
[23:58:33] <wikibugs>	 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1263282 (10yuvipanda)