[00:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150505T0000). Please do the needful.
[00:01:03] (03PS1) 10Negative24: phabricator: Remove obsolete configs [puppet] - 10https://gerrit.wikimedia.org/r/208848
[00:08:41] (03PS1) 10Yuvipanda: zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849
[00:08:44] ottomata: ^ :)
[00:10:37] (03CR) 10Yuvipanda: "Note that I'm experimenting with mesos and other frameworks that use Zookeeper internally and hence am interested in making this generic :" [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda)
[00:14:51] * aude waves
[00:15:24] running scripts and then shall deploy
[00:17:52] aude: Good luck... I'm heading to bed
[00:20:09] hoo: good night
[00:20:14] scripts are done
[00:20:30] ok :)
[00:22:01] (03CR) 10Aude: [C: 032] Enable use of subscription tracking on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208675 (owner: 10Aude)
[00:22:10] (03Merged) 10jenkins-bot: Enable use of subscription tracking on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208675 (owner: 10Aude)
[00:24:08] !log aude Synchronized wmf-config/Wikibase.php: Enable Wikibase subscription tracking (duration: 00m 12s)
[00:24:18] Logged the message, Master
[00:24:33] (03PS1) 10Yuvipanda: tools: Make checkers submit hosts as well [puppet] - 10https://gerrit.wikimedia.org/r/208853
[00:24:46] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Make checkers submit hosts as well [puppet] - 10https://gerrit.wikimedia.org/r/208853 (owner: 10Yuvipanda)
[00:25:20] * aude is done, assuming no problems
[00:27:30] PROBLEM - Apache HTTP on mw1197 is CRITICAL - Socket timeout after 10 seconds
[00:29:09] PROBLEM - HHVM rendering on mw1197 is CRITICAL - Socket timeout after 10 seconds
[00:30:03] :D
[00:31:09] PROBLEM - HHVM queue size on mw1197 is CRITICAL 75.00% of data above the critical threshold [80.0]
[00:31:30] PROBLEM - HHVM busy threads on mw1197 is CRITICAL 77.78% of data above the critical threshold [115.2]
[00:31:50] uh oh
[00:31:52] * yuvipanda restarts hhvm
[00:32:18] !log restarted hhvm on mw1197
[00:32:26] Logged the message, Master
[00:33:49] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 66635 bytes in 1.304 second response time
[00:33:50] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.145 second response time
[00:38:13] 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1259534 (10hashar) I was really just wondering about pre existing usage of ZeroKeeper. @joe promptly addressed it at T95656#1220342 :-] Welcome etcd!
[00:39:10] RECOVERY - HHVM queue size on mw1197 is OK Less than 30.00% above the threshold [10.0]
[00:39:40] RECOVERY - HHVM busy threads on mw1197 is OK Less than 30.00% above the threshold [76.8]
[00:44:49] (03PS1) 10Yuvipanda: tools: Make checker class inherit toollabs base class [puppet] - 10https://gerrit.wikimedia.org/r/208862
[00:45:10] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Make checker class inherit toollabs base class [puppet] - 10https://gerrit.wikimedia.org/r/208862 (owner: 10Yuvipanda)
[01:02:02] (03PS1) 10Yuvipanda: tools: Require uwsgi packages for checker [puppet] - 10https://gerrit.wikimedia.org/r/208867
[01:02:23] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Require uwsgi packages for checker [puppet] - 10https://gerrit.wikimedia.org/r/208867 (owner: 10Yuvipanda)
[01:11:14] (03PS1) 10Springle: repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208869
[01:12:23] (03PS2) 10Springle: repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208869
[01:13:11] (03CR) 10Springle: [C: 032] repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208869 (owner: 10Springle)
[01:19:06] (03Merged) 10jenkins-bot: repool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208869 (owner: 10Springle)
[01:20:13] !log springle Synchronized wmf-config/db-eqiad.php: repool db1070, warm up (duration: 00m 19s)
[01:20:22] Logged the message, Master
[01:21:45] twentyafterfour: just so its on your radar, https://gerrit.wikimedia.org/r/#/c/208848/. we may have to implement the alternative for security.allow-outbound-http before wednesday's deployment if that option is a must
[01:31:25] 6operations, 6Labs, 10Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1259659 (10coren) I don't think that's likely to be possible in the general case; we might be able - at some cost - to gather a list of files that were written around the right timeframe but, unless we know what w...
[01:36:09] * Fiona looks up wmgUseBits.
[01:37:38] (03PS1) 10Springle: depool db1021, move s5 api to db1049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208876
[01:38:49] (03CR) 10Springle: [C: 032] depool db1021, move s5 api to db1049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208876 (owner: 10Springle)
[01:38:54] (03Merged) 10jenkins-bot: depool db1021, move s5 api to db1049 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208876 (owner: 10Springle)
[01:41:34] !log springle Synchronized wmf-config/db-eqiad.php: depool db1021, move s5 api to db1049 (duration: 00m 15s)
[01:41:41] Logged the message, Master
[01:44:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:47:29] ori: btw, <3 require_package
[01:47:35] :)
[01:47:53] thanks, it was very painful to get right, as bd808 can attest
[01:48:10] heh
[01:48:24] I like completely ignoring the hard problems and then suddenly being able to benefit from them being solved :D
[01:49:11] it wasn't an intellectually deep problem, just a super-annoying one
[01:49:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0]
[01:49:32] ori, know anything about bits being broken in beta?
[01:49:36] ori: super annoying ones are the worst
[01:49:51] Krenair: no, but I saw there was some phab task
[01:50:02] Let me take a look
[01:50:39] PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail
[01:52:19] (03PS1) 10Ori.livneh: Default wmgUseBits to `false` on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208878
[01:52:29] (03CR) 10Ori.livneh: [C: 032] Default wmgUseBits to `false` on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208878 (owner: 10Ori.livneh)
[01:52:34] (03Merged) 10jenkins-bot: Default wmgUseBits to `false` on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208878 (owner: 10Ori.livneh)
[01:53:30] Fiona: the config var is temporary
[01:53:37] I've heard that one before.
[01:53:53] I should be able to get rid of it tomorrow
[01:54:14] Krenair: better?
[01:54:35] I just read https://phabricator.wikimedia.org/T95448
[01:55:00] I had to update my clone of operations-mediawiki-config. It was a whole to-do.
[01:55:25] Max still made us it put it there: http://en.wikipedia.org/w/load.php
[01:55:30] but no bits are involved.
[01:55:35] I remember when Domas created bits.
[01:55:42] https:// *
[01:55:53] http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page still seems to load stuff from bits
[01:55:54] Missing a trailing newline!!!
[01:56:05] Krenair: refresh or log in
[01:56:12] I did refresh
[01:56:21] well, cache-bust
[01:56:24] ?q=123109283
[01:56:26] or something
[01:56:44] Krenair: Now works fine for me in incognito and logged-in, BTW.
[01:56:45] hmph. that worked
[01:56:54] maybe beta was just being slow updating
[01:56:55] ori: I saw https://news.ycombinator.com/item?id=9484757 today and thought of you.
[01:57:28] Fiona: cute
[01:57:32] what was the extent of the breakage, James_F?
[01:58:12] > Everybody knows PHP is a trickly-typed language. Read the docs people or PHP will take advantage of your gullible ass.
[01:58:14] ori: Beta beta was 400-ing.
[02:01:43] (03PS1) 10Yuvipanda: tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880
[02:02:29] (03CR) 10jenkins-bot: [V: 04-1] tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (owner: 10Yuvipanda)
[02:08:39] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:26:48] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 08m 20s)
[02:27:00] Logged the message, Master
[02:31:48] !log LocalisationUpdate completed (1.26wmf3) at 2015-05-05 02:30:45+00:00
[02:31:56] Logged the message, Master
[02:38:35] 6operations, 10Wikimedia-Site-requests: refreshLinks.php --dfn-only cron jobs do not seem to be running - https://phabricator.wikimedia.org/T97926#1259688 (10PleaseStand) Today, the script did run on dewiki (in s5); see the query results linked from T98110. I think the differing file permissions did prevent t...
[02:40:09] I may be way off base here, but: re T97926 ^
[02:40:41] you know, we did wipe out a bunch of refreshLinks jobs from the jobrunner queue directly on Sunday
[02:41:05] for enwiki, commonswiki, dewiki, as part of trying to fix the outage issue that day
[02:41:19] Krenair: ^
[02:41:59] that job doesn't run on enwiki AFAIK
[02:42:09] well
[02:42:15] dewiki?
[02:42:31] I think it's that particular --dfn-only thing that doesn't run on enwiki, or something
[02:42:45] it apparently got run on dewiki
[02:43:00] according to PleaseStand's comment
[02:43:12] right, it would have run at the next opportunity, and technical did run May 3, but got the axe in the job queue during debugging
[02:43:42] just seems like not-a-coincidence that we killed off refreshlinks jobs on May 3, and someone files a ticket on May 3 about their refreshlinks jobs not having run
[02:43:57] Anyway this bug is only concerned about s2/s3 (enwiki is s1, commonswiki is s4 and dewiki is s5)
[02:44:04] hmmmm ok
[02:44:10] The problem was going back to June last year
[02:54:18] !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 07m 06s)
[02:54:31] Logged the message, Master
[02:58:58] !log LocalisationUpdate completed (1.26wmf4) at 2015-05-05 02:57:54+00:00
[02:59:05] Logged the message, Master
[02:59:15] hmm
[02:59:25] why do we still have a pmtpa reference in wikitech.php?
[02:59:33] $wgOpenStackManagerProxyGateways = array('pmtpa' => '208.80.153.214', 'eqiad' => '208.80.155.156');
[03:01:22] krenair@silver:~$ mwscript eval.php labswiki
[03:01:27] Notice: Undefined index: SERVER_NAME in /srv/mediawiki/wmf-config/CommonSettings.php on line 206
[03:01:27] mkdir: cannot create directory '/sys/fs/cgroup/memory/mediawiki/job/16043': Permission denied
[03:01:27] limit.sh: failed to create the cgroup.
[03:01:29] sigh
[03:02:36] cgroups, really?
[03:02:58] that's going to be fun when it comes time to move whatever that is to jessie...
[03:04:27] (because everything in jessie runs underneath systems, which puts everything in its own set of cgroups. I don't think said things (as in your shell, that script, that mkdir) can then escape that cgroup easily to create a separate root-level cgroup in any sane way)
[03:04:33] s/systems/systemd/
[03:06:17] It's https://phabricator.wikimedia.org/T92712 again (still?)
[03:06:31] but the SERVER_NAME thing as well now
[04:23:26] (03PS2) 10Yuvipanda: tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748)
[04:24:06] (03CR) 10jenkins-bot: [V: 04-1] tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748) (owner: 10Yuvipanda)
[04:38:08] (03PS3) 10Yuvipanda: tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748)
[04:39:16] (03CR) 10jenkins-bot: [V: 04-1] tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748) (owner: 10Yuvipanda)
[04:43:52] !log tstarling Synchronized php-1.26wmf3/extensions/SecurePoll/cli/wm-scripts/bv2015/voterList.php: (no message) (duration: 00m 19s)
[04:43:58] Logged the message, Master
[05:07:33] !log tstarling Synchronized php-1.26wmf3/extensions/SecurePoll/cli/wm-scripts/bv2015/voterList.php: (no message) (duration: 00m 16s)
[05:07:39] Logged the message, Master
[05:51:04] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue May 5 05:50:01 UTC 2015 (duration 50m 0s)
[05:51:10] Logged the message, Master
[06:29:59] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures
[06:29:59] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures
[06:30:20] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures
[06:30:50] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures
[06:31:00] PROBLEM - puppet last run on elastic1030 is CRITICAL Puppet has 1 failures
[06:31:00] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:31:10] PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 2 failures
[06:31:20] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures
[06:32:29] PROBLEM - puppet last run on mw1144 is CRITICAL Puppet has 2 failures
[06:35:10] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 3 failures
[06:46:10] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:46:40] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:40] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[06:47:09] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:47:09] RECOVERY - puppet last run on mw1144 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:21] RECOVERY - puppet last run on elastic1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:21] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:47:30] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:40] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[07:39:13] (03CR) 10Filippo Giunchedi: diamond: collectors require python-diamond (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/208840 (owner: 10Hashar)
[07:44:51] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[07:49:50] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0]
[07:59:08] !log test reboot fluorine with new disk
[07:59:16] Logged the message, Master
[08:04:18] _joe_: morning, are you aware of the fact hhvm 3.7.0 was released today ?
[08:04:36] <_joe_> matanya: yeah but it's not a LTS release
[08:04:40] <_joe_> we just track those
[08:05:16] <_joe_> and well, 3.3 => 3.6 has been quite complicated to do.
[08:05:29] <_joe_> and we can't really use FB's packages either
[08:05:41] ah, ok. thanks for that. another question, if you don't mind. will the video scalers support vp9 in the near future ?
[08:06:41] <_joe_> matanya: no idea, sorry, I'm working on something completely different right now
[08:07:06] i'll see if there is a ticket for that
[08:07:23] <_joe_> it's surely something we may want to do, but I don't see anyone with the bandwidth to work on that, nor in ops or in any other post-reorg team
[08:07:47] found it: https://phabricator.wikimedia.org/T55863
[08:07:54] <_joe_> but I may be wrong, my views are quite fuzzy right now - dust will settle eventually
[08:08:50] yeah, i see. will wait for this too. thanks much _joe_
[08:11:41] <_joe_> matanya: as a community member, you should probably speak with someone in product to ask for resources dedicated to the videoscalers/multimedia in general
[08:12:39] _joe_: sad to say, but from my POV, multimedia and admin tools are the most neglected areas at WMF eng department.
[08:13:27] <_joe_> matanya: well, I may agree, but tbh I think it's a good idea to focus our dev efforts on specific goals and try to nail those down
[08:13:51] <_joe_> instead of continuosly work on all the 1000 things we do. Our resources are quite constrained
[08:16:40] yes, fair point
[08:46:15] (03PS1) 10Faidon Liambotis: Revert "Depool ulsfo, network troubles" [dns] - 10https://gerrit.wikimedia.org/r/208918
[08:46:25] (03PS2) 10Faidon Liambotis: Revert "Depool ulsfo, network troubles" [dns] - 10https://gerrit.wikimedia.org/r/208918
[08:46:31] (03CR) 10Faidon Liambotis: [C: 032] Revert "Depool ulsfo, network troubles" [dns] - 10https://gerrit.wikimedia.org/r/208918 (owner: 10Faidon Liambotis)
[08:47:01] !log repooling ulsfo
[08:47:09] Logged the message, Master
[09:04:50] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[09:08:10] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0]
[09:09:44] (03Abandoned) 10Hashar: diamond: collectors require python-diamond [puppet] - 10https://gerrit.wikimedia.org/r/208840 (owner: 10Hashar)
[09:09:49] (03CR) 10Hashar: diamond: collectors require python-diamond (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/208840 (owner: 10Hashar)
[09:10:59] PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail
[09:14:10] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:20:36] hashar: every puppet dependency issue can be solved by running puppet often enough =p
[09:22:55] (which is part of what makes them hard to debug...)
[09:23:40] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[09:25:20] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[09:30:36] hashar: maybe we should switch to MOAR? https://puppetlabs.com/blog/introducing-manifest-ordered-resources
[09:30:42] hashar: I agree it is simpler to treat puppet as converging over time
[09:31:01] basically, there's a puppet setting that keeps the ordering in the manifest file
[09:31:08] <_joe_> godog: simpler and lamer
[09:31:09] <_joe_> :P
[09:32:04] machines don't judge!
[09:32:37] uncertain gain vs certain loss
[09:32:52] it's like LaTeX! Just run it a gazillion times, and at some point your references will be right
[09:33:44] yep, and if it diverges you'll fairly quickly
[09:34:39] you'll know, even
[09:37:56] <_joe_> valhallasw`nuage: I don't think I never ever needed to recompile a latex doc more than three times before it was ok
[09:38:01] <_joe_> :P
[09:38:13] <_joe_> with puppet, OTOH...
[09:40:44] _joe_: when I insert a new bibtex reference, 3 is standard (compile, resolve bibtex references, compile with inserted bibliography, recompile to get references inserted), and 4 happens regularly (because something shifts just over a page edge due to the insertion of those references).
[09:41:01] _joe_: and it's probably 5 passes when starting from just the .tex and .bib, but I try not to do that :D
[09:41:21] but yeah, puppet is more... random
[09:41:22] <_joe_> valhallasw`nuage: I think it was 3 for bibtex/toc
[09:41:41] <_joe_> valhallasw`nuage: I don't seriously use latex since I left academia though, so... 2008
[09:42:04] <_joe_> I mean I used latex-beamer a couple of times for presentations before I chose life
[09:47:10] I thought latex-beamer was cool once, because you could add pretty formulas to your slides. At some point I realized formulas on slides are a bad idea...
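The three-to-four pass cycle valhallasw`nuage describes maps onto a standard compile sequence. A sketch only, assuming a `doc.tex`/`doc.bib` pair and `pdflatex`/`bibtex` on PATH (tools like `latexmk` automate this loop):

```shell
# The pass structure behind "3 is standard, 4 happens regularly":
pdflatex doc.tex   # pass 1: records \cite keys and \label positions in doc.aux
bibtex doc         # resolves doc.aux against doc.bib, writing doc.bbl
pdflatex doc.tex   # pass 2: typesets the bibliography from doc.bbl
pdflatex doc.tex   # pass 3: updates citation numbers and cross-references;
                   # run once more if "Label(s) may have changed" persists
                   # (the 4th pass, when the bibliography shifts a page break)
```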
[09:47:23] <_joe_> well, it depends
[09:47:36] <_joe_> if you're presenting some theoretical physics paper, maybe :P
[09:48:51] 6operations, 5Continuous-Integration-Isolation: Disable diamond collector on contintcloud labs project - https://phabricator.wikimedia.org/T98121#1259982 (10hashar) 3NEW a:3hashar
[09:57:01] (03PS1) 10Hashar: standard: ability to disable diamond [puppet] - 10https://gerrit.wikimedia.org/r/208924 (https://phabricator.wikimedia.org/T98121)
[09:58:11] (03CR) 10Hashar: "The class parameter default to true so that should be a complete noop unless I fail to understand puppet." [puppet] - 10https://gerrit.wikimedia.org/r/208924 (https://phabricator.wikimedia.org/T98121) (owner: 10Hashar)
[10:00:33] hashar: famous last words
[10:00:46] valhallasw`nuage: if only puppetlabs.com worked :D
[10:01:27] valhallasw`nuage: a friend demoed me propellor, a configuration management system using haskell
[10:01:48] and it tells you what's wrong with it before you even run ghc? :p
[10:01:58] valhallasw`nuage: you can compile your "manifest" locally and that does the validation / ensure all cases are handled because... haskell!
[10:02:04] https://propellor.branchable.com/
[10:02:15] the author has a bunch of blog posts
[10:02:36] hashar: I suppose, but it can't make sure the manifest actually works, because it e.g. doesn't know what the effect of a change will be
[10:02:50] yuo
[10:02:51] yup
[10:02:59] but the dependencies / ordering is dealt with earlier
[10:03:20] <_joe_> propellor: propel yourself into pain and irrelevance with haskell!!!1!
[10:03:46] or we can make puppet smart enough to track what it installs, and to understand why failures are now suddenly solved
[10:03:50] hmm.
[10:04:19] <_joe_> valhallasw`nuage: what is the problem you want to solve?
[10:04:20] PROBLEM - puppet last run on fluorine is CRITICAL: Connection refused by host
[10:04:33] <_joe_> godog: this you ^^?
[10:04:35] _joe_: manifests that only work on the third run and no-one understanding why :-p
[10:04:45] <_joe_> well, which ones?
[10:04:58] _joe_: yeah downtime finished, rescheduling
[10:05:15] <_joe_> valhallasw`nuage: usually that's because they're poorly written or not properly refactored when new features are added
[10:05:46] the diamond one hashar tried to fix, for instance
[10:05:48] <_joe_> valhallasw`nuage: or, one of the numerous bugs in puppet's dependency chanins
[10:06:11] sure, it's typically because the manifest is wrong, but it's really hard to see /why/ it's wrong because it's hard to guess depenencies
[10:06:29] and it's hard to test because the issue 'solves itself' on subsequent puppet runs
[10:06:35] <_joe_> uhm I'm not usually in that position :)
[10:06:56] <_joe_> but I dunno, I'd have to look at the manifest, and I don't have time right now
[10:09:05] 7Puppet, 6operations, 10Beta-Cluster: Trebuchet on deployment-bastion: wrong group owner - https://phabricator.wikimedia.org/T97775#1260027 (10mobrovac) >>! In T97775#1252075, @chasemp wrote: > sure, I mean all of those should be owned by trebuchet and deployment since deployment is the group for deployers....
[10:09:08] This kernel does not support a non-PAE CPU.
[10:09:09] \o/
[10:12:01] (03PS1) 10Alexandros Kosiaris: Disable manual puppetmaster start/restart [puppet] - 10https://gerrit.wikimedia.org/r/208926
[10:12:46] (03CR) 10Giuseppe Lavagetto: [C: 031] "Obviously a good idea" [puppet] - 10https://gerrit.wikimedia.org/r/208926 (owner: 10Alexandros Kosiaris)
[10:13:27] (03PS4) 10Giuseppe Lavagetto: hiera: Add a proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/207128 (https://phabricator.wikimedia.org/T93776)
[10:14:02] (03CR) 10Hashar: "Shouldn't it autostart on labs instances having their self puppetmaster?" [puppet] - 10https://gerrit.wikimedia.org/r/208926 (owner: 10Alexandros Kosiaris)
[10:14:07] <_joe_> akosiaris: I am going to merge the proxy backend, so that we can later do a full catalog differ run on the change that starts using it.
[10:15:32] hashar, uh... win?
[10:15:50] RECOVERY - puppet last run on fluorine is OK Puppet is currently enabled, last run 22 minutes ago with 0 failures
[10:17:08] (03CR) 10Alexandros Kosiaris: "@hashar, labs instances do not use passenger (well they can but that requires manual work so we can safely assume almost none does)" [puppet] - 10https://gerrit.wikimedia.org/r/208926 (owner: 10Alexandros Kosiaris)
[10:19:55] _joe_: ok
[10:32:22] 6operations, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Investigation if Fluorine needs bigger disks or we retain too much data - https://phabricator.wikimedia.org/T92417#1260071 (10fgiunchedi) ``` 8 0 2930266584 sda 8 1 7811072 sda1 8 2 78125056 sda2 8 3 18671...
[10:33:58] (03PS1) 10Alexandros Kosiaris: graphoid: create admin group [puppet] - 10https://gerrit.wikimedia.org/r/208929
[10:40:28] (03CR) 10Alexandros Kosiaris: [C: 032] graphoid: create admin group [puppet] - 10https://gerrit.wikimedia.org/r/208929 (owner: 10Alexandros Kosiaris)
[10:40:52] (03CR) 10Alexandros Kosiaris: [C: 032] Disable manual puppetmaster start/restart [puppet] - 10https://gerrit.wikimedia.org/r/208926 (owner: 10Alexandros Kosiaris)
[11:00:27] (03PS1) 10Alexandros Kosiaris: Assign weights to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/208933
[11:09:36] 6operations: Scale up puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1260135 (10akosiaris) 3NEW a:3akosiaris
[11:14:22] 6operations: Investigate the compatibility of our puppet tree with ruby1.9 and create a plan to upgrade. - https://phabricator.wikimedia.org/T98129#1260150 (10akosiaris) 3NEW a:3akosiaris
[11:16:12] 6operations: Scale up puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1260161 (10akosiaris)
[11:16:12] 7Puppet, 6operations: puppet masters are maxed out - https://phabricator.wikimedia.org/T97989#1260160 (10akosiaris)
[11:16:45] 6operations: Scale up puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1260135 (10akosiaris)
[11:16:45] 6operations: Investigate the compatibility of our puppet tree with ruby1.9 and create a plan to upgrade. - https://phabricator.wikimedia.org/T98129#1260174 (10akosiaris)
[11:17:13] (03PS2) 10Alexandros Kosiaris: Assign weights to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/208933 (https://phabricator.wikimedia.org/T98128)
[11:19:35] E: Unable to locate package quickstack
[11:19:39] how annoying :]
[11:22:25] (03CR) 10Alexandros Kosiaris: WIP: Proper labs_storage class (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren)
[11:23:46] Krenair: can you arrage otrs access for me to the crash reports queue please ?
[11:24:03] i am an idiot. i have it.
[11:25:56] matanya, you are not an idiot :> pls be kind to yourself :P :-)
[11:31:40] 6operations: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1260188 (10faidon)
[11:33:18] 6operations: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1260150 (10faidon) jessie comes with 2.1 (which should be even faster than 1.9) so I adjusted the description accordingly. That said, I'd expect most of the compatibi...
[11:36:21] 6operations, 7Monitoring: Upgrade to newer version of gdash - https://phabricator.wikimedia.org/T98134#1260208 (10faidon) 3NEW
[11:39:37] ah puppet apply does not accept multiples --execute :/
[11:39:44] but bash can helps! puppet apply <(echo -e "notify { 'foo': }\nnotify { 'bar': }" )
[11:54:49] (03PS1) 10Alexandros Kosiaris: ganglia_new::web. Increase the default memory limit [puppet] - 10https://gerrit.wikimedia.org/r/208937 (https://phabricator.wikimedia.org/T97637)
[11:58:04] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia_new::web. Increase the default memory limit [puppet] - 10https://gerrit.wikimedia.org/r/208937 (https://phabricator.wikimedia.org/T97637) (owner: 10Alexandros Kosiaris)
[12:00:14] godog: https://gerrit.wikimedia.org/r/#/c/208933/2/manifests/role/puppetmaster.pp,cm
[12:00:23] I think this should give us some breathing room
[12:00:49] PROBLEM - puppet last run on cp3036 is CRITICAL puppet fail
[12:05:35] (03CR) 10Filippo Giunchedi: [C: 032] Assign weights to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/208933 (https://phabricator.wikimedia.org/T98128) (owner: 10Alexandros Kosiaris)
[12:05:42] akosiaris: sweet! LGTM
[12:18:43] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) (owner: 10Gage)
[12:18:49] RECOVERY - puppet last run on cp3036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:35:20] hoo: server side upload for me please ?
[12:35:39] Sure thing
[12:35:44] What do you need uploaded?
[12:35:51] a movie
[12:36:03] do you have access to the video project ?
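The bash trick hashar mentions works because process substitution `<( ... )` exposes a command's stdout under a `/dev/fd` path, so a tool that expects a single manifest file can read generated text. A minimal demonstration, with `cat` standing in for `puppet apply` (which is assumed installed in the original):

```shell
#!/bin/bash
# <( ... ) expands to a readable /dev/fd/N path containing the command's
# output; `cat` plays the role of `puppet apply` reading a manifest file.
cat <(echo -e "notify { 'foo': }\nnotify { 'bar': }")
# prints:
#   notify { 'foo': }
#   notify { 'bar': }
```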
[12:36:55] (03PS1) 10Alexandros Kosiaris: Notify apache on ganglia_new::web php.ini updates [puppet] - 10https://gerrit.wikimedia.org/r/208947
[12:36:55] hoo: the file is in encoding01.wmflabs.org:/home/matanya/Sintel.webm
[12:37:26] the description is in the same as https://commons.wikimedia.org/w/index.php?title=File:Sintel_movie_720x306.ogv&action=edit
[12:37:38] (03CR) 10jenkins-bot: [V: 04-1] Notify apache on ganglia_new::web php.ini updates [puppet] - 10https://gerrit.wikimedia.org/r/208947 (owner: 10Alexandros Kosiaris)
[12:37:46] the file name should be Sintel movie 4K.webm
[12:38:49] Ok, will do
[12:39:16] thanks
[12:42:11] (03PS2) 10Alexandros Kosiaris: Notify apache on ganglia_new::web php.ini updates [puppet] - 10https://gerrit.wikimedia.org/r/208947
[12:46:05] hoo: please ping me when done.
[12:46:31] I'm having slight trouble with rsync, give me a sec
[12:46:41] I hate having to do the proxycommand stuff inline...
[12:47:59] I manage to get onto the labs bastion, but not onto encoding02.eqiad.wmflabs:22
[12:48:32] hoo: encoding01
[12:48:58] That explains a lot...
[12:49:10] rsync: change_dir "/home/matanya" failed: Permission denied (13)
[12:49:33] do you want me to move it to somewhere shared ?
[12:49:38] Can you move it into a dir I have +x on?
[12:49:48] yes, please
[12:51:12] moving to /data/project/wikimania2014
[12:51:18] will take some minutes
[12:51:23] Ok
[12:51:28] or not. done.
[12:51:44] Nice
[12:52:32] Copying at 38MB/s :)
[12:54:12] (03CR) 10Alexandros Kosiaris: [C: 032] Notify apache on ganglia_new::web php.ini updates [puppet] - 10https://gerrit.wikimedia.org/r/208947 (owner: 10Alexandros Kosiaris)
[12:56:05] 6operations, 10Traffic, 7discovery-system: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1260323 (10Joe) p:5Low>3High a:3Joe
[12:56:33] matanya: Upload with your main account?
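The inline ProxyCommand that hoo complains about can be replaced with a persistent client-side config fragment, so plain `rsync`/`scp` to labs-internal hosts works directly. A sketch only: the bastion hostname is an assumption, and `ProxyJump` requires OpenSSH 7.3+ (older clients need the commented `ProxyCommand` form instead):

```
# ~/.ssh/config -- route labs-internal hosts through the bastion
Host *.eqiad.wmflabs
    ProxyJump bastion.wmflabs.org
    # ProxyCommand ssh -W %h:%p bastion.wmflabs.org   # pre-7.3 OpenSSH
```

With this in place, `rsync -avP Sintel.webm encoding01.eqiad.wmflabs:/data/project/wikimania2014/` needs no inline `-e` option.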
[12:56:43] yes, please [12:57:46] Upload started [12:58:15] Uploads are very fast with the new swift servers in place nowadays :) [12:59:09] yeah 12 * 3 new spindles helped [12:59:52] i wish i could upload from server side. the limits are funny [13:00:25] Even in that case you have a 4.3(?)GiB limit [13:00:36] I hit that once or twice while uploading things for people [13:01:09] In a more perfect world we would only do server side uploads in case of huge batches of filesw [13:01:35] {{done}} [13:02:35] matanya: why can't you? [13:02:56] no rights godog [13:03:13] thank you very much hoo [13:03:18] You're welcome [13:04:05] ah, the quality! joyfull moments. [13:04:08] !log updating voter list for the FDC election for T97924 [13:04:14] Logged the message, Master [13:05:45] (03CR) 10JanZerebecki: "As detailed in the ticket beta is more complicated. Because it uses the same wiki id for the config as www.wikidata.org so divering those " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208654 (https://phabricator.wikimedia.org/T97993) (owner: 10JanZerebecki) [13:09:20] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 [13:10:09] PROBLEM - puppet last run on mw2002 is CRITICAL puppet fail [13:10:09] PROBLEM - puppet last run on tin is CRITICAL puppet fail [13:10:09] PROBLEM - puppet last run on mw1149 is CRITICAL puppet fail [13:10:10] PROBLEM - puppet last run on cp4019 is CRITICAL puppet fail [13:10:10] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail [13:10:19] PROBLEM - puppet last run on wtp1012 is CRITICAL puppet fail [13:10:20] PROBLEM - puppet last run on lithium is CRITICAL puppet fail [13:10:30] PROBLEM - puppet last run on snapshot1001 is CRITICAL puppet fail [13:10:40] PROBLEM - puppet last run on mw1118 is CRITICAL Puppet has 50 failures [13:10:40] PROBLEM - puppet last run on mw1044 is CRITICAL puppet fail [13:10:40] PROBLEM - puppet last run on silver is CRITICAL puppet fail [13:10:50] PROBLEM - 
puppet last run on pc1002 is CRITICAL puppet fail [13:10:59] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 8 failures [13:11:00] PROBLEM - puppet last run on labstore1001 is CRITICAL puppet fail [13:11:06] well that's exciting [13:11:10] PROBLEM - puppet last run on mw2166 is CRITICAL puppet fail [13:11:10] PROBLEM - puppet last run on labcontrol2001 is CRITICAL puppet fail [13:11:10] PROBLEM - puppet last run on mw1211 is CRITICAL Puppet has 1 failures [13:11:10] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 3 failures [13:11:10] PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 8 failures [13:11:10] PROBLEM - puppet last run on db1021 is CRITICAL Puppet has 9 failures [13:11:10] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 8 failures [13:11:19] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 9 failures [13:11:20] PROBLEM - puppet last run on wtp2019 is CRITICAL puppet fail [13:11:20] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 10 failures [13:11:20] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 10 failures [13:11:20] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 8 failures [13:11:21] 7kick icinga-wm [13:11:30] PROBLEM - puppet last run on mw1177 is CRITICAL Puppet has 2 failures [13:11:30] PROBLEM - puppet last run on virt1001 is CRITICAL puppet fail [13:11:30] PROBLEM - puppet last run on mw1237 is CRITICAL puppet fail [13:11:30] PROBLEM - puppet last run on mw1054 is CRITICAL Puppet has 39 failures [13:11:30] PROBLEM - puppet last run on mw1129 is CRITICAL puppet fail [13:11:39] PROBLEM - puppet last run on mw1126 is CRITICAL Puppet has 2 failures [13:11:40] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 9 failures [13:11:40] PROBLEM - puppet last run on mw2136 is CRITICAL Puppet has 1 failures [13:11:40] PROBLEM - puppet last run on wtp2015 is CRITICAL Puppet has 5 failures [13:11:40] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 10 
failures [13:11:40] PROBLEM - puppet last run on mw2090 is CRITICAL Puppet has 8 failures [13:11:40] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 7 failures [13:11:49] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 4 failures [13:11:49] akosiaris: ^ ? [13:11:59] PROBLEM - puppet last run on mw1039 is CRITICAL Puppet has 54 failures [13:12:00] PROBLEM - puppet last run on mw1011 is CRITICAL Puppet has 38 failures [13:12:00] PROBLEM - puppet last run on dataset1001 is CRITICAL Puppet has 16 failures [13:12:09] PROBLEM - puppet last run on mw1213 is CRITICAL Puppet has 1 failures [13:12:10] Error: Could not send report: Connection reset by peer - SSL_connect [13:12:10] PROBLEM - puppet last run on labnet1001 is CRITICAL Puppet has 1 failures [13:12:19] PROBLEM - puppet last run on mw1175 is CRITICAL Puppet has 8 failures [13:12:20] PROBLEM - puppet last run on db2042 is CRITICAL Puppet has 6 failures [13:12:20] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 10 failures [13:12:20] PROBLEM - puppet last run on mw2030 is CRITICAL Puppet has 6 failures [13:12:20] PROBLEM - puppet last run on mc1012 is CRITICAL Puppet has 7 failures [13:12:29] PROBLEM - puppet last run on mw2146 is CRITICAL Puppet has 10 failures [13:12:30] PROBLEM - puppet last run on mw2114 is CRITICAL Puppet has 9 failures [13:12:30] PROBLEM - puppet last run on mw2059 is CRITICAL Puppet has 10 failures [13:12:30] PROBLEM - puppet last run on mw2092 is CRITICAL Puppet has 9 failures [13:12:30] PROBLEM - puppet last run on mw2056 is CRITICAL Puppet has 4 failures [13:12:30] PROBLEM - puppet last run on mw1172 is CRITICAL Puppet has 5 failures [13:12:48] godog: that would be me [13:13:08] !log restarted apache2 on palladium [13:13:15] Logged the message, Master [13:13:20] PROBLEM - puppet last run on mw2011 is CRITICAL Puppet has 5 failures [13:14:11] godog: is there a reason the file is not transcoded ?
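hoo's "4.3(?)GiB" server-side upload limit mentioned a bit earlier matches the 32-bit file-size boundary: 2^32 bytes is about 4.29 decimal gigabytes. That correspondence is an inference from the number, not something stated in the log; names below are illustrative.

```python
# Hypothetical illustration: the "4.3(?)GiB" limit quoted above is consistent
# with a 32-bit size field, i.e. 2**32 bytes.
MAX_UPLOAD_BYTES = 2 ** 32                 # 4 294 967 296 bytes

def fits_upload_limit(size_bytes: int) -> bool:
    """True if a file of this size stays under the 32-bit boundary."""
    return size_bytes < MAX_UPLOAD_BYTES

# matanya's 3.26 GB Sintel file fits comfortably under it:
sintel_bytes = int(3.26e9)
limit_in_gb = MAX_UPLOAD_BYTES / 10 ** 9   # ~4.29 decimal GB
```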
[13:14:24] seemed like a reload would not pickup the balancer change [13:16:03] matanya: I know very little about the multimedia pipeline on the mw side sadly [13:16:23] that is a question for the mutlimedia guys ? [13:17:08] I think so, but you'd have to be more specific [13:18:35] (03CR) 10Filippo Giunchedi: [C: 031] ipsec-global: fix bug in non-verbose mode, exit if not root [puppet] - 10https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) (owner: 10Gage) [13:22:10] PROBLEM - MySQL Idle Transactions on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 948 seconds [13:22:19] ohoh [13:23:44] db1040 doesn't sound like puppetmaster issues heh [13:24:15] commonswiki [13:24:30] 1000+ secs of Sleep connections [13:24:32] not many though [13:24:42] 7 [13:25:09] and they 're gone now [13:26:15] !log killed db1040 blocking txns T97641 again [13:26:28] 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, 3Wikidata-Sprint-2015-04-21: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1260398 (10JanZerebecki) a:5hoo>3daniel What pattern can one search for to find old serialization? [13:26:28] video scalers are not happy either [13:26:40] RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:26:40] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [13:26:49] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [13:26:54] matanya: how much data was that upload? 
[13:27:00] RECOVERY - puppet last run on lithium is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:27:00] RECOVERY - puppet last run on dataset1001 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:27:09] RECOVERY - puppet last run on mw1213 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:27:10] RECOVERY - MySQL Idle Transactions on db1040 is OK longest blocking idle transaction sleeps for 0 seconds [13:27:10] RECOVERY - puppet last run on labnet1001 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [13:27:10] RECOVERY - puppet last run on mw1118 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:27:19] RECOVERY - puppet last run on mw1175 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [13:27:20] RECOVERY - puppet last run on db2042 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:27:20] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:27:20] RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [13:27:20] RECOVERY - puppet last run on mc1012 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [13:27:29] RECOVERY - puppet last run on pc1002 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [13:27:30] RECOVERY - puppet last run on mw2146 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [13:27:30] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:30] RECOVERY - puppet last run on mw2059 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:27:30] RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:27:30] RECOVERY - puppet last 
run on mw1172 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [13:27:30] RECOVERY - puppet last run on mw2056 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [13:27:30] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:31] RECOVERY - puppet last run on labstore1001 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:27:40] RECOVERY - puppet last run on mw1211 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:40] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:40] RECOVERY - puppet last run on db1021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:50] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:50] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:50] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:27:51] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:27:51] RECOVERY - puppet last run on wtp2019 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:27:51] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:51] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [13:28:00] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:00] godog: 3.26 GB [13:28:01] RECOVERY - puppet last run on virt1001 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [13:28:01] RECOVERY - puppet last run on 
mw1177 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:09] RECOVERY - puppet last run on mw1237 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:28:09] RECOVERY - puppet last run on mw1054 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:10] RECOVERY - puppet last run on mw1129 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:28:10] godog: springle killing those txns and matanya's video might be related [13:28:10] RECOVERY - puppet last run on mw1126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:20] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:20] RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:28:20] RECOVERY - puppet last run on mw2136 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:28:20] RECOVERY - puppet last run on mw2090 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:28:21] RECOVERY - puppet last run on mw1149 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [13:28:21] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:21] RECOVERY - puppet last run on mw2011 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:28:21] RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:28:28] akosiaris: very likely, http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Video+scalers+eqiad&h=&tab=m&vn=&hide-hf=false&m=network_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [13:28:30] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:30] RECOVERY - puppet last run on 
wtp1012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:30] RECOVERY - puppet last run on mw1039 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:40] RECOVERY - puppet last run on mw1011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:48] matanya: 3.26gb total? ack [13:28:49] RECOVERY - puppet last run on snapshot1001 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:28:50] RECOVERY - puppet last run on mw1044 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:28:50] RECOVERY - puppet last run on silver is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:29:08] godog: should i be sorry ? [13:29:20] RECOVERY - puppet last run on mw2166 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:29:59] RECOVERY - puppet last run on mw2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:30:06] For the record, "the multimedia guys" are now three JavaScript coders, so sending people to us about transcoding problems might work about 20% of the time. :) [13:30:08] matanya: no I don't think so, it'll take a while for the video scalers to chug through that, so that might be the answer to your transcode question [13:30:39] thanks [13:30:56] marktraceur: haha how come 20%? [13:31:55] godog: Shrug, just a wild guess [13:32:17] somehow I am already seeing transcoid on the horizon [13:33:18] marktraceur: hehe will keep that in mind [13:33:22] we should wikipedoid, a bot writting articles [13:33:30] *should have [13:33:40] that would be a nice service [13:34:50] matanya: We outsourced that job to Bollywood agents. They appear to continue doing it by hand. [13:35:01] haha [13:37:14] (03CR) 10Ottomata: "OO, nice! Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda) [13:37:20] 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, and 2 others: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1260419 (10Tobi_WMDE_SW) a:5daniel>3hoo [13:38:00] PROBLEM - MySQL InnoDB on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 818 seconds [13:39:11] great [13:39:44] springle: is that my video ? [13:39:49] RECOVERY - MySQL InnoDB on db1040 is OK longest blocking idle transaction sleeps for 0 seconds [13:40:02] matanya: no idea [13:40:49] I doubt it [13:41:46] There are several connections from terbium opened, maybe someone is doing weird stuff? [13:42:45] the connections I've seen come from videoscalers T97641. Run that query, then sleep and hold locks for minutes [13:43:47] It only really matters when multiple connections back up waiting on each other. Except that anything holding resources like this ends up mattering on a master. [13:44:15] while waiting for avconv to finish perhaps? [13:44:16] www-data 18252 147 1.4 1041756 240920 ? RNl 13:04 53:07 | \_ /usr/bin/avconv [13:44:19] etc [13:44:29] perhaps so. bad design [13:45:16] I am pretty sure that the new labvirt1007 has all the same puppet roles as labvirt1001-1006 and yet it isn’t showing up in ganglia reports. Where should I start? [13:46:31] springle: MediaWiki opens a transaction per default [13:46:56] usually you need to flush it before doing such blocking things [13:49:10] (03PS3) 10Filippo Giunchedi: graphite: split alerts role [puppet] - 10https://gerrit.wikimedia.org/r/208083 (https://phabricator.wikimedia.org/T97754) [13:49:41] ottomata: thoughts on https://gerrit.wikimedia.org/r/#/c/207805/ ? [13:49:50] andrewbogott: if it's a brand-new fresh host, try restarting ganglia-monitor service? 
sometimes it's hosed on first start [13:50:03] bblack: ok, trying [13:50:22] (03CR) 10Ottomata: [C: 031] varnishkafka: use statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/207805 (https://phabricator.wikimedia.org/T95687) (owner: 10Filippo Giunchedi) [13:50:26] godog: if you are ready, go for it! [13:51:08] ottomata: yep I am! no other action needed? e.g. restart? [13:51:22] naw, its a cron [13:51:40] (03PS2) 10Filippo Giunchedi: varnishkafka: use statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/207805 (https://phabricator.wikimedia.org/T95687) [13:51:40] i guess eventually remove the local statsds? [13:51:46] ottomata: Got a second? [13:51:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnishkafka: use statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/207805 (https://phabricator.wikimedia.org/T95687) (owner: 10Filippo Giunchedi) [13:51:50] sure [13:52:09] ottomata: yep, got https://gerrit.wikimedia.org/r/208635 out but jenkins says no [13:52:13] ottomata: Can you have a look at oxygen... according to Ganglia it has some weird swap setting [13:52:27] (03PS2) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [13:52:44] I wonder what swpaon -s says [13:52:46] (03PS3) 10Andrew Bogott: Add a couple of settings to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/205979 [13:52:48] (03PS1) 10Andrew Bogott: Rename virt1012 to labvirt1008. [puppet] - 10https://gerrit.wikimedia.org/r/208954 [13:53:18] ok, hoo, oxygen no swap [13:53:23] i recently reinstalled it [13:53:57] bblack: that doesn’t seem to have done it, although maybe I just need to wait longer [13:54:26] ottomata: Weird... ganglia is going nuts on it [13:54:33] link? 
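The db1040 diagnosis above — videoscalers run a query, then sit in avconv for minutes while MediaWiki's implicitly opened transaction still holds locks — comes down to "commit before you block". A minimal sketch of that pattern, with sqlite3 standing in for MySQL and a callback standing in for /usr/bin/avconv (table and column names are made up):

```python
import sqlite3

def transcode_job(conn, run_encode):
    """Mark a job running, encode, mark it done -- without holding a
    transaction open across the long blocking encode step."""
    conn.execute("UPDATE transcode SET state = 'running' WHERE id = 1")
    # Commit here: otherwise the implicitly opened transaction (and its row
    # locks) would stay open for the whole encode, as happened on db1040.
    conn.commit()
    run_encode()  # stands in for the minutes-long avconv run
    conn.execute("UPDATE transcode SET state = 'done' WHERE id = 1")
    conn.commit()
```

With the default isolation level, `conn.in_transaction` is False while `run_encode` executes, i.e. no transaction is held open during the blocking step.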
[13:54:36] http://ganglia.wikimedia.org/latest/graph.php?h=oxygen.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1430833995&g=mem_report&z=medium&c=Miscellaneous%20eqiad [13:54:56] ha, hm weird [13:55:00] cool, look at all that swap! [13:55:14] (03PS1) 10Andrew Bogott: Rename virt1012 to labvirt1008 [dns] - 10https://gerrit.wikimedia.org/r/208955 [13:55:53] Has a lot of swap death potential :D [14:01:16] (03PS6) 10Alexandros Kosiaris: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [14:01:57] (03CR) 10jenkins-bot: [V: 04-1] mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [14:02:41] (03PS2) 10Andrew Bogott: Rename virt1012 to labvirt1008. [puppet] - 10https://gerrit.wikimedia.org/r/208954 [14:04:00] (03CR) 10Andrew Bogott: [C: 032] Rename virt1012 to labvirt1008. [puppet] - 10https://gerrit.wikimedia.org/r/208954 (owner: 10Andrew Bogott) [14:04:16] (03CR) 10Andrew Bogott: [C: 032] Rename virt1012 to labvirt1008 [dns] - 10https://gerrit.wikimedia.org/r/208955 (owner: 10Andrew Bogott) [14:07:21] !log shut fluorine to replace sdb [14:07:26] Logged the message, Master [14:09:09] (03PS7) 10Alexandros Kosiaris: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [14:12:49] PROBLEM - Host virt1012 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:59] ^that’s me, renaming. [14:13:08] Hm, somehow I scheduled downtime for all services but not for the host itself [14:18:39] RECOVERY - Host virt1012 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [14:20:23] Coren: icinga is upset about labstore1002. Should I ack, or mark as downtime, or…? [14:21:03] andrewbogott: Hm. I was considering what to do about it now. I'm going to power it back up now, simply.
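On the oxygen swap question earlier ("I wonder what swapon -s says"): `swapon -s` just reads /proc/swaps, whose header-plus-rows layout makes the "no swap configured" case easy to verify. A sketch of parsing that format (the sample device in the test is illustrative, not oxygen's actual output):

```python
def parse_proc_swaps(text):
    """Parse /proc/swaps content into (device, size_kib, used_kib) tuples.

    The first line is the header row; a freshly reinstalled host with no
    swap configured (like oxygen here) yields an empty list.
    """
    rows = []
    for line in text.splitlines()[1:]:   # skip "Filename Type Size Used Priority"
        parts = line.split()
        if len(parts) >= 5:
            rows.append((parts[0], int(parts[2]), int(parts[3])))
    return rows
```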
[14:21:11] ok then :) [14:22:09] cmjohnson1: you sound maybe busy, but do you think you can get to this today? https://phabricator.wikimedia.org/T98081 [14:22:35] ottomata: I will look at it shortly...i promise :-) [14:22:50] thank you! [14:23:28] 10Ops-Access-Requests, 6operations: Requesting access to Rhenium for dkg - https://phabricator.wikimedia.org/T98148#1260568 (10dkg) 3NEW [14:26:43] ottomata: is an1037 off? [14:26:49] can I turn on? [14:27:29] yes [14:31:10] PROBLEM - Host analytics1037 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:35] RECOVERY - Host analytics1037 is UP: PING OK - Packet loss = 0%, RTA = 5.72 ms [14:42:45] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset -0.00213599205 secs [14:42:47] (03PS3) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [14:48:49] akosiaris: i'm getting 503 when uploading files [14:48:56] oh, he is not here [14:49:50] Coren: labstore1002 is still red in icinga. [14:50:00] godog: graphite2001 too [14:50:12] paravoid: Yeah, it's not coming up. [14:50:29] paravoid: I think the hw issue may have been deeper than just a badly seated card after all. [14:51:15] PROBLEM - Host labvirt1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:53:46] manybubbles, ^d, thcipriani, marktraceur: Who wants to SWAT this morning? [14:53:46] too busy today - sorry:( [14:53:46] akosiaris, hi, can you help with the trebuchet usage? [14:53:46] I'm trying to poke in the BIOS to see if I can get the EFI alerts to figure out what's up [14:53:46] matt_f_night, Dereckson: Ping for SWAT in about 7 minutes [14:53:46] RECOVERY - Host labstore1002 is UP: PING OK - Packet loss = 0%, RTA = 2.90 ms [14:53:46] Present [14:53:56] RECOVERY - Host labvirt1008 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [14:54:20] anomie: can do swat, a little worried about my internet at the moment, but I think it's just a little sluggish.
[14:54:25] thcipriani: ok [14:54:38] <^d> busy as well [14:55:35] matt_flaschen: can you go ahead and merge your extension changes and bump the submodule on core for wmf{3,4} [14:55:44] thcipriani, yes, already in progress. [14:55:51] matt_flaschen: thanks! [14:56:30] Allright, I managed to have it boot at the third hardreset but I now call it officially suspect. [14:57:45] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 7.69% of data above the critical threshold [500.0] [14:57:47] ... aaaand it gets tons of IO errors. [14:57:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [14:59:22] (03PS2) 10Filippo Giunchedi: statsite: decommission class [puppet] - 10https://gerrit.wikimedia.org/r/208635 (https://phabricator.wikimedia.org/T95687) [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, matt_flaschen, Dereckson: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150505T1500). Please do the needful. [15:00:30] ACKNOWLEDGEMENT - Host labstore1002 is DOWN: PING CRITICAL - Packet loss = 100% Coren Powered down - Hardware issues [15:00:32] paravoid: mhh I'm seeing only the 500 alert? which should go away once https://gerrit.wikimedia.org/r/#/c/208083/ is merged [15:00:51] godog: and a bunch of UNKNOWNs [15:02:46] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [15:02:48] paravoid: true, the jobq ones are https://gerrit.wikimedia.org/r/#/c/207785/ just added you if you want to take a shot [15:06:28] matt_flaschen: seems to keep failing on EchoEmailFormatterTest::testEmailFormatter [15:06:53] thcipriani, yeah, I'm trying to figure out why. I didn't change anything related to that. 
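The check_graphite-style alerts in this log ("CRITICAL 7.69% of data above the critical threshold [500.0]", "OK Less than 1.00% above the threshold") boil down to: take recent datapoints, compute what fraction exceed a limit, and compare that fraction against a critical percentage. A simplified sketch of that decision (function names and the exact comparison are assumptions, not the real check_graphite code):

```python
def percent_over(datapoints, threshold):
    """Percentage of datapoints strictly above the threshold."""
    if not datapoints:
        return 0.0
    over = sum(1 for v in datapoints if v > threshold)
    return 100.0 * over / len(datapoints)

def check_series(datapoints, threshold, crit_pct):
    """CRITICAL when more than crit_pct percent of recent datapoints
    exceed the threshold, otherwise OK."""
    return "CRITICAL" if percent_over(datapoints, threshold) > crit_pct else "OK"
```

For example, 1 of 13 recent samples above 500 req/min gives the 7.69% figure seen in the graphite2001 alert above.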
[15:07:45] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [15:08:21] (03CR) 10Filippo Giunchedi: [C: 04-1] "IMO we can leave diamond enabled and disable the statsd reporter/handler instead, optionally report to disk" [puppet] - 10https://gerrit.wikimedia.org/r/208924 (https://phabricator.wikimedia.org/T98121) (owner: 10Hashar) [15:09:27] Hi. [15:09:55] Dereckson: howdy [15:10:07] ready to get your config change out the door? [15:10:20] I'm fine, thanks, and yes I'm ready. [15:11:05] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208634 (https://phabricator.wikimedia.org/T97995) (owner: 10Dereckson) [15:11:14] (03Merged) 10jenkins-bot: Add medialib.naturalis.nl to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208634 (https://phabricator.wikimedia.org/T97995) (owner: 10Dereckson) [15:13:55] hmm: ^d anything happening with gerrit being sluggish today? This git fetch took 1m 18s [15:14:58] <_joe_> thcipriani: it's JAVA(TM) [15:15:16] <_joe_> pick your daily garbage collection fuckup/nullpointer exception [15:15:42] <_joe_> I'm sure the gerrit logs have very informative 300 lines stack traces [15:15:58] <_joe_> for things totally unrelated to the actual problems, too [15:16:55] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: Add medialib.naturalis.nl to wgCopyUploadsDomains [[gerrit:208634]] (duration: 00m 26s) [15:17:03] 6operations, 5wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1260743 (10fgiunchedi) [] match swift capacity to eqiad (+3 machines ATM) and mirror thumbs too [15:17:12] Testing. [15:17:18] thanks [15:17:41] Works. [15:17:53] _joe_: fair. OK, rephrase: anything _more_ wrong with gerrit than normal today :) [15:18:00] Dereckson: cool. thanks. [15:18:15] Thanks for the deploy. 
[15:18:45] <_joe_> thcipriani: yeah my point is that the final answer will be that :P [15:19:01] _joe_: Your java punchline makes me smile, I should feel ashamed [15:19:07] outage ? [15:19:13] (Cannot access the database: Can't connect to MySQL server on '10.64.16.22' (4) (10.64.16.22)) [15:19:59] <_joe_> matanya: that's a labs address [15:20:03] <_joe_> where are you seeing that? [15:20:08] on he.wiki [15:20:16] that is even worse then [15:20:29] <_joe_> matanya: some gadget you installed? [15:20:31] <_joe_> maybe? [15:20:38] <_joe_> try in an incognito window? [15:20:38] maybe, but not recently [15:20:45] doing [15:20:51] <_joe_> or tell me the steps to reproduce [15:20:53] 6operations, 7Graphite, 5Patch-For-Review: deprecate mwprof from puppet and gerrit - https://phabricator.wikimedia.org/T97509#1260745 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi change merged, bye bye mwprof [15:21:13] seems to be working in incognito [15:21:27] probably some gadget [15:21:29] <_joe_> matanya: then it's some gadget that works from labs :) [15:22:11] thanks [15:23:48] thcipriani, okay, I'm going to withdraw our SWAT. [15:24:15] <_joe_> Dereckson: why should you feel ashamed? :) [15:24:37] matt_flaschen: OK, sounds good. [15:25:04] <_joe_> Dereckson: I have 5+ years of kicking jvm apps out of garbage collecions under my belt. I earned the right to mock java, the jvm, and all the ecosystem [15:25:07] <_joe_> :) [15:25:45] PROBLEM - Host analytics1037 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:55] <_joe_> ottomata: is that you ^^ [15:27:13] (03PS1) 10BryanDavis: logstash: remove extra $::ganglia_aggregator [puppet] - 10https://gerrit.wikimedia.org/r/208973 [15:29:07] _joe_: that is cmjohnson1 and I [15:29:09] thanks [15:32:11] thcipriani: are you done with swat then? [15:32:31] the window is still open, but all code that was scheduled is deployed. 
[15:32:36] coolio [15:33:15] RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0] [15:33:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:35:25] (03CR) 10Giuseppe Lavagetto: [C: 032] logstash: remove extra $::ganglia_aggregator [puppet] - 10https://gerrit.wikimedia.org/r/208973 (owner: 10BryanDavis) [15:40:48] RECOVERY - Host analytics1037 is UP: PING OK - Packet loss = 0%, RTA = 2.02 ms [15:41:01] ottomata: looks good! [15:41:32] ok thanks so much, sorry for the false alarm, dunno what was going on there [15:41:44] cool, can confirm it is back in the cluster [15:41:45] oh.no false alarm...the disk was missing..idk how [15:42:24] but it's back now [15:44:48] 6operations, 10ops-eqiad: /dev/sdm not loading on analytics1037 - https://phabricator.wikimedia.org/T98081#1260802 (10Cmjohnson) 5Open>3Resolved VD13 was missing and the disk was in foreign cfg. I cleared the foreign config and re-created the virtual disk. While the server was down I also did some firmw... [15:48:35] 6operations, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Investigation if Fluorine needs bigger disks or we retain too much data - https://phabricator.wikimedia.org/T92417#1260810 (10fgiunchedi) `sdb` swapped, root and swap arrays rebuilt already, data arrays rebuilding ``` md5 : active raid1 sdb4[1] sda4... [15:52:44] bblack: Ahah! Ganglia /is/ showing metrics for labvirt1007/1008 but it’s displaying them under their old names, virt1011/1012. [15:54:31] (03PS5) 10Giuseppe Lavagetto: hiera: Add a proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/207128 (https://phabricator.wikimedia.org/T93776) [15:54:35] (03Draft1) 10Hashar: (WIP) vmbuilder with puppet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/208939 [15:56:19] akosiaris, ping [15:58:15] ottomata, do you know much about using trebuchet?
I need to update graphoid service in prod, but would like someone to hold my hand for the first time )) [15:59:19] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: Add a proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/207128 (https://phabricator.wikimedia.org/T93776) (owner: 10Giuseppe Lavagetto) [16:01:40] <_joe_> ok, since nothing is failing horribly (and it shouldn't, really) I'm going off now. [16:10:37] 6operations, 5Patch-For-Review: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1260862 (10RobH) [16:10:37] andrewbogott: thoughts on https://gerrit.wikimedia.org/r/#/c/205553/ ? [16:10:53] 6operations, 5Patch-For-Review: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1260863 (10RobH) a:5RobH>3Springle [16:11:03] godog: I’m in a meeting but I will look. I sorta thought we fixed that already... [16:12:34] andrewbogott: ack, no we didn't, hence the code review [16:15:03] 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1260868 (10Ottomata) 3NEW a:3Ottomata [16:15:24] 6operations, 10Analytics-Cluster, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1260877 (10Ottomata) [16:15:25] 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1260876 (10Ottomata) [16:15:42] 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1260868 (10Ottomata) [16:15:43] 6operations, 10Analytics-Cluster, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1115779 (10Ottomata) [16:15:51] 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. 
- https://phabricator.wikimedia.org/T98161#1260868 (10Ottomata) [16:15:52] 6operations, 10Analytics-Cluster, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1115779 (10Ottomata) [16:18:23] 6operations, 6Project-Creators, 7Documentation: create #vm-requests (a production vm cluster request project similar to #hardware-requests) - https://phabricator.wikimedia.org/T97330#1260882 (10RobH) 5Open>3Resolved I neglected the project creator link, my bad. (I put this task in specifically for that!... [16:18:25] 6operations, 7Documentation: Create documentation on the requesting/allocation of virtual machines in the misc cluster - https://phabricator.wikimedia.org/T97072#1260884 (10RobH) [16:19:35] (03PS1) 10BryanDavis: Update statsd events [tools/scap] - 10https://gerrit.wikimedia.org/r/208987 (https://phabricator.wikimedia.org/T64667) [16:22:41] (03PS1) 10Ori.livneh: wmgUseBits: default => false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 [16:22:47] bblack: ^ [16:23:35] (03CR) 10BryanDavis: Update statsd events (032 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/208987 (https://phabricator.wikimedia.org/T64667) (owner: 10BryanDavis) [16:24:07] ori: Gosh. [16:25:02] wait, we're done with bits already? [16:25:06] that was fast [16:25:36] paravoid: Right now it's off for it,de,nl,ru,es-wiki and a few small ones. [16:25:40] I know [16:25:52] paravoid: Off for enwiki too might be a prudent penultimate step, but… [16:26:11] greg-g, i have moved up the graph ext depl to the slot 9:30-11am. Hope its ok (it is in betalabs as requested :)) [16:26:47] 6operations, 7Documentation: Create documentation on the requesting/allocation of virtual machines in the misc cluster - https://phabricator.wikimedia.org/T97072#1260903 (10RobH) a:5RobH>3akosiaris Ok, #vm-requests now exists, identical to hardware requests. I've updated https://wikitech.wikimedia.org/wik... 
[16:27:04] greg-g, it was previously scheduled at 1pm, but it's a bit late for this tz [16:27:44] "beta cluster" [16:28:51] (03PS4) 10Gage: ipsec-global: fix bug in non-verbose mode, exit if not root [puppet] - 10https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) [16:28:58] yurik_: kk [16:29:40] where is the "" coming from on https://www.mediawiki.org/wiki/MediaWiki ? [16:29:42] (03CR) 10Gage: [C: 032] ipsec-global: fix bug in non-verbose mode, exit if not root [puppet] - 10https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) (owner: 10Gage) [16:30:24] ori: I'm guessing you might know ^? [16:31:10] legoktm: CentralNotice IIRC [16:31:31] yup, thanks :) [16:31:41] yurik_: yes I am now around [16:31:48] yurik_: re trebuchet [16:32:02] akosiaris, awesome, i have the next 1.5 hrs of depl time, would love your assistance [16:32:21] (03PS10) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) [16:32:29] akosiaris, can you do hangout ? [16:32:45] yup [16:33:18] James_F: If you want Neil to get his access to stat1003 can you poke him and point him at the ticket? There are actions needed from him. [16:34:54] 10Ops-Access-Requests, 6operations: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1260936 (10coren) @trevorparscal: With your approval language, this will be good to go. [16:35:42] Coren: neilpquinn is pinged.
[16:37:16] 10Ops-Access-Requests, 6operations, 6Editing-Department: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1260945 (10Jdforrester-WMF) [16:37:32] 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Give Neil Quinn access to stats1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97746#1260947 (10Jdforrester-WMF) [16:39:15] (03CR) 10Filippo Giunchedi: [C: 031] Update statsd events [tools/scap] - 10https://gerrit.wikimedia.org/r/208987 (https://phabricator.wikimedia.org/T64667) (owner: 10BryanDavis) [16:44:39] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1260966 (10fgiunchedi) for anything that's countable as lines that'll help getting us off udp2log so I think it'll work. For anything more complex than that (e.g. timings) I think we'll have to roll some... [16:45:42] 10Ops-Access-Requests, 6operations, 6Editing-Department: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1260968 (10coren) p:5Triage>3Normal [16:46:06] RECOVERY - Parsoid on wtp2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.115 second response time [16:46:16] RECOVERY - puppet last run on wtp2003 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:46:25] RECOVERY - Parsoid on wtp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.117 second response time [16:47:15] RECOVERY - puppet last run on wtp2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:47:27] 10Ops-Access-Requests, 6operations: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1260970 (10coren) @dr0ptp4kt: Care to place your imprimatur on this? 
[16:48:01] (03CR) 10Andrew Bogott: [C: 031] "I can't claim to understand what this does, but it's fine w/me :)" [puppet] - 10https://gerrit.wikimedia.org/r/205553 (https://phabricator.wikimedia.org/T92712) (owner: 10Filippo Giunchedi) [16:48:08] (03PS1) 10Dzahn: admin: add shell user for guillom [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) [16:48:14] 10Ops-Access-Requests, 6operations: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1260974 (10dr0ptp4kt) Approved. [16:48:25] @Coren ^ [16:48:44] @Coren ^^ [16:49:34] 6operations, 10Traffic, 7discovery-system: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1260975 (10Joe) So, given we chose to go ahead with etcd, we will use [[ https://github.com/kelseyhightower/confd | confd ]] for writing a single... [16:49:42] Yup. I saw. :-) [16:49:46] (03CR) 10Dzahn: [C: 04-1] "needs the UID, which we would match with the LDAP/labs/wikitech user, but there "guillom" doesn't exist yet, @guillom can you create that " [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn) [16:50:28] !log deployed latest graphoid 0.1.3 service [16:50:37] Logged the message, Master [16:50:56] mutante: I would have created the account on Wikitech myself, simply. :-) [16:51:23] mutante: I have an account on Wikitech/Labs. Could it be a capitalization issue? [16:51:29] Coren: eh, i just noticed the issue seems to be entirely different.. the ldap tools on terbium dont work? [16:51:46] mutante: I don't know, that's not where I check from normally. 
:-) [16:51:54] heh [16:52:08] (03CR) 10MaxSem: [C: 031] Enable fallback graphoid service for non-js client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207631 (owner: 10Yurik) [16:52:28] mutante: His username is 'gpaumier' [16:52:30] it's kind of the official place [16:52:35] (uid 2047) [16:52:38] there's a role for ldap tools and admins [16:52:38] (03CR) 10Yurik: [C: 032] Enable fallback graphoid service for non-js client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207631 (owner: 10Yurik) [16:52:46] Oh, ugh. Sorry about the mixup. [16:52:51] aah! that explains [16:53:03] mutante: No, it works fine there too. Username mismatch. :-) [16:53:09] yes, indeed [16:53:09] (03PS1) 10Alexandros Kosiaris: Assign graphoid-admin to the SCA cluster [puppet] - 10https://gerrit.wikimedia.org/r/208998 [16:53:32] so, ehm.. usually i would recommend the same name for production but shrug? [16:53:45] mutante: I'm fine with either [16:53:54] 6operations, 10Traffic, 7discovery-system: Figure out an etcd deploy strategy that includes multi DC failure scenarios. - https://phabricator.wikimedia.org/T98165#1260982 (10Joe) 3NEW [16:54:06] I just didn't realize that the labs username was different from the wikitech username [16:54:07] 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1260988 (10TrevorParscal) Approved. [16:54:11] Too many usernames! [16:54:30] yea, it's "shell name" vs. "wiki name" [16:54:48] Sorry for the misunderstanding [16:54:53] and then in puppet for production shell there is the resource name and another "name: " and "realname" [16:55:05] :) no worries.
amending [16:55:10] I'm fine with "gpaumier" [16:55:15] ok [16:55:19] Thanks :) [16:55:37] 10Ops-Access-Requests, 6operations: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1260992 (10coren) @niedzielski: Please post a SSH key you will use, and review and sign L3 [16:55:42] (03PS2) 10Dzahn: admin: add shell user for guillom [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) [16:57:02] 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1260995 (10Dzahn) wikitech user is gpaumier. uploaded a patch that creates a "gpaumier" user in prod. [16:58:34] (03Merged) 10jenkins-bot: Enable fallback graphoid service for non-js client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207631 (owner: 10Yurik) [16:58:38] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add andyrussg to udp2log-users group to allow him to verify kafkatee generated fundraising log files on erbium - https://phabricator.wikimedia.org/T97860#1260999 (10coren) @k4-713: Can I get your approval language for this, please? [16:59:24] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add andyrussg to udp2log-users group to allow him to verify kafkatee generated fundraising log files on erbium - https://phabricator.wikimedia.org/T97860#1261001 (10K4-713) Approved! 
[17:00:49] !log yurik Synchronized wmf-config/CommonSettings.php: Enable graphoid noscript fallback for graph ext (duration: 00m 20s) [17:00:55] Logged the message, Master [17:01:07] (03PS1) 10Dzahn: admin: add gpaumier to ana-priv-data and bastion [puppet] - 10https://gerrit.wikimedia.org/r/209001 (https://phabricator.wikimedia.org/T98077) [17:02:34] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add andyrussg to udp2log-users group to allow him to verify kafkatee generated fundraising log files on erbium - https://phabricator.wikimedia.org/T97860#1261007 (10coren) a:3coren [17:02:56] (03PS1) 10coren: Add andyrussg to udp2log-users [puppet] - 10https://gerrit.wikimedia.org/r/209002 (https://phabricator.wikimedia.org/T97860) [17:03:19] (03CR) 10John F. Lewis: [C: 031] admin: add shell user for guillom [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn) [17:03:48] (03CR) 10John F. Lewis: [C: 031] admin: add gpaumier to ana-priv-data and bastion [puppet] - 10https://gerrit.wikimedia.org/r/209001 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn) [17:04:14] (03PS3) 10Dzahn: admin: add shell user for guillom [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) [17:04:38] Coren: merci bien! [17:04:42] 10Ops-Access-Requests, 6operations: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1261013 (10Niedzielski) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCvAn78DE2Wj3QZFQSYe19eAqcGWZXQZA/TPuDjFtSdBU9yqsdcWUzfpN8ZN+dpvvQyBLbKf2MxYD2ghoo0WdUdcRoxB/7XyP5xsHLW4BRYtEf0XlPP9uC... [17:05:54] 6operations, 10ops-eqiad: Failed disk db1004 - https://phabricator.wikimedia.org/T97814#1261019 (10Cmjohnson) 5Open>3Resolved disk 10 is online. [17:06:04] (03CR) 10coren: [C: 032] "Simple group addition." 
[puppet] - 10https://gerrit.wikimedia.org/r/209002 (https://phabricator.wikimedia.org/T97860) (owner: 10coren) [17:07:56] (03PS4) 10Dzahn: admin: add shell user for guillom [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) [17:08:16] guillom :-) [17:10:17] (03CR) 10Dzahn: [C: 032] "doing this part because it just sets up the user without groups. leaving the other one as the actual access requests" [puppet] - 10https://gerrit.wikimedia.org/r/208997 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn) [17:11:09] (03CR) 10Filippo Giunchedi: "it'll make sure cgroup-bin is installed and all cgroups are where mw's limit.sh expects them, will merge tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/205553 (https://phabricator.wikimedia.org/T92712) (owner: 10Filippo Giunchedi) [17:12:08] (03PS2) 10Dzahn: admin: add gpaumier to ana-priv-data and bastion [puppet] - 10https://gerrit.wikimedia.org/r/209001 (https://phabricator.wikimedia.org/T98077) [17:13:06] akosiaris: could you check https://wikitech.wikimedia.org/wiki/Graphoid for correctness? [17:13:20] (03CR) 10Dzahn: "@Coren this one for you then to confirm" [puppet] - 10https://gerrit.wikimedia.org/r/209001 (https://phabricator.wikimedia.org/T98077) (owner: 10Dzahn) [17:13:59] 6operations, 10hardware-requests: Eqiad: 1 hardware access request for puppetmaster service scale out - https://phabricator.wikimedia.org/T98166#1261036 (10akosiaris) 3NEW [17:14:44] gwicke: yeah, on it [17:15:00] akosiaris: thanks! 
[17:15:45] PROBLEM - puppet last run on mw2031 is CRITICAL puppet fail [17:16:27] (03CR) 10Dzahn: [C: 031] "some might argue whether this should be in operations/software or here if it's not used by puppet, but i think it would be fine" [puppet] - 10https://gerrit.wikimedia.org/r/208395 (owner: 10Alex Monk) [17:17:10] (03CR) 10Dzahn: "+chasemp" [puppet] - 10https://gerrit.wikimedia.org/r/208395 (owner: 10Alex Monk) [17:17:39] 6operations, 5Patch-For-Review: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1261053 (10akosiaris) [17:17:58] 6operations, 10hardware-requests: Eqiad: 1 hardware access request for puppetmaster service scale out - https://phabricator.wikimedia.org/T98166#1261057 (10akosiaris) [17:18:00] 6operations, 5Patch-For-Review: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1260135 (10akosiaris) [17:18:15] PROBLEM - puppet last run on analytics1016 is CRITICAL Puppet has 1 failures [17:19:18] akosiaris: half of the time the requestor of the hardware knows the vlan [17:19:27] and if they dont specify, i end up asking anyhow [17:19:31] so im gonna keep asking ;D [17:19:43] so you dont want a much better system for this eh? just same as palladium... [17:20:01] well, mem/disk would be a waste [17:20:15] and we do put our older hardware to good use [17:20:28] but if you got a box with more CPU and same disk/mem specs [17:20:30] I'd be happy [17:20:52] btw, how am I supposed to know the vlan in this case ?
[17:21:02] you can pick up any box from any row [17:21:09] s/pick up/pick/ [17:21:17] and VLANs are according to rows :-) [17:21:47] public, private, labs, analytics, sandbox [17:22:08] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1261065 (10coren) [17:22:10] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Add andyrussg to udp2log-users group to allow him to verify kafkatee generated fundraising log files on erbium - https://phabricator.wikimedia.org/T97860#1261063 (10coren) 5Open>3Resolved This should be applied now, or will be very shortly (next pupp... [17:22:12] and if its analytics or labs, then its row dependent on where i can put it since they dont exist in all rows [17:22:32] but, if they dont know, they put 'i dont know' and answer further questions to help me decide [17:22:49] I thought about putting all the vlan options in there, but they change and I don't want to replicate that work. [17:23:14] where is the authoritative list of vlans? [17:23:27] good question, no idea. [17:23:35] so the switches can have them, or the dns template files [17:23:45] but, just having it in one doesnt mean its in the other [17:23:50] so you have to look at both and just figure it out [17:24:05] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 6.67% of data above the critical threshold [500.0] [17:24:15] akosiaris: I just laughed at "bad question. This is a technicality that should be added by Ops, not request the user to provide it."
:) [17:24:17] or, if you happen to have another machine, like a sister/mirror, folks put things like 'same vlan as server x' [17:24:25] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [17:24:41] im glad you guys actually read my email about all these things i sent out months ago =[ [17:24:50] * robh isnt sure why he bothers to update docs [17:25:35] robh: because then you can say 'go look at the docs' while you do more useful work [17:25:39] robh: cause it's good ? [17:25:56] I actually did read them btw, I just missed the VLAN part [17:25:57] also because every other time people don't ask you it is harder to notice [17:27:11] 6operations, 10hardware-requests: Eqiad: 1 hardware access request for puppetmaster service scale out - https://phabricator.wikimedia.org/T98166#1261085 (10RobH) a:3RobH [17:27:51] i dont get why its a bad question when half the folks who request it know the answer. if they dont, they say so, and we move on. [17:28:10] so im keeping it since i have to process the requests. [17:28:38] akosiaris: you would say 'whatever vlan the puppetmaster needs to be in' [17:28:41] and you're set. [17:29:05] im not asking for a checklist of vlan names that i expect folks to have memorized, or i'd have listed them all. [17:29:13] hehe... which is any btw [17:29:18] no [17:29:21] if i put it in labs [17:29:22] because it can be in any private vlan [17:29:23] you're fucked [17:29:24] so its not any. [17:29:29] ahahaha [17:29:40] and half the requests in the past 6 months have been labs or analytics, which are special vlans [17:29:41] come on, you wouldn't do that now, would you ?
[17:29:44] so yes, i could remove it [17:29:48] and then we can have the back and forth [17:30:03] akosiaris: the opposite has happened twice though [17:30:11] sigh [17:30:11] folks want a machine, i put in default private, and it needs to be something else [17:30:23] so rather than having a 24/48 hour back and forth on every single ticket [17:30:34] i ask in the initial form, if they dont know, they know im going to ask questions to determine it. [17:30:34] ok I get your point [17:30:49] what I am saying is that asking the VLAN explicitly is a technicality [17:30:52] 6operations, 10ops-eqiad, 10Incident-20141130-eqiad-C4: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1261097 (10faidon) OK, I also `request virtual-chassis mode mixed` the switch and rebooted it to handle it being added into a mixed VC. The steps for tomorrow would be: 1) @Cmjohns... [17:31:08] that will confuse some ppl [17:31:15] akosiaris: then rewrite a form that covers all the info i need and reduces the back and forth accordingly [17:31:18] but we could rephrase it [17:31:30] because what i have now is what ive gotten to with no one bothering to give feedback when i ask [17:31:46] cause the best feedback is when you don't ask :P [17:31:53] (03PS1) 10Yurik: Enabled graph ext on all wikis except wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209009 [17:32:06] anyway, I can rewrite that form, I have to review the one for vm-requests [17:32:21] well, just please check with me, since i have to deal with it [17:32:21] so I'll do both. I have to answer the same question for VMs anyway [17:32:26] you can handle the vm requests however you like [17:32:26] ok [17:32:41] but not today.
20:30 here, I am signing off [17:33:46] RECOVERY - puppet last run on mw2031 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:33:54] akosiaris: sorry if i was snappy, im in a short mood today and didnt sleep [17:34:00] not your fault, im just being too snappish. [17:34:02] I dunno what it is with admin/data/data.yaml but unless I turn syntax highlighting off it makes vim really, really mad [17:34:35] my brain decided to wake me up every couple of hours for no reason. [17:35:03] Coren: it's something with the default script that highlights yaml [17:35:53] (03CR) 10MaxSem: [C: 031] Enabled graph ext on all wikis except wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209009 (owner: 10Yurik) [17:35:55] i agree we can rephrase the vlan question into something that is more clear and still provides the information required [17:36:00] Coren: http://stackoverflow.com/questions/20663169/vim-really-slow-with-long-yaml [17:36:13] Coren: https://github.com/stephpy/vim-yaml [17:42:14] (03PS2) 10Yuvipanda: zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 [17:42:38] !log yurik Synchronized php-1.26wmf4/extensions/Graph: Cherrypicked Graph ext 209004 (duration: 00m 20s) [17:42:46] Logged the message, Master [17:42:52] (03CR) 10Yuvipanda: "Set the default only for labs, so it'll fail in prod if hiera isn't set for some reason (which is correct behaviour, I think)" [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda) [17:43:36] !log yurik Synchronized php-1.26wmf3/extensions/Graph: Cherrypicked Graph ext 209004 (duration: 00m 16s) [17:43:40] Logged the message, Master [17:44:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [17:45:09] (03CR) 10Yurik: [C: 032] Enabled graph ext on all wikis except wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209009 (owner: 10Yurik) [17:47:23] 
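Coren's slow-vim-on-YAML problem above is the one the two linked threads describe; a hypothetical `~/.vimrc` workaround sketch (both settings are suggestions based on those links, untested against admin/data/data.yaml):

```vim
" Possible workaround for slow YAML syntax highlighting on large files.
set regexpengine=1      " force the old regex engine, which handles
                        " yaml.vim's patterns much faster on big files
autocmd FileType yaml setlocal foldmethod=manual  " skip re-folding on every edit
```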
(03PS1) 10coren: stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) [17:47:35] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [17:48:03] (03CR) 10jenkins-bot: [V: 04-1] stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) (owner: 10coren) [17:48:49] (03Merged) 10jenkins-bot: Enabled graph ext on all wikis except wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209009 (owner: 10Yurik) [17:49:54] 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10RobH) 3NEW a:3RobH [17:50:19] 6operations, 10hardware-requests: Eqiad: 1 hardware access request for puppetmaster service scale out - https://phabricator.wikimedia.org/T98166#1261200 (10RobH) [17:50:23] 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261199 (10RobH) [17:50:26] !log yurik Synchronized wmf-config/InitialiseSettings.php: Enable graph extension on all wikis except wikidata (duration: 00m 19s) [17:50:34] Logged the message, Master [17:50:40] 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10RobH) [17:50:42] 6operations, 5Patch-For-Review: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1261216 (10RobH) [17:50:43] 6operations, 10hardware-requests: Eqiad: 1 hardware access request for puppetmaster service scale out - https://phabricator.wikimedia.org/T98166#1261211 (10RobH) 5Open>3Resolved allocating server rhodium: Dell PowerEdge R610, dual Intel Xeon X5647, 16 GB Memory resolving this request. Setup of system is... [17:50:55] hi ottomata [17:51:01] I made it work by default on labs :D [17:51:46] oh? 
[17:51:58] ah new patch [17:52:11] hm ok that's fine [17:52:44] (03PS2) 10coren: stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) [17:52:46] (03CR) 10Ottomata: [C: 032] zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda) [17:52:52] yuvipanda: +2 didn't merge [17:53:20] 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261260 (10RobH) [17:53:22] 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1261262 (10coren) All set for the 3-day period set to end May 8. [17:54:39] ottomata: haha :D don’t do that :P you should +1 and not merge [17:54:45] +2 == merge is how it works on all other repos... [17:55:04] i thought +2 is looks good go ahead and merge. and submit is real merge [17:55:37] (03CR) 10Ottomata: [C: 031] zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda) [17:55:50] ottomata: nah, if you +2 it you see it through, I think... [17:55:54] it’s not codified anywhere tho [17:56:01] greg-g, it's alive :) [17:56:04] whaatevaahhh ok +1ed :_) [17:56:05] :) [17:56:34] ottomata: :D I’ll merge it later today? [17:56:43] ottomata: what’s the failure mode if ZK fails, btw? analytics cluster grinds to a halt? [17:56:46] just checking! [17:56:58] kafka explodes, thats all [17:57:45] yuvipanda: i think if zk dies, or if brokers get misconfigured with bad zk host data, they won't be able to elect any leaders, and will stop accepting produce requests [17:57:58] MaxSem, akosiaris thank you for your help!
I will write the announcement about the new capability soon [17:58:21] i don't think there will be any consistency problems though, just data will stop flowing in from varnishkafkas [17:58:43] ottomata: alright. I am sure several people will kill me if that happens, so I’ll puppet-compiler this and do it carefully :) [17:59:22] (03CR) 10Dzahn: [C: 031] stat1002/1003 access for sniedzielski [puppet] - 10https://gerrit.wikimedia.org/r/209010 (https://phabricator.wikimedia.org/T97866) (owner: 10coren) [17:59:24] (03PS1) 10RobH: setting rhodium install parameters [puppet] - 10https://gerrit.wikimedia.org/r/209014 [17:59:52] well, yuvipanda good news is, daemons aren't subscribed (better double check with zk) [18:00:09] (03CR) 10RobH: [C: 032] setting rhodium install parameters [puppet] - 10https://gerrit.wikimedia.org/r/209014 (owner: 10RobH) [18:00:10] aaah [18:00:10] so, if you make a bad config change, they shouldn't pick it up unless you manually restarted daemons [18:00:10] twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150505T1800). [18:00:15] ottomata: ah, cool. [18:00:27] ottomata: I’m going to go to a visa interview now tho [18:00:32] k [18:00:33] ok deployment time [18:00:40] ottomata: I’ll brb when the world ends [18:00:44] or when the interview finishes [18:00:46] whichever is first [18:00:53] hmm, zk is subscribed [18:00:55] kafkas aren't [18:01:26] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1261409 (10coren) @ottomata: All clear from you, since this is a stats server? [18:01:49] yuvipanda: good luck! 
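The failure mode ottomata describes — no ZooKeeper means no leader election, so the Kafka brokers stop accepting produce requests — is why a cheap liveness probe is worth running before touching daemon configs. A minimal sketch using ZooKeeper's standard `ruok` four-letter command (the host in the usage comment is a placeholder, not a real production node):

```python
import socket

def zk_four_letter(host, port=2181, cmd=b"ruok", timeout=2.0):
    """Send a ZooKeeper four-letter-word command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(cmd)
        return conn.recv(512)

def zk_is_ok(reply):
    # A healthy server answers the literal bytes b"imok"; anything else
    # (including an empty reply from a dead server) counts as unhealthy.
    return reply == b"imok"

# Usage against a hypothetical ensemble member:
#   zk_is_ok(zk_four_letter("zookeeper1001.example"))
```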
[18:02:47] 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261451 (10RobH) [18:03:07] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1261454 (10coren) >>! In T97866#1253700, @bearND wrote: > While you're at it, please also add him as a member of https://wikitech.wikimedia.org/wiki/Nova_Resourc... [18:03:30] 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10RobH) [18:04:38] (03PS2) 10Ori.livneh: wmgUseBits: false for all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 [18:04:40] (03PS1) 10Ori.livneh: wmgUseBits: false for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209016 [18:04:53] (03CR) 10jenkins-bot: [V: 04-1] wmgUseBits: false for all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 (owner: 10Ori.livneh) [18:06:05] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting stat1002/1003 access for sniedzielski - https://phabricator.wikimedia.org/T97866#1261518 (10Ottomata) All clear from me, but it is not clear what services are being asked for. See: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#A... [18:06:48] (03PS1) 1020after4: Group1 wikis to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209018 [18:08:01] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1261544 (10Ottomata) Ja, agreed. One nice thing about this approach, is the statsd_sender thing reads whatever from stdin, so as long as whatever we come up with can do the same pipe thing, e.g. varn... 
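fgiunchedi's point in the T83580 comment above — anything countable as lines can keep flowing through the same pipe once udp2log is gone, as long as the replacement reads stdin like statsd_sender does — can be sketched roughly like this (the metric name and endpoint are made up for illustration):

```python
import socket

def statsd_counter(metric, count):
    """Render a statsd counter datagram, e.g. b'reqstats.lines:3|c'."""
    return f"{metric}:{count}|c".encode()

def pipe_to_statsd(stream, sock, addr=("localhost", 8125),
                   metric="reqstats.lines"):
    """Count the lines arriving on a pipe (stdin, in the udp2log-style
    setup) and emit a single counter datagram; returns the line count."""
    n = sum(1 for _ in stream)
    sock.sendto(statsd_counter(metric, n), addr)
    return n

# A real sender would be invoked as the tail of a pipeline, e.g.:
#   some-log-producer | python this_sketch.py
# with pipe_to_statsd(sys.stdin, socket.socket(socket.AF_INET, socket.SOCK_DGRAM))
```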
[18:08:02] Need moar tea [18:09:12] (03PS1) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 [18:09:35] wow for once I have something good to say about gerrit: it does a good job of matching pretty-printed json against minimized json: https://gerrit.wikimedia.org/r/#/c/209018/1/wikiversions.json,cm (smart diff algorithm!) [18:10:08] (03CR) 1020after4: [C: 032] Group1 wikis to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209018 (owner: 1020after4) [18:10:16] (03Merged) 10jenkins-bot: Group1 wikis to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209018 (owner: 1020after4) [18:12:03] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: Group1 wikis to 1.26wmf4 [18:12:44] (03CR) 10Krinkle: [C: 031] "Should this be part of a package shared with prod and other uses of MediaWiki?" [puppet] - 10https://gerrit.wikimedia.org/r/205553 (https://phabricator.wikimedia.org/T92712) (owner: 10Filippo Giunchedi) [18:17:23] 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, and 2 others: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1261805 (10daniel) @JanZerebecki: Redirects are serialized like this: {"entity":"Q23","redirect":"Q42"} Old style... [18:17:29] ori: Can you rename wgAssetsHost to $wmgAssetsHost? I spent a minute looking for it in mediawiki-core after realising it was added to wmf-config only. [18:17:36] (or something like that) [18:18:20] (03PS1) 10Ottomata: [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 [18:18:59] ori: here now [18:19:00] csteipp, all's good, new ver deployed, thx to akosiaris [18:19:03] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 (owner: 10Ottomata) [18:19:06] you decided to split it after all? 
Krinkle: I think the AssetsHost thing is temporary anyways and will be gone later in the week [18:19:55] (03CR) 10Jdlrobson: [C: 031] "How can I get this deployed? Should I schedule it for SWAT ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson) [18:19:59] yurik_: Cool. For tracking, do add a link to the patchset to the bug. [18:20:04] not sure, though. I think wmgUseBits is temporary at least [18:20:17] well, that variable is named just fine :) [18:21:18] PROBLEM - puppet last run on mw1137 is CRITICAL Puppet has 1 failures [18:22:01] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1261827 (10coren) 3NEW [18:22:42] ori: seems your pretty printing patch didn't work [18:23:02] (03CR) 10Alex Monk: "Or ask Greg for a specific deployment window, but yeah." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson) [18:23:34] 6operations, 10Datasets-General-or-Unknown, 10Wikidata, 3Wikidata-Sprint-2015-04-07, and 2 others: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1261851 (10daniel) Btw, if someone can tell me where to find a full history dump of wikidata, I'd be happy to check t... [18:23:40] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1261852 (10Slaporte) >>! In T85141#1256595, @JohnLewis wrote: > Still pending an approval from @slaporte (or anyone else from legal who deals with data release). You can proceed wit...
[18:23:58] (03PS2) 10Ottomata: [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 [18:24:38] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 (owner: 10Ottomata) [18:25:28] PROBLEM - puppet last run on rhodium is CRITICAL Puppet has 15 failures [18:25:57] twentyafterfour: what didn't work? [18:26:14] it didn't pretty print [18:26:23] it wasn't expected to [18:26:26] oh [18:26:30] it was expected to just not break :) [18:26:36] and to pretty-print again once we update tin [18:26:36] ok then I guess it worked ;) [18:26:47] PROBLEM - DPKG on rhodium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:26:55] Sorry I didn't follow the discussion closely enough [18:27:08] (03PS3) 10Ori.livneh: wmgUseBits: false for all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 [18:27:14] twentyafterfour: np, thanks for the reviews! [18:27:31] bblack: yeah, why not. [18:27:36] ready for all-but-enwiki? [18:27:53] ori: gladly, any time. though my local testing didn't catch the bug ;) [18:28:24] twentyafterfour: I guess you could post-process the json file with `python -m json.tool` to make it pretty before pushing up to gerrit for now [18:28:36] bd808: i thought about that, yeah [18:28:45] bd808: gerrit's diff screen actually handles it ok [18:28:52] cool [18:29:00] it formats the diff so that I can read it well enough [18:29:12] (03PS2) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 [18:29:40] it even identified the changes between the pretty and non-pretty version, and highlighted just the version number digit that changed for each wiki [18:29:54] the first time I've ever been impressed with gerrit [18:30:27] gerrit praise == 1; gerrit gripes OVERFLOW [18:31:28] ori: yes [18:31:32] engage! 
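bd808's `python -m json.tool` suggestion above amounts to a two-line round-trip through the `json` module — parse the minimized file, re-dump it with indentation. A sketch with a made-up two-wiki fragment of `wikiversions.json`:

```python
import json

# Minimized input, as the sync tooling currently writes it
# (the wiki/version entries are illustrative, not the real file).
raw = '{"enwiki":"php-1.26wmf3","dewiki":"php-1.26wmf4"}'

# json.tool does essentially this: load, then dump with an indent.
pretty = json.dumps(json.loads(raw), indent=4, sort_keys=True)
print(pretty)
```

Running this prints the same data one key per line, which is what makes Gerrit's diff of the pretty vs. minimized versions line up so well.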
[18:32:05] 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1261910 (10Dzahn) @qgil Does he have to sign L2? [18:32:13] (03CR) 10Ori.livneh: [C: 032] wmgUseBits: false for all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 (owner: 10Ori.livneh) [18:32:19] (03Merged) 10jenkins-bot: wmgUseBits: false for all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208990 (owner: 10Ori.livneh) [18:33:44] !log ori Synchronized wmf-config/InitialiseSettings.php: I2ee277293: wmgUseBits: false for all but enwiki (duration: 00m 13s) [18:33:51] Logged the message, Master [18:35:01] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1261923 (10csteipp) Using sanitizme.pl seems like the right way to redact this. If that was run, then should be ok for security bugs and deleted comments. [18:36:21] ori: Can you rename wgAssetsHost to $wmgAssetsHost? I spent a minute looking for it in mediawiki-core after realising it was added to wmf-config only. <-- yes [18:36:22] ori: can already see inbound bits traffic drop for esams: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Bits+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report [18:37:09] twentyafterfour: there is an issue with wikidata [18:37:24] 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1261931 (10Dzahn) fwiw, i did the LDAP group part [terbium:~] $ ldaplist -l group nda | grep chill member: uid=multichill,ou=people,dc=wikimedia,dc=org but the NDA volunteer process is (... [18:37:25] and plummeting again now of course: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Bits+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report [18:37:36] can we put wikidata back on 1.26wmf3 while i investigate? 
[18:37:38] I suspect eqiad won't see such dramatic swings until enwiki goes [18:37:55] * aude tries to reproduce [18:38:08] RECOVERY - puppet last run on mw1137 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:38:09] 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1261933 (10Dzahn) @multichill would you mind signing L2 anyways? It has been approved by legal after you signed your original paper NDA afaict. [18:38:50] (03PS2) 10Ori.livneh: wmgUseBits: false for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209016 [18:38:52] (03PS1) 10Ori.livneh: Rename $wgAssetsHost to $wmgAssetsHost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209026 [18:38:57] maybe can quickly fix instead [18:39:07] (03CR) 10Ori.livneh: [C: 032] Rename $wgAssetsHost to $wmgAssetsHost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209026 (owner: 10Ori.livneh) [18:39:27] bblack: i feel good about going ahead with enwiki if you're up for it [18:40:01] yup may as well. 
the primary caching is physically separate in practice anyways [18:40:17] (03CR) 10Ori.livneh: [C: 032] wmgUseBits: false for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209016 (owner: 10Ori.livneh) [18:40:43] (03PS3) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 [18:41:07] (03PS3) 10Ottomata: [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 [18:41:44] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize HA YARN ResourceManager for Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/209021 (owner: 10Ottomata) [18:42:17] (03Merged) 10jenkins-bot: Rename $wgAssetsHost to $wmgAssetsHost [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209026 (owner: 10Ori.livneh) [18:42:18] ori: my plan on the cached-bits-refs issues is basically wait for the bits traffic graphs to plane out a bit (probably within a day or two), then look at planning and/or executing some varnish bans with date cutoffs to get rid of the tail end. [18:42:19] (03Merged) 10jenkins-bot: wmgUseBits: false for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209016 (owner: 10Ori.livneh) [18:43:41] !log ori Synchronized wmf-config: Ia98fc4c5d: wmgUseBits: false for enwiki (duration: 00m 17s) [18:43:50] Logged the message, Master [18:44:27] (and then we get into the "wtf is left" investigation at lower priority) [18:44:39] (03PS4) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 [18:44:57] PROBLEM - Host rhodium is DOWN: PING CRITICAL - Packet loss = 100% [18:45:18] 6operations, 5Patch-For-Review: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1261955 (10chasemp) serious question, have we ever considered going masterless? With Etcd as a secret store I think it should be totally doable and allows faster rollouts our the infrast... 
[18:45:45] grrrrr [18:45:57] wtf, whoever decommissioned rhodium didn't do it right [18:46:05] it wasn't wiped, and it wasn't pulled from icinga.. wtf [18:46:06] <- not it! [18:46:13] ocg system. [18:46:22] I don't feel like digging to blame, but annoying =P [18:47:02] (03PS1) 10Ori.livneh: Update $wgULSFontRepositoryBasePath for post-bits world [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209027 [18:47:18] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 [18:47:35] (03CR) 10Ori.livneh: [C: 032] Update $wgULSFontRepositoryBasePath for post-bits world [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209027 (owner: 10Ori.livneh) [18:47:40] (03Merged) 10jenkins-bot: Update $wgULSFontRepositoryBasePath for post-bits world [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209027 (owner: 10Ori.livneh) [18:47:47] PROBLEM - puppet last run on analytics1037 is CRITICAL Puppet last ran 1 day ago [18:48:24] ori: not /w/static/ on 209027? [18:48:50] bblack: no, just static [18:48:59] both work but /w/static/ is for back-compat [18:49:16] example: https://en.wikipedia.org/static/current/extensions/UniversalLanguageSelector/resources/css/ext.uls.buttons.css [18:49:18] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:49:28] RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0] [18:49:41] !log ori Synchronized wmf-config/CommonSettings.php: I5978a3910: Update $wgULSFontRepositoryBasePath for post-bits world (duration: 00m 18s) [18:49:50] ori: the hash stuff assumings /w/static/ [18:49:52] Logged the message, Master [18:49:55] *assumes [18:50:19] bblack: d'oh, you're right [18:50:23] I'll fix it [18:50:27] no no [18:50:58] RECOVERY - puppet last run on analytics1037 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:51:01] (03PS1) 10BBlack: better static-assets regex for now [puppet] -
10https://gerrit.wikimedia.org/r/209028 [18:51:08] no no? [18:51:33] your patch is fine, disregard. but maybe we can add a trailing slash since everything is migrated now? [18:51:54] I think that has to wait for all kinds of objects to fall off the cache first that ref the old paths [18:52:05] until I go muck with forcing them out, I think the cutoff is like 60d? [18:52:11] 30 [18:52:21] but ok, makes sense [18:52:29] (03CR) 10Ori.livneh: [C: 031] better static-assets regex for now [puppet] - 10https://gerrit.wikimedia.org/r/209028 (owner: 10BBlack) [18:52:50] (03CR) 10BBlack: [C: 032 V: 032] better static-assets regex for now [puppet] - 10https://gerrit.wikimedia.org/r/209028 (owner: 10BBlack) [18:53:09] (03PS5) 10Ottomata: Configure YARN HA ResourceManager [puppet/cdh] - 10https://gerrit.wikimedia.org/r/209019 [18:56:04] (03PS1) 10RobH: rhodium had wrong fqdn [puppet] - 10https://gerrit.wikimedia.org/r/209030 [18:56:31] ori: actually I had to go look again, but: 30 is our def TTL, but we don't seem to cap it (probably should) [18:56:34] (03CR) 10RobH: [C: 032] rhodium had wrong fqdn [puppet] - 10https://gerrit.wikimedia.org/r/209030 (owner: 10RobH) [18:56:53] I think the only way it could go higher would be with a backend's header saying to do so [18:59:18] * bblack kick-starts salt, again [19:11:54] PROBLEM - Varnishkafka log producer on cp3030 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [19:12:24] ^ I'm already looking at cp3030 [19:13:04] aude: sorry I didn't see your message, was offline for a few minutes due to power problems. Is there still an issue?
[19:14:24] twentyafterfour: we have a patch coming [19:14:37] i'll want to deploy it asap and can do myself [19:15:53] aude: ok let me know if I can help [19:16:41] sorry I watched the fatalmonitor for a while but I didn't notice any issues [19:16:49] twentyafterfour: it's an exception [19:17:06] schema change not applied yet and code expecting some field to be there [19:17:23] * aude waits for jenkins [19:20:45] ottomata: gonna merge and babysuit patch now [19:20:48] err [19:20:49] babysit [19:21:47] aude: I've been trying to come up with an elegant solution to schema changes...for years... and I still can't come up with anything better than the migrations systems most of the big frameworks are using these days. We should probably adopt something like that as well. [19:21:54] RECOVERY - Varnishkafka log producer on cp3030 is OK: PROCS OK: 1 process with command name varnishkafka [19:22:09] twentyafterfour: it is somewhat complicated since springle has to do most of them [19:22:16] if it's modifying a large existing table [19:22:34] would be nice if more people could handle them [19:23:15] yeah. 
There doesn't seem to be a really good solution for large-scale deployment of SQL schema changes [19:23:38] !log disabled puppet on zookeeper hosts [19:23:46] a good dba just can't be automated it seems ;) [19:23:47] Logged the message, Master [19:23:47] (03PS3) 10Yuvipanda: zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 [19:24:04] (03CR) 10Yuvipanda: [C: 032] zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda) [19:24:31] adding tables is easy :) [19:24:39] but not modifying [19:24:49] yuvipanda: ok [19:25:17] ottomata: disabled puppet on the zookeeper hosts, and have one of the kafka servers open so I can see what puppet drags in [19:25:47] (03CR) 10Yuvipanda: [V: 032] zookeeper: Refactor roles to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/208849 (owner: 10Yuvipanda) [19:26:47] (03PS1) 10coren: Labs: Add jamvm explicitly on all flavours [puppet] - 10https://gerrit.wikimedia.org/r/209038 (https://phabricator.wikimedia.org/T98195) [19:27:26] ottomata: yup, all nop :D [19:27:28] wheee [19:27:38] ottomata: thanks for the review :) [19:27:49] ALTER table `dba` ADD column backup [19:28:08] great, thanks yuvipanda :) [19:28:43] (03PS6) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 [19:29:02] ottomata: do we have a HDFS role in prod that I can use? [19:29:12] in puppet you mean? [19:29:15] to use in labs? [19:29:15] ottomata: yeah [19:29:17] yeah [19:29:18] yes [19:29:31] * aude shall deploy now [19:29:43] 6operations, 7Shinken: Shinken hostname column is not large enough - https://phabricator.wikimedia.org/T1362#1262124 (10hashar) [19:29:45] ottomata: how hard will it be, to, say, have a 5 node cluster? [19:30:02] yuvipanda: not hard, but i haven't done work to make it work with hiera yet [19:30:10] ottomata: ah, hmm.
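The "migrations systems most of the big frameworks are using" that twentyafterfour mentions above mostly reduce to an ordered list of DDL statements plus a bookkeeping table of which ones have already run. A minimal sketch (sqlite3 and the table/column names are purely illustrative; this solves the easy part only, not springle's hard part of online ALTERs on huge replicated tables):

```python
import sqlite3

# Ordered (name, DDL) pairs. Names are recorded so each migration runs once.
MIGRATIONS = [
    ("001_create_page", "CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)"),
    ("002_add_page_len", "ALTER TABLE page ADD COLUMN page_len INTEGER"),
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    for name, ddl in MIGRATIONS:
        if name in applied:
            continue  # already ran on this database
        conn.execute(ddl)
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: already-applied migrations are skipped
```

The bookkeeping is trivial; the reason "a good dba just can't be automated" is that on production-sized tables the ALTER itself has to be rolled through replicas without locking, which no migration runner like this addresses.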
[19:30:12] looking for instructions [19:30:14] PROBLEM - puppet last run on cp4019 is CRITICAL puppet fail [19:30:26] ottomata: the zookeeper patch was essentially making it work with hiera :) [19:30:43] PROBLEM - puppet last run on cp4001 is CRITICAL puppet fail [19:30:44] PROBLEM - puppet last run on cp3041 is CRITICAL puppet fail [19:30:51] 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1262133 (10Multichill) >>! In T87097#1261933, @Dzahn wrote: > @multichill would you mind signing L2 anyways? It has been approved by legal after you signed your original paper NDA afaict. Y... [19:30:51] well, the hadoop role kinda works the same way [19:30:53] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL 7.14% of data above the critical threshold [500.0] [19:30:53] PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail [19:30:53] PROBLEM - puppet last run on cp3006 is CRITICAL puppet fail [19:30:54] PROBLEM - puppet last run on cp3004 is CRITICAL puppet fail [19:30:54] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [19:30:58] expecting global vars from labsconsole interface [19:31:04] PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail [19:31:12] mutante: good one. [19:31:18] joal: can you find that really nice wiki page from qchris on how to set up hadoop in labs? [19:31:24] PROBLEM - puppet last run on cp4005 is CRITICAL puppet fail [19:31:40] ottomata: For sure, give me a minute [19:31:40] ah, found it! [19:31:42] https://wikitech.wikimedia.org/wiki/User:QChris/TestClusterSetup [19:31:43] nm [19:32:04] PROBLEM - puppet last run on cp4018 is CRITICAL puppet fail [19:32:04] PROBLEM - puppet last run on cp3003 is CRITICAL puppet fail [19:32:33] PROBLEM - puppet last run on cp3005 is CRITICAL puppet fail [19:32:54] bblack: ^ ?
[19:33:23] PROBLEM - puppet last run on cp3009 is CRITICAL puppet fail [19:33:46] ESC[1;31mError: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item zookeeper_hosts in any Hiera data file and no default supplied at /etc/puppet/manifests/role/analytics/kafka.pp:109 on node cp3041.esams.wmnet [19:33:50] who broke it? :P [19:33:53] PROBLEM - puppet last run on cp3035 is CRITICAL puppet fail [19:33:59] bblack: ugh [19:34:01] bblack: that’s me. [19:34:03] ok [19:34:05] PROBLEM - puppet last run on cp3031 is CRITICAL puppet fail [19:34:08] bblack: but, I have it in eqiad.yaml [19:34:15] so that should work... [19:34:15] but these hosts are not in eqiad [19:34:18] aaarggh [19:34:19] I see [19:34:21] lol [19:34:23] I didn’t know it was cross dc [19:34:32] let me move that then [19:34:36] I tested on a eqiad kafka host [19:34:38] * aude waits for jenkins [19:34:41] but I guess that’s the broker [19:35:14] PROBLEM - puppet last run on cp3034 is CRITICAL puppet fail [19:35:34] PROBLEM - puppet last run on cp3049 is CRITICAL puppet fail [19:35:47] !log rebooting cp3030 ... 
[19:35:56] (03PS1) 10Yuvipanda: kafka: Move hiera data for zookeepr hosts to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/209041 [19:35:58] Logged the message, Master [19:36:19] (03CR) 10Yuvipanda: [C: 032 V: 032] kafka: Move hiera data for zookeepr hosts to common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/209041 (owner: 10Yuvipanda) [19:36:24] PROBLEM - puppet last run on cp4020 is CRITICAL puppet fail [19:36:34] PROBLEM - puppet last run on cp3019 is CRITICAL puppet fail [19:37:23] PROBLEM - puppet last run on cp4002 is CRITICAL puppet fail [19:37:35] PROBLEM - puppet last run on cp4009 is CRITICAL puppet fail [19:37:42] uhm [19:37:44] PROBLEM - puppet last run on cp3039 is CRITICAL puppet fail [19:37:59] (03PS1) 10Mattflaschen: Flow should use VE by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209042 (https://phabricator.wikimedia.org/T98168) [19:38:03] PROBLEM - puppet last run on cp4012 is CRITICAL puppet fail [19:38:04] bblack: can’t I get to these hosts from iron? [19:38:13] PROBLEM - puppet last run on cp3018 is CRITICAL puppet fail [19:38:13] PROBLEM - puppet last run on cp3021 is CRITICAL puppet fail [19:38:14] PROBLEM - puppet last run on cp3047 is CRITICAL puppet fail [19:38:14] PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:12] lol [19:39:20] and yes [19:39:46] try hooft for esams maybe? that's what I use, but it shouldn't be necessary [19:39:54] PROBLEM - puppet last run on cp4010 is CRITICAL puppet fail [19:39:54] PROBLEM - puppet last run on cp3048 is CRITICAL puppet fail [19:39:54] PROBLEM - puppet last run on cp3017 is CRITICAL puppet fail [19:40:05] PROBLEM - puppet last run on cp3038 is CRITICAL puppet fail [19:40:14] PROBLEM - puppet last run on cp3007 is CRITICAL puppet fail [19:40:34] PROBLEM - puppet last run on cp4016 is CRITICAL puppet fail [19:40:41] cp3030 is down? 
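The cp30xx/cp40xx failures above ("Could not find data item zookeeper_hosts in any Hiera data file") and the fix in 209041 come down to hiera's lookup hierarchy: a key defined only in a per-datacenter file is invisible to hosts in other sites, while common.yaml is consulted for every host. A rough before/after sketch (the file names follow the pattern discussed in the log, but the exact hierarchy layout and the zk100x hostnames are assumptions for illustration, not taken from the repo):

```yaml
# hieradata/eqiad.yaml -- before: only consulted for eqiad hosts, so the
# esams/ulsfo caches evaluating kafka.pp never resolved the key.
#
# hieradata/common.yaml -- after: consulted for every host, which is why
# moving the key here cleared the puppet failures on the remote caches.
# (zk1001-1003 are hypothetical hostnames.)
zookeeper_hosts:
  - zk1001.eqiad.wmnet:2181
  - zk1002.eqiad.wmnet:2181
  - zk1003.eqiad.wmnet:2181
```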
[19:41:04] PROBLEM - puppet last run on cp4011 is CRITICAL puppet fail [19:41:16] no, he rebooted it [19:41:37] mutante: thanks, i missed that log line [19:41:56] !log aude Synchronized php-1.26wmf4/extensions/Wikidata: Fix usage tracking issue on Wikidata (duration: 00m 40s) [19:42:04] Logged the message, Master [19:42:13] RECOVERY - Host cp3030 is UPING OK - Packet loss = 0%, RTA = 88.96 ms [19:42:53] (03PS1) 10Thcipriani: Deployment group for trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/209045 (https://phabricator.wikimedia.org/T97775) [19:43:52] heh "UPING OK" -. bad de-dupe regex on icinga alerts? [19:44:48] !log aude Synchronized php-1.26wmf4/extensions/Wikidata: Fix usage tracking issue on Wikidata - with submodule update (duration: 00m 33s) [19:44:53] Logged the message, Master [19:45:20] * aude is done [19:47:14] RECOVERY - HTTP 5xx req/min on graphite2001 is OK Less than 1.00% above the threshold [250.0] [19:47:14] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:47:14] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:47:23] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:47:24] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:48:14] RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:48:19] 7Puppet, 6operations, 10Beta-Cluster, 5Patch-For-Review: Trebuchet on deployment-bastion: wrong group owner - https://phabricator.wikimedia.org/T97775#1262210 (10thcipriani) Pushed my patch up and attached to this bug. As I was reviewing this patch, I actually think it may be a better idea to have the dep... 
[19:48:44] RECOVERY - puppet last run on cp4001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:48:44] RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:48:51] 7Puppet, 6operations, 10Beta-Cluster, 5Patch-For-Review: Trebuchet on deployment-bastion: wrong group owner - https://phabricator.wikimedia.org/T97775#1262216 (10thcipriani) [19:48:54] RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:48:54] RECOVERY - puppet last run on cp3005 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:49:26] (03PS1) 10BBlack: purge intel-microcode, will remove after [puppet] - 10https://gerrit.wikimedia.org/r/209048 [19:49:33] RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:49:44] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:50:04] RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:50:04] RECOVERY - puppet last run on cp3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:51:14] (03CR) 10BBlack: [C: 032] purge intel-microcode, will remove after [puppet] - 10https://gerrit.wikimedia.org/r/209048 (owner: 10BBlack) [19:51:34] RECOVERY - puppet last run on cp3034 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:51:53] RECOVERY - puppet last run on cp3035 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:52:04] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:52:10] hmm my deploy earlier didn't get logged [19:52:25] where were you, morebots ? 
[19:52:59] weiird [19:53:34] RECOVERY - puppet last run on cp3049 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:53:42] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: Group1 wikis to 1.26wmf4 (actual time 18:12 UTC) [19:53:49] Logged the message, Master [19:53:57] there we go [19:54:04] RECOVERY - puppet last run on cp3039 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:54:24] RECOVERY - puppet last run on cp4020 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:54:24] RECOVERY - puppet last run on cp4012 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:54:34] RECOVERY - puppet last run on cp3019 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:55:23] RECOVERY - puppet last run on cp4002 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:55:43] RECOVERY - puppet last run on cp4009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:23] RECOVERY - puppet last run on cp3021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:23] RECOVERY - puppet last run on cp3018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:23] RECOVERY - puppet last run on cp3047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:25] RECOVERY - puppet last run on cp4011 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:57:53] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:57:53] RECOVERY - puppet last run on cp3017 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [19:57:53] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:58:04] RECOVERY - puppet last run on cp3038 is OK Puppet is 
currently enabled, last run 1 minute ago with 0 failures [19:58:04] RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:58:25] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:03:20] 10Ops-Access-Requests, 6operations: Add niedzielski release-mobile and deployment-prep project - https://phabricator.wikimedia.org/T98179#1262260 (10Krenair) [20:03:24] 10Ops-Access-Requests, 6operations: Add niedzielski release-mobile and deployment-prep project - https://phabricator.wikimedia.org/T98179#1261519 (10Krenair) deployment-prep (labs) is very separate to release-mobile (prod)... [20:03:39] 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile and deployment-prep project - https://phabricator.wikimedia.org/T98179#1262264 (10Krenair) [20:06:03] PROBLEM - High load average on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0] [20:06:22] (03PS1) 10Ori.livneh: cpufrequtils: ensure configure governor is in use [puppet] - 10https://gerrit.wikimedia.org/r/209049 [20:06:24] ^ bblack [20:08:50] 6operations, 10Traffic: Fix cpufrequtils issues on jessie - https://phabricator.wikimedia.org/T98203#1262298 (10BBlack) 3NEW a:3BBlack [20:09:19] 6operations, 10Traffic: Reboot caches for kernel 3.19.3 globally - https://phabricator.wikimedia.org/T96854#1262309 (10BBlack) [20:09:19] 6operations, 10Traffic: Fix cpufrequtils issues on jessie - https://phabricator.wikimedia.org/T98203#1262308 (10BBlack) [20:10:19] 6operations, 7Shinken: Shinken hostname column is not large enough - https://phabricator.wikimedia.org/T1362#1262318 (10Dzahn) Try this: # install the ([[ https://addons.mozilla.org/en-US/firefox/addon/stylish/ | Stylish ]], extension [[ https://en.wikipedia.org/wiki/Stylish | about ]]) # install [[ http... 
[20:10:33] (03PS2) 10Krinkle: Add logmsgbot instance for #wikimedia-releng that listens to gallium [puppet] - 10https://gerrit.wikimedia.org/r/197386 (owner: 10Legoktm) [20:10:39] (03CR) 10jenkins-bot: [V: 04-1] Add logmsgbot instance for #wikimedia-releng that listens to gallium [puppet] - 10https://gerrit.wikimedia.org/r/197386 (owner: 10Legoktm) [20:10:51] ori: it's more complicated than that at least for the caches, see ticket above [20:10:58] 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1262326 (10Multichill) >>! In T87097#1262136, @Krenair wrote: > I think L2 is linked when they view Phabricator, it's a restricted visibility object. In the phabricator email, it has a link... [20:11:14] I'm pretty much stuck on that now until we get the non-trunk kernel installed + booted. it's due to land tomorrow or thurs. [20:11:46] hasharMeeting: YuviPanda|food: https://phabricator.wikimedia.org/T1362#1262318 [20:12:20] mutante: :D thank you! [20:12:24] 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1262334 (10Krenair) >>! In T87097#1262326, @Multichill wrote: >>>! In T87097#1262136, @Krenair wrote: >> I think L2 is linked when they view Phabricator, it's a restricted visibility object.... [20:12:33] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:12:51] ori: probably after the kernel reboot, I'll just drop the hacky cpufrequtils package there and set it some simpler way. you can do it with a bash one-liner, after all. [20:13:16] why drop the package? 
[20:13:43] because its whole purpose in these machines' lives is to set the governor, and it can't [20:14:06] unless I go hack/fix it [20:14:13] RECOVERY - DPKG on labmon1001 is OK: All packages OK [20:14:23] (03PS7) 10Yuvipanda: [WIP]mesos: Add simple mesos module [puppet] - 10https://gerrit.wikimedia.org/r/208483 [20:14:45] but why bother when: "for x in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance >$x; done" will suffice? [20:16:21] bblack: isn't it installed by default? [20:16:35] and if so, doesn't that mean that you risk having the service configured to set one policy, and puppet to set another? [20:16:36] oh, I don't know, on jessie. it wasn't before on precise [20:16:44] 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1262348 (10Dzahn) >>! In T87097#1262326, @Multichill wrote: > In the phabricator email, it has a link to https://phabricator.wikimedia.org/L2, here it's just "L2". Must be very secret, I'm... [20:16:47] I remember having to add it to get them all set [20:17:03] wmf4 launching today? [20:17:17] the service in any case would be default-configured to the right/matching policy, but not doing anything because it's functionally broken anyways [20:18:00] White_Master: wmf4 went to non-Wikipedias today (Commons etc). Will hit Wikipedias tomorrow. [20:18:56] greg-g, oh, thanks.
:) [20:20:44] 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1262372 (10RobH) [20:21:05] 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10RobH) [20:23:24] bblack: the tool will fail if the governor is unavailable, whereas writing it into /proc will fail silently [20:23:28] White_Master: fyi: https://www.mediawiki.org/wiki/MediaWiki_1.26/Roadmap#Schedule_for_the_deployments [20:24:04] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [20:25:38] greg-g, yes, i look this page. I check 'cause i also update my wiki with those versions :P [20:26:48] ori: the tool needs upstream updates or us hacking its initscripts, etc. either way... [20:27:14] postponing until I have a working kernel to even try the alernatives on [20:27:17] *alternatives [20:30:22] (and we could make the manual method not-silent by having puppet check them, too) [20:30:41] but even manually applying "performance" is broken until update->reboot [20:30:57] 6operations, 6WMF-Legal, 6WMF-NDA-Requests: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#1262387 (10Dzahn) @multichill please try viewing that document again. after talking with chasemp i added you to the following group: https://phabricator.wikimedia.org/project/profile/974/ [20:40:10] 10Ops-Access-Requests, 6operations, 10Analytics: Access to stat1003 for jdouglas - https://phabricator.wikimedia.org/T98209#1262408 (10Krenair) [20:51:45] 6operations: install/setup server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1262429 (10RobH) a:5RobH>3akosiaris Alex, I wasn't sure if this needed trusty or jessie, so I put trusty on initially. I then committed the dhcp file change for jessie, so if trusty was wrong,... 
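bblack's sysfs one-liner and ori's objection that bare writes "fail silently" combine naturally into a write-then-verify check. A sketch (in Python rather than the shell one-liner; the sysfs_root parameter is an assumption added purely so the logic can be exercised outside a real /sys):

```python
from pathlib import Path

def set_governor(want, sysfs_root="/sys"):
    """Write `want` into every CPU's scaling_governor and read it back,
    since `echo performance > .../scaling_governor` succeeds silently
    even when the kernel rejects or ignores the requested governor."""
    pattern = "devices/system/cpu/cpu*/cpufreq/scaling_governor"
    for node in Path(sysfs_root).glob(pattern):
        node.write_text(want)
        got = node.read_text().strip()
        if got != want:
            raise RuntimeError(f"{node}: wanted {want!r}, kernel kept {got!r}")
```

This is the "make the manual method not-silent" idea from the discussion: puppet (or any wrapper) gets a loud failure instead of a no-op when the governor is unavailable, as on the broken jessie kernels.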
[20:52:01] 6operations: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1262431 (10RobH) [20:53:11] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1262436 (10chasemp) @johnlewis and @dzahn asked me to take look over the dumped DB from a sensitive information perspective. A few thoughts: * we should wipe the profile_setting tab... [20:56:26] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1262455 (10Cmjohnson) Hi Christopher, This is Regarding the Case Number:4651331170 I have made arrangements to ship a replacement System board along with an onsite engineer. Part... [20:58:26] 6operations: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1262463 (10yuvipanda) So current puppetmasters are all precise, and on labs everytime we tried a trusty puppetmaster something or the other has blown up. Tread carefully, but it would indeed... [21:00:04] rmoen, kaldari: Respected human, time to deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150505T2100). Please do the needful. [21:10:32] 6operations: Replace mysql with mariadb on virt1000 (et al) - https://phabricator.wikimedia.org/T84470#1262540 (10Andrew) [21:18:39] twentyafterfour or greg-g, today’s train deploy seems to have broken search on wikitech, can you assist? [21:18:50] andrewbogott: ok [21:19:10] twentyafterfour: https://dpaste.de/LNE0 [21:19:59] “Cannot use Hooks as Hooks” <3 [21:20:26] weird .. [21:20:32] ori: legoktm ^ more FormatJson fun? [21:20:37] silver is running php and not hhvm [21:20:57] Krenair: ^ [21:22:20] (03PS1) 10John F. 
Lewis: Use Wiki.svg for wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209126 [21:22:40] andrewbogott: I'll poke it [21:22:52] twentyafterfour: thank you! Is the problem obvious? [21:23:14] I need to find that code... [21:23:26] not sure what changed, but it's not entirely obvious [21:23:58] I don’t know who the search people are these days. You? manybubbles? [21:24:08] what is up? [21:24:35] manybubbles: um… I probably paged you prematurely. twentyafterfour is working on https://dpaste.de/LNE0 (happening on wikitech right now.) [21:25:01] andrewbogott: ah. looks like fun versioning issues [21:25:23] Well, wikitech should only ever run version n-1 [21:26:28] wait so it got updated today to wmf4 and that shouldn't have until tomorrow? [21:27:06] twentyafterfour: I thought it lagged behind production by a point, is all. I could be confused. [21:27:15] It's a group 1 wiki [21:27:26] (03CR) 10Jalexander: [C: 031] "Verifying, WikimaniaWiki wants to update their logo. Ellie asked for my help to get it done quickly (before registration uploads and some " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209126 (owner: 10John F. Lewis) [21:27:27] Krenair: doesn’t that mean… what I said? [21:27:27] ok so then it correctly got updated today with group 1 [21:27:32] Group 1 wikis receive updates on Tuesdays [21:28:30] andrewbogott: there are 3 groups, group 0 1 and 2 ... group 1 is updated on tuesday, group 2 on wednesday (while group zero gets bleeding edge on wednesday) [21:28:41] hm, meanwhile, manybubbles do you know if/how search is monitored on production? should have an alert for this… [21:29:06] Oh, and group 0 doesn’t include production wikipedia, I take it? [21:29:34] 6operations, 10Wikimedia-Mailing-lists: move analytics-internal list to analytics-wmf - https://phabricator.wikimedia.org/T97618#1262675 (10kevinator) 5Open>3declined a:3kevinator In light of difficulty of getting this, I am canceling this task. 
Our team can live with keeping the list as it is. [21:29:37] andrewbogott: we monitor things like slow queries but I forget how we monitor that a search works. we must but I forget it [21:29:46] No. group 0 is a very small set of wikis (test, test2, wm.o and wikidatatest) [21:29:53] ok, I was confused then. [21:30:20] group1 is everything except the wikipedias [21:30:58] andrewbogott, https://www.mediawiki.org/wiki/MediaWiki_1.26/Roadmap is very clear [21:31:03] OK, so wikitech may be showing breakage that’s on deck for wikipedia. [21:31:10] you should probably bookmark it [21:32:04] andrewbogott: yeah ... so we need to fix. [21:32:30] Krenair: no matter how clear it is, it's still slightly confusing because of the way they overlap [21:35:05] (03CR) 10coren: [C: 032] "Trivial package addition." [puppet] - 10https://gerrit.wikimedia.org/r/209038 (https://phabricator.wikimedia.org/T98195) (owner: 10coren) [21:36:16] so ... there is already something in the CirrusSearch namespace named Hooks? 
[21:36:21] somehow [21:39:19] looks like maybe I164ad2dbcf8008b551288cab4c90bcbd0df33024 [21:40:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3555 MB (9% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 382635 MB (26% inode=99%) [21:41:26] 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile and deployment-prep project - https://phabricator.wikimedia.org/T98179#1262691 (10Dzahn) yea, but it got already separated from yet another thing: T97866, so that's good [21:45:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3107 MB (8% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 381630 MB (26% inode=99%) [21:45:49] (03PS1) 10Ori.livneh: Remove wmgUseBits setting, now that the migration is complete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209130 [21:50:03] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [21:50:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 2678 MB (7% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 380722 MB (26% inode=99%) [21:52:34] (03PS1) 10coren: Add tbayer to researchers [puppet] - 10https://gerrit.wikimedia.org/r/209131 (https://phabricator.wikimedia.org/T97916) [21:53:00] 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile and deployment-prep project - https://phabricator.wikimedia.org/T98179#1262713 (10JohnLewis) I've added "niedzielski" to deployment-prep as a member. [21:53:56] Someone with more familiarity with cirrussearch want to take a look? 
I can't find any conflicting names in the cirrussearch code, so "import \Hooks" should be completely fine? [21:54:15] greg-g: Might go over the window a little today. Any objections? [21:54:29] 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile - https://phabricator.wikimedia.org/T98179#1262714 (10coren) [21:54:41] 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile - https://phabricator.wikimedia.org/T98179#1261519 (10coren) I've retitled the task accordingly. [21:55:09] oh, no I'm wrong. . [21:55:11] rmoen: the only issue is if anyone else is waiting, but I don't think so (no need to specifically ping me about that) [21:55:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 2263 MB (6% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 379674 MB (26% inode=99%) [21:55:26] greg-g: ok [21:55:35] twentyafterfour: what's up? [21:55:43] there is a global Hooks and a CirrusSearch namespaced \Hooks [21:55:52] 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski release-mobile - https://phabricator.wikimedia.org/T98179#1262717 (10coren) p:5Triage>3Normal [21:56:03] legoktm: fatal error on wikitech [21:56:08] traceback? 
[21:56:10] https://dpaste.de/LNE0#L2 [21:56:18] (pasted by andrewbogott ) [21:56:23] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60633 bytes in 0.766 second response time [21:57:00] 10Ops-Access-Requests, 6operations, 10Beta-Cluster: Add niedzielski releasers-mobile in production and deployment-prep in labs - https://phabricator.wikimedia.org/T98179#1262720 (10Krenair) [21:57:10] I think https://gerrit.wikimedia.org/r/#/c/207020/ is the culprit [21:57:28] yeah that's not going to work [21:57:51] global Hooks class collides with CirrusSearch/includes/Hooks.php class Hooks [21:58:04] so that needs to alias \Hooks to GlobalHooks or something [22:00:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1866 MB (5% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 378562 MB (26% inode=99%) [22:01:13] andrewbogott: ok, that was it. [22:02:00] twentyafterfour: cool. Not too late for swat, is it? [22:02:39] I don't know, rmoen are you still swatting? [22:03:01] twentyafterfour: Yes. Need a few more minutes [22:03:48] rmoen: we've got another patch that needs to go out, you wanna deploy it? https://gerrit.wikimedia.org/r/#/c/209135/ [22:04:15] twentyafterfour: hm, swat? [22:04:31] ? [22:04:41] thought that was in an hour? (unless I'm getting mixed up) [22:04:46] oh [22:04:52] I'm probably the one mixed up [22:05:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1509 MB (4% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 377274 MB (26% inode=99%) [22:05:20] twentyafterfour: Yeah I could do that. Unless swat is in an hour? [22:05:22] if people want to swat an hour early, I don't mind :) I have a patch in it :p [22:05:30] hah [22:05:46] twentyafterfour: I have to run now — thanks for sorting things! 
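[Editor's note] The wikitech fatal above boils down to a PHP name collision: inside the CirrusSearch namespace, `use Hooks;` fails with "Cannot use Hooks as Hooks" because CirrusSearch's own `Hooks` class already occupies that name, so the global class must be imported under a different alias (`use Hooks as GlobalHooks;`, as twentyafterfour suggests). A minimal Python analogue of the same aliasing fix, with invented names purely for illustration:

```python
# Python stand-in for the PHP fix: import a "global" name under an
# alias so a local definition with the same name can't shadow it.
import json                # the plain import, about to be shadowed
import json as core_json   # aliased import: survives the collision


class json:                # local class that collides with the import,
    pass                   # like CirrusSearch\Hooks vs the global Hooks


# The bare name now refers to the local class; only the alias still
# reaches the original module -- which is why the PHP code needed one.
assert not hasattr(json, "dumps")
print(core_json.dumps({"a": 1}))  # prints {"a": 1}
```

The same logic applies in PHP: the alias target just has to differ from every name already declared in the namespace.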
[22:05:56] andrewbogott: no problem [22:06:00] thanks for catching it [22:10:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1189 MB (3% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 375734 MB (25% inode=99%) [22:15:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 853 MB (2% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 374168 MB (25% inode=99%) [22:18:03] 10Ops-Access-Requests, 6operations, 6Editing-Department, 5Patch-For-Review: Give Neil Quinn access to stats1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97746#1262789 (10Neil_P._Quinn_WMF) @coren, I've signed the document. My public key is: ``` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDA6j1BKp5iz7VLQ... [22:19:12] mutante: who can +2 the dev.wikimedia.org redirect you +1d, https://gerrit.wikimedia.org/r/#/c/199182/ ? [22:19:57] ^^ i can't ssh into lutetium [22:20:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 384 MB (1% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 372795 MB (25% inode=99%) [22:20:29] (03CR) 10Jforrester: "Is this now good to go?" [puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: 10GWicke) [22:20:52] spagewmf_: any ops who is willing to deploy apache [22:20:58] jgage: i think that's because it's fundraising [22:21:13] jgage: lemme try [22:21:16] arr i always forget which hosts are [22:21:21] wish they were in their own subdomain [22:22:54] jgage: yes, that's it. gotta bastion via tellurium [22:23:45] who is kbrownell?
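[Editor's note] The recurring check_disk alerts above are lutetium's root filesystem steadily filling up (1189 MB, 853 MB, 384 MB free). The real check is a Nagios/Icinga plugin; a hedged Python sketch of the same free-space computation, with an illustrative threshold rather than the plugin's actual configuration:

```python
import shutil

# Illustrative critical threshold; the real check_disk thresholds on
# lutetium are not shown in the log.
CRITICAL_PCT = 2.0


def free_percent(path="/"):
    """Percentage of free space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


pct = free_percent("/")
status = "CRITICAL" if pct < CRITICAL_PCT else "OK"
print(f"DISK {status} - free space: / {pct:.0f}%")
```

This mirrors the alert format only loosely; the plugin also reports absolute MB and inode usage per mount.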
[22:23:48] !log apt-get clean on lutetium to free disk space [22:23:56] Logged the message, Master [22:24:06] jgage: ^ 1.3G free now [22:24:07] (03PS4) 10Spage: Redirect dev.wikimedia.org URLs [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) [22:24:16] thanks mutante [22:24:54] jgage: dunno, but somebody who works on FR's civicrm apparently [22:25:10] no match on staff & contractors page [22:25:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1227 MB (3% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 371355 MB (25% inode=99%) [22:26:41] hello "any ops who is willing to deploy apache", can you +2 https://gerrit.wikimedia.org/r/#/c/199182/ [22:27:00] jgage: that's a good point. it should really be mentioned somewhere. the user name _does_ show up in phabricator but only in comments it seems [22:27:32] jgage: found it. kbrownell works for Giant Rabbit https://phabricator.wikimedia.org/T83469#914434 [22:27:46] cool [22:28:01] frack-puppet:manifests/accounts_and_groups.pp [22:28:44] https://www.giantrabbit.com/client-list [22:28:49] ^ lists Wikimedia [22:29:53] <_joe_> mutante: maybe that's true? [22:29:59] (03PS5) 10GWicke: Add commons to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) [22:30:02] (03CR) 10GWicke: "Odd, must have forgotten to push the full change. Fixed now." [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) (owner: 10GWicke) [22:30:02] (03CR) 10GWicke: "Lets do #208193 first." 
[puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: 10GWicke) [22:30:02] (03PS4) 10Yuvipanda: tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748) [22:30:03] (03PS5) 10Yuvipanda: tools: Add check for long running precise / trusty jobs [puppet] - 10https://gerrit.wikimedia.org/r/208880 (https://phabricator.wikimedia.org/T97748) [22:30:04] <_joe_> they list civicrm [22:30:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1150 MB (3% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 369879 MB (25% inode=99%) [22:30:25] mutante: is there a window wherein ops deploys apache changes? dev.wikimedia.org is not critical, I can wait for other apache changes [22:30:50] Just like James_F, I totally didn't write any code at all. [22:30:57] * James_F grins. [22:31:08] Unlike James_F, I wrote this in the wrong channel. [22:31:14] Indeed. [22:31:41] spagewmf_: no, i'm afraid not.
i suggested there should be a SWAT or something because of this [22:35:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 987 MB (2% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 368644 MB (25% inode=99%) [22:40:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 804 MB (2% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366939 MB (25% inode=99%) [22:44:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [22:45:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 699 MB (1% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366540 MB (25% inode=99%) [22:46:16] jgage: there's an almost 10G slow query log there :/ [22:46:36] Move it To NFS! (LabsSolution(tm)) [22:46:49] jgage: is that why you asked for that user? [22:47:15] yuvipanda: ok. 
project "dispenser", instance "osm-tile-server-01" [22:47:25] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [22:47:53] PROBLEM - puppet last run on mw1242 is CRITICAL puppet fail [22:50:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 699 MB (1% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366540 MB (25% inode=99%) [22:52:06] !log rmoen Synchronized php-1.26wmf4/extensions/MobileFrontend/: Update MobileFrontend (duration: 00m 39s) [22:52:17] Logged the message, Master [22:52:20] !log gzip lutetium-slow.log on lutetium to save disk space [22:52:25] Logged the message, Master [22:52:44] !log rmoen Synchronized php-1.26wmf4/extensions/Gather/: Update Gather to master (duration: 00m 25s) [22:52:49] Logged the message, Master [22:53:23] !log rmoen Synchronized php-1.26wmf3/extensions/MobileFrontend/: Update MobileFrontend (duration: 00m 31s) [22:53:28] Logged the message, Master [22:54:10] !log rmoen Synchronized php-1.26wmf3/extensions/Gather/: Update Gather to master (duration: 00m 36s) [22:54:15] Logged the message, Master [22:55:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366540 MB (25% inode=99%) [22:55:30] Turns out i need to run scap since I cannot update the i18n with sync-l10nupdate-1 anymore ? 
[22:56:08] yeah scap is the way to update the prod l10n caches [22:56:32] ;/ [22:56:40] the l10n part is the bulk of the time in a full scap so it doesn't really cost you much [22:57:14] we really should add an option to only update a given branch at some point [22:57:16] 6operations, 10OpenStreetMap, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1262962 (10RobH) @yurik, I don't see any kind of task history linked into this; has there been an operations team member wo... [22:57:44] I just scapped yesterday afternoon though so it shouldn't be too bad. Probably ~25 minutes [22:57:48] ok [22:58:04] !log rmoen Started scap: Updates for Gather and MobileFrontend [22:58:11] Logged the message, Master [22:59:05] (03PS1) 10Ori.livneh: update mod_expires config for static/ [puppet] - 10https://gerrit.wikimedia.org/r/209145 [22:59:30] (03PS2) 10Ori.livneh: update mod_expires config for static/ [puppet] - 10https://gerrit.wikimedia.org/r/209145 [22:59:38] bd808: seems like section https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Alternative_to_scap needs updated .. since no more sync-l10nupdate-1 should this section be removed entirely ? [22:59:46] rmoen: eek, just fyi, starting scap 2 minutes before SWAT isn't the best, I didn't think you'd go over an hour [23:00:05] RoanKattouw, ^d, bd808, James_F, legoktm, JohnLewis, twentyafterfour: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150505T2300). Please do the needful. [23:00:10] greg-g: I know, sorry.
I had to though we have like 10 new messages [23:00:11] o/ [23:00:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366540 MB (25% inode=99%) [23:00:14] * greg-g nods [23:00:19] rmoen: s'ok [23:00:21] (03CR) 10Ori.livneh: [C: 032] "tested on mw1041" [puppet] - 10https://gerrit.wikimedia.org/r/209145 (owner: 10Ori.livneh) [23:00:27] 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1262969 (10yuvipanda) p:5Low>3Normal [23:00:51] looks like scap is going quick [23:01:02] * JohnFLewis peers in [23:01:04] I think I can I think I can [23:01:15] haha [23:01:44] 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1207452 (10yuvipanda) So right now, we have five shelves of disks, and ```/dev/mapper/store-now 40T 11T 30T 27% /srv/project``` So about 72% free. What's preventing us from moving to RA... [23:01:48] rmoen: Are you still scapping? [23:01:49] rmoen: [[How_to_deploy_code#Alternative_to_scap]] is no more. Thanks for pointing that out [23:02:14] RoanKattouw: only 3 minutes in [23:02:15] https://www.youtube.com/watch?v=qNVU23knqZw [23:03:57] (03CR) 10Ori.livneh: "What on earth is "sed "s/0"$'\b'"INFINITY/INFINITY/g""?" [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt) [23:04:58] RoanKattouw: sorry yes. .syncing apaches now [23:05:03] \\\\ [23:05:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366539 MB (25% inode=99%) [23:05:31] syncing apaches is still a thing? 
[23:05:39] you mean mw though, right [23:05:43] yes [23:05:48] ok [23:05:54] RECOVERY - puppet last run on mw1242 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [23:07:42] (03PS1) 10Dereckson: Enable NewUserMessage on bh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209146 (https://phabricator.wikimedia.org/T97920) [23:10:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366539 MB (25% inode=99%) [23:11:30] How are wikis defined as private, fishbowl, closed, small, medium, large, etc. wikis (for the purposes of configuration)? [23:11:36] 6operations, 10hardware-requests: order new array for dataset1001 - https://phabricator.wikimedia.org/T93118#1129802 (10RobH) Ordered, ETA 2015-05-18. [23:11:48] mutante: ^ [23:11:59] kaldari: .dblist entries? [23:12:25] oh [23:12:48] kaldari: Some of them are magically computed, but most are just plain old .dblist rows. [23:13:22] what James said, files called .dblist in mediawiki-config [23:13:29] James_F: I see now. Thanks! [23:13:35] E.g. don't look at group1.dblist unless you want to cry. [23:13:35] James_F: ;) [23:13:47] * James_F grins at ori, Disruptor of Worlds™ [23:14:25] James_F: why cry? an expression is surely better than an unspecified understanding that isn't encoded in software at all [23:14:57] ori: Whoa programmatic dblists, nice! [23:15:07] ori: So can we support comments in dblists now? 
:D [23:15:15] * RoanKattouw has wanted that since 2009 or so [23:15:15] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366539 MB (25% inode=99%) [23:15:40] https://noc.wikimedia.org/conf/highlight.php?file=group1.dblist :( [23:15:55] legoktm: Forgot the symlink, probably. [23:16:07] RoanKattouw: no, but it's trivial to add now, since all dblist file-loading is done in one place [23:16:31] James_F: where should the symlink go? [23:16:53] ori: Reminding myself right now. [23:17:03] there's a script to create them [23:17:19] Yeah. [23:19:04] ori: in docroot/noc/conf [23:19:28] (03CR) 10Tim Landscheidt: "@valhallasw: Moving into init.pp is okay; will do that." [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt) [23:19:42] ori: https://github.com/wikimedia/operations-mediawiki-config/blob/master/docroot/noc/createTxtFileSymlinks.sh [23:19:49] yes i was just staring at that [23:20:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=92%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 366539 MB (25% inode=99%) [23:22:49] (03CR) 10Spage: "This is OK to deploy but note the destination URL is likely to change later in April to a labs instance, and probably again in June to a n" [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [23:23:00] (03CR) 10Ori.livneh: "@Tim: That's amazing, hah. Could you add that link to a comment in that file?" 
[puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt) [23:23:07] !log deleted 8G recurring_blocked.tsv from lutetium [23:23:23] Logged the message, Master [23:23:42] (03PS2) 10Tim Landscheidt: gridengine: Puppetize gridengine-mailer [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) [23:24:18] (03CR) 10Ori.livneh: "(hosting a .wikimedia.org URL on labs is a non-starter, IMO, because of cross-origin security issues.)" [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [23:24:51] (03CR) 10Ori.livneh: [C: 031] gridengine: Puppetize gridengine-mailer [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt) [23:26:21] Still waiting for scap to finish? :/ [23:26:57] (03PS3) 10coren: access: Remove Erik Moeller's Production Shell Access [puppet] - 10https://gerrit.wikimedia.org/r/208566 (owner: 10Matanya) [23:27:44] (03CR) 10coren: [C: 032] "Thanks for all the dedication and passion over the years, Erik. You're always welcome to get suckered into helping us again as a voluntee" [puppet] - 10https://gerrit.wikimedia.org/r/208566 (owner: 10Matanya) [23:28:14] JohnFLewis: sync-common: 99% (ok: 464; fail: 0; left: 1) [23:28:28] rmoen: how long has it been like that? [23:28:39] a while [23:28:49] * bd808 looks to see which is hung [23:28:56] greg-g: 30 minutes [23:29:04] .... [23:29:23] grrr.. my old enemy snapshot1004.eqiad.wmnet [23:29:33] 6operations, 5Patch-For-Review: Remove Erik Moeller's Production Shell Access - https://phabricator.wikimedia.org/T97864#1263085 (10coren) 5Open>3Resolved a:3coren After a chat with Erik, he has no intention to use his access as a volunteer in the short term and so agrees that it's wiser to turn it off f... 
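[Editor's note] The dblist discussion earlier (kaldari's question, and ori's programmatic group1.dblist) can be sketched as plain line-lists plus a set operation: group1 is everything except the Wikipedias. All wiki names and file contents below are invented stand-ins, not the production lists:

```python
# Hedged sketch of dblist evaluation. A classic .dblist file is one
# database name per line; the computed variant derives group1 as
# "all wikis minus the Wikipedias".
def parse_dblist(text):
    """Return the set of db names, skipping blank lines (and, if the
    comment support RoanKattouw wished for existed, '#' lines)."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line)
    return names


# Illustrative stand-in contents, not the real all.dblist/wikipedia.dblist.
all_dblist = "enwiki\ndewiki\ncommonswiki\nmetawiki\n"
wikipedia_dblist = "enwiki\ndewiki\n"

group1 = parse_dblist(all_dblist) - parse_dblist(wikipedia_dblist)
print(sorted(group1))  # prints ['commonswiki', 'metawiki']
```

Encoding the group as an expression over other lists, rather than a hand-maintained copy, is what ori means by it being "encoded in software" instead of an unspecified understanding.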
[23:29:40] (03PS3) 10Yuvipanda: gridengine: Puppetize gridengine-mailer [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt) [23:29:52] (03CR) 10Yuvipanda: [C: 032 V: 032] "Thanks Tim!" [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt) [23:29:57] rmoen: open a second ssh session to tin and kill the ssh process you own there that is connecting to snapshot1004.eqiad.wmnet [23:30:05] ok [23:30:05] that will unstick the scap [23:30:20] !log snapshot1004.eqiad.wmnet hanging scap yet again [23:30:26] (03CR) 10Eevans: [C: 031] Add commons to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) (owner: 10GWicke) [23:30:27] Logged the message, Master [23:31:14] bd808: ok scap-rebuild-cdbs now [23:31:17] thanks [23:31:31] yw. I should have been paying closer attention [23:32:06] that snapshot1004.eqiad.wmnet box has been completely overloaded on a regular basis for more than a week [23:33:22] !log running sync-common on snapshot1004.eqiad.wmnet manually after it was aborted in scap by rmoen [23:33:29] Logged the message, Master [23:39:16] !log rmoen Finished scap: Updates for Gather and MobileFrontend (duration: 41m 11s) [23:39:23] Logged the message, Master [23:39:24] * rmoen claps [23:39:32] rmoen: next time wait for SWAT if it's that close :) [23:39:35] \o/ [23:39:45] greg-g: ok. my apologies [23:39:59] rmoen: or, have your team prep the patches before the deploy (just saw your email) and, well, test on Beta Cluster since it seems like that was possible with what you described [23:40:41] wasn't* [23:41:02] greg-g: yeah its all on me. I should have prepared prior. 
I'm sorta on light duty right now because of RSI issues ;/ so i take full blame on being unprepared today [23:41:39] s'alright, but I'm still worried about your team deploying code to enwiki that hasn't been tested (in the same states/checkout points) on beta cluster [23:41:57] s/worried/annoyed and disheartened [23:42:33] someday we will fix the pipeline so that's not even an option. someday [23:42:53] I agree. Checking out all the branches locally. But we should definitely be testing on beta cluster. I think we need to enable for test2? [23:42:56] So who's the unlucky swatter today? [23:43:11] yeah, so for now we're understaffed to do that and people work around it and take more risks than they should [23:43:12] Me [23:43:31] Oh look at that the scap is done [23:43:35] rmoen: Are you done now? [23:43:37] yes [23:43:47] RoanKattouw: mind if I ask for you to deploy my patch first if possible? [23:43:55] Sure [23:44:47] springle: thanks for the new s5 R710 :D [23:45:29] (03CR) 10Catrope: [C: 032] Use Wiki.svg for wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209126 (owner: 10John F. Lewis) [23:45:35] (03Merged) 10jenkins-bot: Use Wiki.svg for wikimania2015wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209126 (owner: 10John F. Lewis) [23:46:58] The difference is well noticeable [23:47:36] !log switched hadoop active namenode from analytics1001 to analytics1002 for rack C4 switch replacement tomorrow morning (T93730) [23:47:44] Logged the message, Master [23:49:15] !log catrope Synchronized wmf-config/InitialiseSettings.php: Use Wiki.svg for wikimania2015wiki logo (duration: 00m 19s) [23:49:20] Logged the message, Master [23:49:23] JohnFLewis: Done ---^^ [23:49:51] RoanKattouw: and confirmed no difference. 
Thanks :) [23:52:19] hoo: yw [23:57:21] !log aborted and restarted sync-common on snapshot1004.eqiad.wmnet manually after waiting 24 minutes with no progress [23:57:29] Logged the message, Master [23:58:06] anybody know why snapshot1004 gets so IO bound? [23:58:33] 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1263282 (10yuvipanda)