[00:00:15] operations, Labs, hardware-requests: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1286742 (Andrew)
[00:01:33] operations, Labs, hardware-requests: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1285990 (Andrew) Let's call it labdns1003
[00:04:19] PROBLEM - HHVM queue size on mw1169 is CRITICAL 100.00% of data above the critical threshold [80.0]
[00:04:20] PROBLEM - HHVM busy threads on mw1169 is CRITICAL 100.00% of data above the critical threshold [115.2]
[00:07:03] !log elastic1017 es-tool restart-fast
[00:07:09] Logged the message, Master
[00:10:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[00:11:39] operations, Deployment-Systems, Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1286807 (bd808) From the deploy tools side of this, it should be fairly simple to add a command to...
[00:12:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[00:13:49] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0]
[00:14:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[00:16:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0]
[00:19:49] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0]
[00:23:49] (PS1) Yuvipanda: Support --release param for backwards compatibility [software/tools-webservice] - https://gerrit.wikimedia.org/r/211066
[00:23:56] (CR) jenkins-bot: [V: -1] Support --release param for backwards compatibility [software/tools-webservice] - https://gerrit.wikimedia.org/r/211066 (owner: Yuvipanda)
[00:24:26] (PS2) Yuvipanda: Support --release param for backwards compatibility [software/tools-webservice] - https://gerrit.wikimedia.org/r/211066
[00:25:10] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[00:26:19] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0]
[00:28:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0]
[00:31:50] (CR) Yuvipanda: [C: -1] "+1 on the general approach." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/211059 (owner: Andrew Bogott)
[00:37:18] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[00:37:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[00:41:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0]
[00:42:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0]
[00:43:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[00:47:10] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[00:50:10] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0]
[00:50:21] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0]
[00:54:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[00:55:47] (PS1) Krinkle: Enable Interwiki extension in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/211067
[00:56:03] grrrit-wm: ping, OK on deploy ^ ?
[00:56:07] greg-g: ^
[00:56:18] Special:Interwiki basically
[00:56:54] sure
[00:57:02] (CR) Krinkle: [C: 2] Enable Interwiki extension in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/211067 (owner: Krinkle)
[00:57:08] (Merged) jenkins-bot: Enable Interwiki extension in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/211067 (owner: Krinkle)
[00:57:25] http://en.wikipedia.beta.wmflabs.org/wiki/Special:Interwiki
[00:58:02] greg-g: Is it intentional that interwiki codes in beta labs point to production?
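The PROBLEM/RECOVERY lines above fire when a given fraction of recent datapoints sits above a threshold (e.g. "11.11% of data above the critical threshold [20000.0]" is one sample out of nine). A minimal sketch of that kind of check, assuming a plain list of sampled values (an illustration only, not the actual Icinga/check_graphite code):

```python
def fraction_above(datapoints, threshold):
    """Fraction of datapoints strictly above the threshold."""
    if not datapoints:
        return 0.0
    return sum(1 for v in datapoints if v > threshold) / len(datapoints)

def check(datapoints, threshold, critical_fraction):
    """Format a CRITICAL/OK status line in the style of the alerts above."""
    frac = fraction_above(datapoints, threshold)
    if frac >= critical_fraction:
        return "CRITICAL %.2f%% of data above the critical threshold [%.1f]" % (
            frac * 100, threshold)
    return "OK Less than %.2f%% above the threshold" % (critical_fraction * 100)
```

With nine samples and one spike over 20000, `check` reports the familiar 11.11% figure; a flapping metric explains the rapid PROBLEM/RECOVERY cycles in the log.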
[00:58:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0]
[00:58:11] I noticed it earlier today and was somewhat confused.
[00:58:38] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 33735 MB (3% inode=99%)
[01:02:02] eh, svn.wm.o has an expired SSL cert?
[01:02:18] Krinkle: not sure, honestly
[01:02:44] legoktm: Yeah, I filed that a few weeks ago. Ops seems to be refusing to fix it because it's deprecated / to become redirected to the svn viewer in phab.
[01:02:51] But even then it should have a valid ssl cert for the redirect.
[01:03:08] https://phabricator.wikimedia.org/T98723
[01:03:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:03:48] I found https://phabricator.wikimedia.org/T88731
[01:04:13] Oh :)
[01:04:32] operations: Move svn.wikimedia.org behind misc-web - https://phabricator.wikimedia.org/T98723#1286838 (Krinkle)
[01:04:40] operations: Move svn.wikimedia.org behind misc-web - https://phabricator.wikimedia.org/T98723#1275517 (Krinkle)
[01:04:53] operations, Wikimedia-General-or-Unknown, Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1286842 (Legoktm) I just ran into this when someone linked to svn.wm.o in . I underst...
[01:06:08] operations, Wikimedia-General-or-Unknown, Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1286844 (Krinkle) >>! In T88731#1286842, @Legoktm wrote: > I just ran into this when someone linked to svn.wm.o in
RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0]
[01:06:20] legoktm: Hm.. I wonder if using misc-web would interfere with the svn protocol
[01:06:31] no idea how that stuff works
[01:06:36] probably?
[01:06:47] The viewer needs to access its own server.
[01:06:53] :/
[01:07:17] Though it should not have to use the fqdn to access it
[01:07:22] 'localhost' would work fine
[01:07:29] operations, Wikimedia-General-or-Unknown, Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1286846 (Dzahn) >>! In T88731#1286844, @Krinkle wrote: >>>! In T88731#1286842, @Legoktm wrote: >> I just ran into this when someone linked to svn.wm.o in
well there's the answer
[01:09:14] operations, Wikimedia-General-or-Unknown, Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1018985 (Krinkle)
[01:10:08] legoktm: https://phabricator.wikimedia.org/rODNS046273c6585d8c04c2b9eeaae6ba7e6ba27e8d68
[01:10:20] legoktm: Hm.. any idea why the config update didn't work?
[01:10:29] http://en.wikipedia.beta.wmflabs.org/wiki/Special:Interwiki?2 still broken
[01:11:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:11:44] Krinkle: legoktm https://phabricator.wikimedia.org/T83443#914052
[01:12:40] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[01:12:53] !log elastic1018 es-tool restart-fast
[01:12:57] mutante: I guess that failed to actually deploy? Because in prod the svn.wm.o cert is still broken
[01:13:02] Logged the message, Master
[01:13:05] Or has it been another year?
[01:13:05] operations, Wikimedia-General-or-Unknown, Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1286854 (Dzahn) also see https://phabricator.wikimedia.org/T83443#914052
[01:13:08] Ah, this was last year
[01:13:25] Krinkle: it's the reason why moving it behind misc-web didn't happen back then
[01:13:31] Yeah
[01:13:34] Makes sense
[01:13:49] I was deceived by the Conduit import date
[01:13:50] "Via Conduit · Dec 18 2014, "
[01:13:56] "Via Legacy · Jan 24 2014, 9:33 PM"
[01:14:00] yea, it's an RT ticket
[01:14:17] eh, not sure about the exact dates
[01:14:32] mutante: so why don't we just buy a 1yr cert again?
[01:14:41] is that super expensive?
[01:15:08] legoktm: i dont know how much it is currently
[01:15:36] i think there were more related tickets.. hmm
[01:15:58] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0]
[01:16:10] robh: remember the discussion about the svn cert?
[01:17:09] legoktm: Krinkle: so here, i tried this approach too:
[01:17:11] https://phabricator.wikimedia.org/T86655#1097401
[01:17:22] consensus to disable the actual svn protocol
[01:17:28] but got -1
[01:17:39] (PS3) Andrew Bogott: Make the DNS server for .wmflabs configurable [puppet] - https://gerrit.wikimedia.org/r/211063
[01:17:41] (PS4) Andrew Bogott: Ensure => present rather than 'latest' [puppet] - https://gerrit.wikimedia.org/r/211060
[01:17:43] I think we still need that for migrating it to phab
[01:17:43] (PS6) Andrew Bogott: Added a simple IP-aliasing script for the pdns recursor. [puppet] - https://gerrit.wikimedia.org/r/211059
[01:18:42] legoktm: It's already migrated to phab. Chad did that last month
[01:18:54] no, it's just being mirrored
[01:19:09] it still lives on svn.wm.o
[01:19:12] https://phabricator.wikimedia.org/T83443#914082
[01:20:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:20:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:20:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:24:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0]
[01:24:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0]
[01:24:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0]
[01:24:20] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:28:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:29:06] legoktm: mirrored in what sense?
[01:29:11] Does phab not have a complete copy?
[01:29:29] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[01:29:30] import*
[01:34:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0]
[01:37:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0]
[01:43:39] Krinkle: I'm not sure tbh
[02:04:38] (Abandoned) Dzahn: add class to install enchant and myspell packages [puppet] - https://gerrit.wikimedia.org/r/210846 (https://phabricator.wikimedia.org/T99030) (owner: Dzahn)
[02:06:10] (CR) Dzahn: "good points @ JohnLewis" [puppet] - https://gerrit.wikimedia.org/r/210838 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[02:06:22] (Abandoned) Dzahn: WIP: deployment: make rsync_host configurable [puppet] - https://gerrit.wikimedia.org/r/210838 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[02:10:17] (PS2) Springle: Add grants on centralauth.* via production-grants-core [puppet] - https://gerrit.wikimedia.org/r/210932 (owner: Hoo man)
[02:11:19] (CR) Springle: [C: 2] Add grants on centralauth.* via production-grants-core [puppet] - https://gerrit.wikimedia.org/r/210932 (owner: Hoo man)
[02:12:51] !log elastic1019 es-tool restart-fast
[02:13:00] Logged the message, Master
[02:18:50] (PS1) Dzahn: apt: indentation fixes [puppet] - https://gerrit.wikimedia.org/r/211069
[02:18:52] (PS1) Dzahn: chromium: indentation fix [puppet] - https://gerrit.wikimedia.org/r/211070
[02:18:53] (PS1) Dzahn: datasets: indentation fixes [puppet] - https://gerrit.wikimedia.org/r/211071
[02:29:59] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 05m 39s)
[02:30:09] Logged the message, Master
[02:31:18] (PS1) Springle: repool db1019; depool db1053 [mediawiki-config] - https://gerrit.wikimedia.org/r/211072
[02:31:45] (CR) Springle: [C: 2] repool db1019; depool db1053 [mediawiki-config] - https://gerrit.wikimedia.org/r/211072 (owner: Springle)
[02:31:50] (Merged) jenkins-bot: repool db1019; depool db1053 [mediawiki-config] - https://gerrit.wikimedia.org/r/211072 (owner: Springle)
[02:33:16] !log springle Synchronized wmf-config/db-eqiad.php: repool db1019; depool db1053 (duration: 00m 13s)
[02:33:22] Logged the message, Master
[02:33:39] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[02:34:22] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-15 02:33:18+00:00
[02:34:27] Logged the message, Master
[02:34:30] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (90264s > 90000s)
[02:37:46] (PS1) Springle: reassign db1053 to s1 [puppet] - https://gerrit.wikimedia.org/r/211073
[02:39:34] (CR) Springle: [C: 2] reassign db1053 to s1 [puppet] - https://gerrit.wikimedia.org/r/211073 (owner: Springle)
[02:42:29] !log upgrade db1053 trusty
[02:42:35] Logged the message, Master
[02:54:25] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 04m 37s)
[02:54:32] Logged the message, Master
[02:55:51] !log xtrabackup clone db1057 to db1053
[02:55:56] Logged the message, Master
[02:55:59] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL - Socket timeout after 10 seconds
[02:57:39] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 7.675 second response time
[02:58:02] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-15 02:56:59+00:00
[02:58:08] Logged the message, Master
[02:58:44] !backup
[04:34:40] (PS1) Springle: repool db1018 [mediawiki-config] - https://gerrit.wikimedia.org/r/211076
[04:36:06] (CR) Springle: [C: 2] repool db1018 [mediawiki-config] - https://gerrit.wikimedia.org/r/211076 (owner: Springle)
[04:36:13] (Merged) jenkins-bot: repool db1018 [mediawiki-config] - https://gerrit.wikimedia.org/r/211076 (owner: Springle)
[04:37:49] !log springle Synchronized wmf-config/db-eqiad.php: repool db1018, warm up (duration: 00m 11s)
[04:37:59] Logged the message, Master
[04:51:44] operations, Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1287043 (MZMcBride) >>! In T99136#1286122, @JohnLewis wrote: > The point is to finally standardise all mailing lists, a project that has been open for a few years now. Is that really a project? D...
[04:53:59] operations, Roadmap, Wikimedia-Mailing-lists, notice, user-notice: Mailing list maintenance window - 2015-05-19 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T99098#1287045 (MZMcBride) Will "scrubbing items" result in the archives being re-indexed and pipermail links breaking?
[05:05:04] operations, Wikimedia-General-or-Unknown, Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1287050 (ArielGlenn) a: RobH
[05:06:31] operations, Wikimedia-General-or-Unknown, Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1018985 (ArielGlenn) @RobH, I gave this to you since svn can't move behind misc-web after all.
[05:10:09] operations, MediaWiki-JobQueue, MediaWiki-JobRunner, Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1287054 (ArielGlenn) @Technical, I was watching the estimate provided at the en wp link I mentioned above. Given that it's only an estimate...
[05:11:33] operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1287055 (ArielGlenn) @dr0ptp4kt has said he'll look into options 1 and 3.
[05:13:17] operations, Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1287056 (ArielGlenn) I'll +1 that (legal.wikimedia.org). Any objections? Speak up today or forever hold your peace etc.
[05:18:37] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (12815 < 90000s)
[05:20:47] the MediaWiki error count link in the topic is funny
[05:24:54] operations, Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1287070 (Heather) Off-phab it has been decided that legal.wikimedia is not appropriate because this is about all of Wikimedia, not just about the legal team. We're back to wikimedia.org/stopsurveillance
[05:25:30] bd808: awwww, elasticsearch/plugins is provisioned via trebuchet for role::logstash::elasticsearch...
[05:25:35] * yuvipanda doesn't want to set up trebuchet for tools
[05:25:59] hmmm
[05:26:37] technically the logstash cluster doesn't need any of the plugins
[05:26:39] morning
[05:26:56] but kibana is also trebuchet deployed
[05:27:04] aaaahhhhh
[05:27:04] o/ jynus
[05:27:08] hi jynus :)
[05:27:22] 23:27 here so... almost morning ;)
[05:27:28] well, the compulsiveness to experiment with logstash in toollabs immediately just disappeared, so yay :P
[05:27:40] (note the 'immediately')
[05:28:22] yuvipanda: you know that trebuchet deployed pretty much == clone a git repo right?
[05:28:36] bd808: don't I need to set up a tools-deploy host and associated salt stuff?
[05:28:45] we need to refactor the ops/puppet deployment stuff actually
[05:29:02] I mean, if it's just a git repo clone why not just use git::clone
[05:29:03] right now you can't have a trebuchet master without also being a scap master
[05:29:05] I use that elsewhere
[05:29:26] yeah, I think I remember vaguely touching that code trying to set up staging-tin
[05:29:31] operations, MediaWiki-JobQueue, MediaWiki-JobRunner, Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1287074 (Bawolff) I thought that that was no longer an estimate when using the redis job queue.
[05:29:31] which is also why I don't want to go there again.
[05:29:43] yeah...
[05:30:02] I'll put that code on my list of "things to do on a transatlantic flight"
[05:30:07] heh :)
[05:30:15] :-D
[05:30:19] I don't even use salt for cmd.run on tools anymore.
[05:30:24] shame!
[05:30:28] on salt? :P
[05:30:42] on you!
[05:30:45] I have a short python script that generates hostgroups for me, and I've been really happy using pssh
[05:31:01] apergos: the salt haters are winning (or at least grumbling)
[05:31:07] ugh.... well to each their own
[05:31:15] I'm not a salt hater - I'm sure it's quite usable if we take care of it properly
[05:31:30] I can't say I'm a fan of it as a deploy tool backbone
[05:31:33] but the last time I basically had to open multiple tabs in my terminal and ssh to them manually
[05:31:44] apergos: it's way more flaky in labs than in prod.
[05:32:02] apergos: it's not a 'to each their own' - it is genuinely unusable for remote command execution at the moment, at least in labs.
[05:32:24] well when I did the labs upgrade it workedforme
[05:32:35] and I used the hell out of it
[05:32:56] if by 'to each their own' you mean 'it works for some people and does not execute on 50% of hosts for other people' then yeah, that's accurate.
[05:32:57] people do have a habit of leaving their labs instances in bad states though
[05:33:06] apergos: not really - this is toollabs, the instances are in good shape
[05:33:21] e.g. "oh look we have ferm rules that prevent communication with the salt master" or "hey my packages are in a broken state"
[05:33:37] not true at all for toollabs.
[05:33:41] but except for such things I had no problem
[05:33:52] if by 'to each their own' you mean 'it works for some people and does not execute on 50% of hosts for other people' then yeah, that's accurate :)
[05:33:55] well maybe I should take a look at tool labs in particular
[05:34:04] you totally should :)
[05:34:15] or, if you like, you could describe the issues you are having in toollabs on a ticket and add me as a subscriber
[05:34:17] salt -G 'fqdn: tools-*' cmd.run hostname
[05:34:28] sure I can do that
[05:34:32] great
[05:34:52] does tool labs have its own salt master (not virt1000)?
[05:35:14] no
[05:35:17] it doesn't. just virt1000
[05:35:26] ok great
[05:35:42] yeah please do, I'm interested in making sure it works for everyone and works well
[05:36:22] all joking aside.
[05:36:42] operations, Labs: salt does not run reliably for toollabs - https://phabricator.wikimedia.org/T99213#1287107 (yuvipanda) NEW
[05:36:43] apergos: ^
[05:37:04] (and I use https://github.com/yuvipanda/personal-wiki/blob/master/tools-dsh-generator.py atm)
[05:37:26] (Abandoned) Yuvipanda: tools: make seperate /tmp hiera-configurable [puppet] - https://gerrit.wikimedia.org/r/210918 (https://phabricator.wikimedia.org/T99069) (owner: Merlijn van Deen)
[05:37:38] operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1287122 (MZMcBride) What do we use Google Webmaster Tools for?
[05:37:42] ok, thanks!
[05:37:53] apergos: yw!
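The workaround discussed above replaces salt's grain targeting (`salt -G 'fqdn: tools-*' cmd.run hostname`) with pre-generated host lists consumed by pssh. A minimal sketch of that idea, under assumptions: the grouping heuristic and the instance names here are illustrative, not the actual logic of the linked tools-dsh-generator.py.

```python
def group_hosts(fqdns, prefix="tools-"):
    """Group instance fqdns into hostgroups keyed by role, e.g.
    tools-webgrid-01 and tools-webgrid-02 both land in 'webgrid'."""
    groups = {}
    for fqdn in sorted(fqdns):
        name = fqdn.split(".")[0]          # drop the domain part
        if not name.startswith(prefix):
            continue                        # skip non-toollabs hosts
        # role = instance name minus the project prefix and trailing index
        role = name[len(prefix):].rsplit("-", 1)[0]
        groups.setdefault(role, []).append(fqdn)
    return groups

def write_hostgroups(groups, outdir="."):
    """Write one pssh-compatible hosts file (one host per line) per group."""
    import os
    for role, hosts in groups.items():
        with open(os.path.join(outdir, role), "w") as f:
            f.write("\n".join(hosts) + "\n")

# Hypothetical usage once the files exist:
#   parallel-ssh -h webgrid -i hostname
```

This also gives the per-role "clusters" mentioned above, which a flat salt target does not.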
[05:38:01] !log temporarily opening mysql port on firewall from db1009 to virt1000
[05:38:10] Logged the message, Master
[05:38:11] that reminds me to update the config on virt1000 to do the 'ping after master key rotation' and see how we think that works
[05:38:37] apergos: bblack was having similar issues with prod earlier. I think he tried a general clusterwide cmd.run to see if it works and got... 2 machines to respond totally or something
[05:38:45] yep
[05:38:55] this is a known issue, one of the reasons I did the upgrade
[05:38:58] operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1287123 (Jalexander) One warning about #1 though it is still good to look into it since I haven't actually tried. My understanding from the documentation is that we can't actually 'delete' a site from our count...
[05:39:06] PROBLEM - puppet last run on cp4012 is CRITICAL puppet fail
[05:39:10] at this point a lot of people have had very negative experiences with salt command execution and I guess don't even bother reporting bugs because it's SNAFU...
[05:39:14] apergos: this was earlier today...
[05:39:16] so we would be able to turn on this setting. I need to test it some before we do it on the prod saltmaster
[05:39:40] haven't turned on the setting yet, patience young grasshopper! only yesterday I got to close the upgrade ticket :-D
[05:39:48] ok :)
[05:40:15] I don't have skin in the game anymore - I'm going to keep at pssh for now. It also gives me clusters which salt does not (on labs)
[05:40:38] (for toollabs - I'll still have to use salt for general labs-wide maintenance)
[05:40:54] operations, Labs: salt does not run reliably for toollabs - https://phabricator.wikimedia.org/T99213#1287124 (ArielGlenn) as a first step I need to turn on the config setting on the master that forces a ping of all clients after salt master key rotation (which happens every 24 hours or after any key is del...
[05:42:13] operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1287126 (Jalexander) >>! In T99132#1287122, @MZMcBride wrote: > What do we use Google Webmaster Tools for? My main use case is communicating with google through the automated means. It's how we get reports of RT...
[05:43:31] Puppet, operations, Deployment-Systems, Staging: provider => trebuchet doesn't work until manual 'git deploy start' on deployment-server - https://phabricator.wikimedia.org/T92978#1287128 (ArielGlenn) A run of salt my-deployment-server-here deploy.deployment_server_init will do the trick once I ge...
[05:55:16] RECOVERY - puppet last run on cp4012 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[05:59:47] (PS1) Yuvipanda: Read default webservice class type from Service Manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/211080
[06:01:40] operations, MediaWiki-JobQueue, MediaWiki-JobRunner, Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1287134 (Nemo_bis) > And, for what it's worth, still template edits from back as far as April 19th that haven't filtered through. I don't kn...
[06:06:54] (PS2) Yuvipanda: Read default webservice class type from Service Manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/211080
[06:07:00] (CR) jenkins-bot: [V: -1] Read default webservice class type from Service Manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/211080 (owner: Yuvipanda)
[06:07:45] operations, Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1287141 (JohnLewis) >>! In T99136#1286503, @Legoktm wrote: >>>! In T99136#1286321, @JohnLewis wrote: >> Archives will not be broken. People's filters will also not necessarily be broken as people...
[06:10:37] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 34748 MB (3% inode=99%)
[06:10:56] (PS3) Yuvipanda: Read default webservice class type from Service Manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/211080
[06:11:02] (CR) jenkins-bot: [V: -1] Read default webservice class type from Service Manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/211080 (owner: Yuvipanda)
[06:13:32] operations, Roadmap, Wikimedia-Mailing-lists, notice, user-notice: Mailing list maintenance window - 2015-05-19 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T99098#1287154 (JohnLewis) >>! In T99098#1287045, @MZMcBride wrote: > Will "scrubbing items" result in the archives being re-i...
[06:19:24] (PS4) Yuvipanda: Read default webservice class type from Service Manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/211080
[06:19:31] (CR) jenkins-bot: [V: -1] Read default webservice class type from Service Manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/211080 (owner: Yuvipanda)
[06:23:22] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 15 06:22:19 UTC 2015 (duration 22m 18s)
[06:23:29] Logged the message, Master
[06:24:19] (PS5) Yuvipanda: Read default webservice class type from Service Manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/211080
[06:26:04] (PS6) Yuvipanda: Read default webservice class type from Service Manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/211080
[06:27:43] (PS7) Yuvipanda: Read default webservice class type from Service Manifest [software/tools-webservice] - https://gerrit.wikimedia.org/r/211080
[06:29:37] PROBLEM - puppet last run on mw1213 is CRITICAL puppet fail
[06:29:57] PROBLEM - puppet last run on cp1058 is CRITICAL puppet fail
[06:30:47] PROBLEM - puppet last run on db1021 is CRITICAL Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on cp4019 is CRITICAL puppet fail
[06:31:27] PROBLEM - puppet last run on db1034 is CRITICAL Puppet has 1 failures
[06:31:56] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 2 failures
[06:32:17] PROBLEM - puppet last run on db1051 is CRITICAL Puppet has 1 failures
[06:32:57] PROBLEM - puppet last run on cp4004 is CRITICAL Puppet has 1 failures
[06:33:16] PROBLEM - puppet last run on wtp2012 is CRITICAL Puppet has 1 failures
[06:33:16] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures
[06:33:17] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures
[06:33:36] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures
[06:33:57] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures
[06:33:58] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:34:27] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 2 failures
[06:34:36] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures
[06:34:47] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 1 failures
[06:46:27] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:46:38] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:46:38] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on db1051 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:47:06] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:47:07] RECOVERY - puppet last run on db1021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:28] RECOVERY - puppet last run on mw1213 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:47:37] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:38] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:47:46] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:47:47] RECOVERY - puppet last run on db1034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:48] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:47:57] RECOVERY - puppet last run on wtp2012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:57] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:57] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:48:37] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:17] RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:17] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [07:03:06] PROBLEM - HHVM busy threads on mw1169 is CRITICAL 100.00% of data above the critical threshold [115.2] [07:03:06] PROBLEM - HHVM queue size on mw1169 is CRITICAL 100.00% of data above the critical threshold [80.0] [07:06:07] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.75% of data above the critical threshold [1000.0] [07:08:14] (03PS1) 10Ori.livneh: Don't send profiling data to graphite for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211085 [07:09:06] PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail [07:11:17] PROBLEM - HHVM queue size on mw1169 is CRITICAL 100.00% of data above the critical 
threshold [80.0] [07:11:17] PROBLEM - carbon-cache write error on graphite1001 is CRITICAL 44.44% of data above the critical threshold [8.0] [07:12:07] (03CR) 10Filippo Giunchedi: [C: 031] Don't send profiling data to graphite for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211085 (owner: 10Ori.livneh) [07:12:53] (03CR) 10Ori.livneh: [C: 032] Don't send profiling data to graphite for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211085 (owner: 10Ori.livneh) [07:12:56] PROBLEM - HHVM busy threads on mw1169 is CRITICAL 100.00% of data above the critical threshold [115.2] [07:14:00] !log ori Synchronized wmf-config/StartProfiler.php: I6a516a0da: Don't send profiling data to graphite for now (duration: 00m 11s) [07:14:07] Logged the message, Master [07:14:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:15:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:16:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [07:18:48] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [07:18:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 22.22% of data above the critical threshold [20000.0] [07:19:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:20:06] !log rm -rf /var/lib/carbon/whisper/MediaWiki/query_* on graphite1001 and graphite2001, as follow-up cleanup for I6a516a0da [07:20:11] Logged the message, Master [07:22:31] ah thanks ori, I didn't know what could be tossed from there [07:23:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [07:23:48] PROBLEM - Varnishkafka Delivery Errors per minute on
cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:24:17] (03PS1) 10Giuseppe Lavagetto: ipsec: fix class name [puppet] - 10https://gerrit.wikimedia.org/r/211087 [07:25:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [07:26:00] (03PS2) 10Giuseppe Lavagetto: ipsec: fix class name [puppet] - 10https://gerrit.wikimedia.org/r/211087 [07:26:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] ipsec: fix class name [puppet] - 10https://gerrit.wikimedia.org/r/211087 (owner: 10Giuseppe Lavagetto) [07:27:07] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:27:16] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:28:46] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [07:28:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:29:02] <_joe_> looks like more ulsfo troubles? 
[07:31:23] (03PS8) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [07:32:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [07:33:06] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 34328 MB (3% inode=99%) [07:33:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [07:37:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:41:09] (03PS9) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [07:41:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [07:41:27] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 34643 MB (3% inode=99%) [07:42:58] <_joe_> godog: ^^ why is this happening? [07:43:42] I think not all mw got the config (yet?) so new xhprof metrics are still being created (cc ori) [07:44:07] ah [07:45:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:47:18] godog: looking [07:50:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [07:50:47] <_joe_> godog: is it possible to see which hosts are submitting those stats? 
[07:52:38] (03PS10) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [07:52:53] (03CR) 10jenkins-bot: [V: 04-1] Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 (owner: 10Yuvipanda) [07:53:18] (03PS11) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [07:53:56] _joe_: yup looking [07:54:33] (03PS12) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [07:57:17] PROBLEM - puppet last run on cp3018 is CRITICAL puppet fail [07:58:26] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:58:43] (03PS13) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [07:59:28] 6operations, 10Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1287297 (10Nemo_bis) >>! In T99136#1286122, @JohnLewis wrote: > The point is to finally standardise all mailing lists, a project that has been open for a few years now. That's a very commendable go... [08:00:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:01:26] (03PS14) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [08:01:46] a bunch of appservers in eqiad really, when does config gets reloaded? [08:01:59] <_joe_> godog: can you name a couple? 
[08:02:13] <_joe_> godog: in theory, as soon as the code is synced [08:02:54] mw1139 mw1214 mw1194 mw1189 mw1206 [08:02:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [08:03:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [08:03:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:03:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:03:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [08:05:10] (03PS1) 10Giuseppe Lavagetto: ipsec: fix hiera data [puppet] - 10https://gerrit.wikimedia.org/r/211090 [08:06:34] <_joe_> godog: still coming from 1139? [08:06:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [08:06:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [08:07:28] (03Abandoned) 10Mjbmr: Add autopatrolled for fawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr) [08:07:30] _joe_: nope doesn't look like it [08:07:36] <_joe_> btw, a lot of cirrussearch errors as far as I can see [08:08:08] hmm, given that I have the hash of a file, and I want to write to said file *only* if hash hasn't changed, *and* I want to write to it even if it doesn't exist: how do I do this in a race-free way? [08:08:23] there's a slight trend downwards in received metrics but not as quick as I thought [08:08:27] I need to read file to calculate hash, but also need it created if it doesn't exist, so r+ isn't going to work. 
[08:08:28] * yuvipanda thinks [08:08:37] RECOVERY - Disk space on graphite2001 is OK: DISK OK [08:09:01] hmm, I could have r+ fail if file doesn't exist. [08:09:03] and then just write. [08:09:19] yuvipanda: if it exists you lock it and check, if it doesn't create exclusively [08:10:03] hmm, right - I guess I am just inventing my own optimistic locking [08:10:26] <_joe_> godog: so I think it's a statcache failure to detect changes on disk [08:10:32] <_joe_> I just touched that file [08:10:34] paravoid will look at the ulsfo issue in a few minutes [08:10:51] _joe_: ow :( [08:11:14] <_joe_> so, let's repeat the experiment, is 1206 still sending metrics? [08:11:49] <_joe_> or, lemme try one thing [08:12:04] _joe_: nope I'm seeing only regular metrics so far [08:12:22] <_joe_> godog: still seeing the stats, though? [08:12:55] _joe_: heh I meant only non-xhprof metrics which are sent regardless [08:13:20] hmm, so flock won't work because I'm not keeping my file handle open (nor do I want to) and it's not optimistic locking anyway... [08:13:23] * yuvipanda might be overthinking this [08:13:24] <_joe_> no I mean, you still have hosts sending xhprof metrics? [08:13:30] * yuvipanda lets people debug in peace for now [08:13:40] <_joe_> yuvipanda: look at what puppet does [08:13:46] yuvipanda: flock ;) [08:13:49] thrash everything and make people feel horrible? [08:13:50] <_joe_> I'm sure you can extract some antipattern from that [08:14:07] yuvipanda: postgres code!!!! [08:14:13] oh shit, yes [08:14:15] yuvipanda: :P [08:14:22] lol [08:14:23] <_joe_> godog: ^^ still seeing xhprof stats at all? [08:14:34] _joe_: sharp decline, double checking [08:14:48] <_joe_> ok if they're still there, I'm gonna try a thing [08:15:04] I am gonna turn into poe's raven and move around your head croaking "postgres! 
moar" [08:15:06] yeah mw1010, mw1016 for example [08:15:37] RECOVERY - puppet last run on cp3018 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [08:15:41] !log oblivian Synchronized wmf-config/StartProfiler.php: Null-sync to touch the file (duration: 00m 12s) [08:15:47] Logged the message, Master [08:15:53] <_joe_> now check again in 1 minute [08:16:28] (03PS1) 10Yuvipanda: [WIP] postgres: Provision credentials for all users / services [puppet] - 10https://gerrit.wikimedia.org/r/211091 [08:16:29] akosiaris: ^ [08:16:44] akosiaris: lots of todos [08:17:00] akosiaris: I can do all of them except > - Create a function that allows said accounts to create databases [08:17:14] and don't worry about the file name, that was its original purpose but not anymore [08:18:07] (03PS2) 10Yuvipanda: [WIP] postgres: Provision credentials for all users / services [puppet] - 10https://gerrit.wikimedia.org/r/211091 [08:18:18] yuvipanda: ok, thanks. I 'll pick it up from here [08:18:26] _joe_: some stragglers but practically it is over [08:18:32] akosiaris: thanks! and sorry about the superlong delay. [08:18:44] yuvipanda: no worries. Should you be in bed btw ? [08:18:47] I guess I took care of the one person who was bugging me (halfak) and everyone else was just bugging *you* [08:18:50] * godog figures yuvipanda and akosiaris team-tagging with highfive [08:18:56] godog: lol [08:19:00] akosiaris: yes, I should be, but I'm in the office and I am too lazy to get up and get on public transport... [08:19:03] I'll go shortly [08:19:20] akosiaris: we should do that at the next offsite! [08:19:22] what public transport ? isn't it like 01:19 ? [08:19:28] yeah, there are buses all night [08:19:32] really ? 
[08:19:36] impressed [08:19:37] yeah, just less frequent [08:19:41] at least on the route I am [08:19:52] the trains stop running tho [08:20:26] (03PS2) 10Giuseppe Lavagetto: ipsec: fix hiera data [puppet] - 10https://gerrit.wikimedia.org/r/211090 [08:20:33] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] ipsec: fix hiera data [puppet] - 10https://gerrit.wikimedia.org/r/211090 (owner: 10Giuseppe Lavagetto) [08:21:29] !log reenabling icinga check for MySQL on db1009 [08:21:36] Logged the message, Master [08:22:17] (03PS15) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [08:24:18] (03PS16) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [08:24:58] godog: mobrovac https://gerrit.wikimedia.org/r/#/c/211080/16/tools/common/tool.py line 54 is my hacky, almost-but-not-quite-race-free version. [08:25:05] I can't use flock because I don't want to keep the FD open all the time [08:25:24] the use pattern is 99% of the time it's just a read, with 1% of times there being writes as well, and these writes are well after the reads... [08:25:34] for some definition of 'well after'. [08:25:49] so mostly if I had read, it's ok for someone else to write, as long as I also do not write afterwards [08:26:02] I'm using a 'hash' to verify it this time, but mtime would work just as well and have just the same race problems [08:26:13] Coren: ^ unix programming quagmire if you want to look too :) [08:26:26] * Coren reads scrollback. [08:27:45] yuvipanda: Well, if you have a hash, the easy way to try for the write is to flock(); check the hash(); write(); then unlock. 
You don't need to keep a lock over the file just while you check so that it is "atomic" [08:28:13] yuvipanda: why not add a shared flock before f.read() and an exclusive one before the yaml dump [08:28:14] Coren: yeah the problem is I want to be able to write this even if the file isn't created yet. [08:28:38] yuvipanda: then just add another exclusive flock before the write in the except block [08:28:41] Coren: so I can't use r+ mode, and will have to reopen them... [08:28:54] You need to handle creation differently. Try to open() with O_CREAT|O_EXCL first - that only works if the file does not already exist and is atomic [08:29:04] mobrovac: mostly because I'm not keeping the fd open the entire time. I read and I close it immediately... [08:29:32] If that first open errors out with EEXIST then you know you can just open it for reading because it's already there [08:29:43] Coren: oh, hmm. but then is there a race between *that* failing and the file being deleted? [08:29:57] no, there isn't [08:30:15] yuvipanda: Not an important one. If the file gets unlink()ed it just creates a new one. [08:30:15] O_EXCL check is atomic [08:30:55] Right, O_CREAT|O_EXCL is guaranteed to give you a new file or fail. [08:31:35] Coren: aaah cool, but what mode do I use that both allows me to 1. read it so I can reverify the hash and 2. create it if it doesn't exist? [08:31:51] mobrovac: yeah but then it's a 'if that fails, then do something else' and the entire operation needs to be atomic. [08:32:38] yuvipanda: No, it doesn't have to. if(open(blah blah O_CREAT|O_EXCL)) { you have a new file } else if(open(blah blah)) { existing file, flock() then check hash } [08:32:39] hmm, maybe I can sidestep this entire question, just say 'do not mutate this file from an automated process not run manually by a user - if you do, then you get undefined behavior' and not run into this problem at all :) [08:32:57] Coren: oh, right. 
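The create-or-open pattern Coren describes above can be sketched in Python. This is a hypothetical helper, not the actual tools-webservice code: `write_if_unchanged` and its hashing scheme are illustrative. The key points are that `O_CREAT|O_EXCL` atomically either creates a brand-new file or fails with `EEXIST`, and that for an existing file you take an exclusive `flock` and re-verify the hash before writing:

```python
import fcntl
import hashlib
import os

def write_if_unchanged(path, data, expected_hash=None):
    """Write `data` to `path` only if its content still hashes to
    `expected_hash`; create the file atomically if it does not exist."""
    try:
        # O_CREAT|O_EXCL is atomic: either we create a brand-new file,
        # or the call fails with EEXIST -- there is no window in between.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        # File already exists: open it, take an exclusive lock, then
        # re-verify the hash before touching the contents.
        fd = os.open(path, os.O_RDWR)
        fcntl.flock(fd, fcntl.LOCK_EX)
        # A service manifest is tiny; one bounded read is enough here.
        current = hashlib.sha256(os.read(fd, 1 << 20)).hexdigest()
        if expected_hash is not None and current != expected_hash:
            os.close(fd)  # closing the fd also releases the flock
            raise RuntimeError("%s changed since it was read" % path)
        os.lseek(fd, 0, os.SEEK_SET)
        os.ftruncate(fd, 0)
    with os.fdopen(fd, "w") as f:
        f.write(data)
```

As noted later in the discussion, flock-based schemes get murkier on NFS, so this sketch assumes a local filesystem.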
[08:38:29] yuvipanda: yeah if your app is the only one supposed to touch the file and it handles its own mutual exclusion then it is much simpler to stick "# hey you, don't touch this" at the top [08:38:59] godog: yeah, am wondering if I should enforce that because people don't read documentation or not :) [08:39:22] also writing is a scarce enough operation that there isn't going to be a perf penalty.. [08:40:18] <_joe_> is this file on NFS? [08:40:19] yuvipanda: perhaps some indication on what they should do to actually get the changes they want via some other method [08:40:28] <_joe_> because that would make the whole problem more interesting [08:40:29] _joe_: hahahaaaaaaaaaa [08:40:30] oh dear [08:40:31] it is [08:40:35] I completely forgot about that [08:40:36] <_joe_> AHAHAHAHAH [08:40:42] of course I did [08:41:05] so, my solution now is to basically go: 'do not mutate files without direct user input. kthkxbye' [08:41:14] <_joe_> it's nfs4, so it's /doable/ [08:41:24] <_joe_> but I never really looked into it [08:41:28] yes, but not worth it, I think.. [08:41:45] what I have right now is kinda race free, but 'kinda race free' is the worst kind of race free [08:42:14] _joe_: thanks for pointing *that* out [08:42:29] <_joe_> I decided that getting rid of NFS for everything was the right thing before I installed nfsv4 [08:43:15] heh [08:43:19] we'll get there, I think. [08:43:21] maybe not everything [08:43:25] but lot of things [08:43:27] Actually, nfs4 speaks flock() well. [08:43:44] yuvipanda: But there is a MUCH easier solution anyways. [08:44:32] yuvipanda: Don't overwrite the file, ever. Always write the updated file to a new one (say, filename~) and rename() the new one over the old one. Guaranteed atomic and the worse that can happen is that someone who writes to the file that shouldn't simply gets ignored. 
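Coren's write-then-rename advice can be sketched as below. This is a minimal illustrative helper, not code from the actual patch; note that the temporary file is created in the same directory as the target, since rename() is only atomic within a single filesystem (the caveat Coren raises later in the log):

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace `path` in one step: readers see either the old contents
    or the new contents, never a half-written file."""
    directory = os.path.dirname(path) or "."
    # The temp file must live in the same directory as the target so
    # that rename() stays on one filesystem and is therefore atomic.
    fd, tmp = tempfile.mkstemp(dir=directory,
                               prefix=os.path.basename(path) + "~")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data on disk before the rename
        os.replace(tmp, path)  # atomic overwrite on POSIX
    except BaseException:
        os.unlink(tmp)  # don't leave the temp file behind on failure
        raise
```

With this shape a concurrent writer can lose its change, but the file on disk is always a complete, valid manifest -- "always consistent > always correct", as put below.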
[08:44:39] (03CR) 10Jcrespo: [C: 032] --no-version-check to be used by default on pt-online-schema-change [software] - 10https://gerrit.wikimedia.org/r/210863 (owner: 10Jcrespo) [08:44:50] Coren: yeah, but that makes the second 'write' succeed. I want it to fail. [08:45:07] (03PS3) 10Jcrespo: --no-version-check to be used by default on pt-online-schema-change [software] - 10https://gerrit.wikimedia.org/r/210863 [08:46:03] Coren: but I think I'm looking for too clever a solution. Plastering a 'WARNING' on it now [08:46:16] That also works. :-) [08:46:27] yeah :) [08:46:47] I think it's one of those things where I started out as 'oh this should be easy enough' and then went to 'oh, I learnt new things' to 'oh wait WHY' [08:47:05] <_joe_> yuvipanda: I sense this is an XY problem [08:47:11] yeah [08:47:12] <_joe_> what are you really trying to do? [08:47:29] webservice start [08:47:31] when I run that [08:47:31] yuvipanda: also make sure to include some pointers to the user on how to do what they'd likely wanted to do, or you get cranky users [08:47:39] I want it to write an entry to 'web: lighttpd' [08:47:41] on service.manifest [08:47:53] <_joe_> yuvipanda: flock, then [08:48:07] but, there is other data there, and other tools in the future can write to it (crontab -l should, for example) [08:48:12] <_joe_> yuvipanda: or better, use an sqlite3 db [08:48:16] <_joe_> or berkleydb [08:48:24] <_joe_> but beware of those on NFS [08:48:26] * ori proposes crontab-l@wikimedia.org [08:48:40] ori: I think that's already there, but aliased as root@ [08:48:49] haha [08:48:51] <_joe_> ori: do you want to receive that? [08:48:53] _joe_: eeegh, no. that's what gridengine does and we all hate it for that. [08:48:56] !log Updated cxserver to 1cb6cec [08:49:05] _joe_: but I think the better solution is to just... not write to it without user interaction. [08:49:06] <_joe_> yuvipanda: but that's the correct solution [08:49:07] Logged the message, Master [08:49:12] _joe_: what, bdb? 
[08:49:22] <_joe_> or sqlite3, yes. [08:49:25] <_joe_> I'd use bdb [08:49:31] <_joe_> I'm actually a fan of sorts [08:49:39] so basically if you ran crontab -e and changed your webservice type at the same instant, they're going to clobber your service.manifest file [08:49:47] * yuvipanda makes _joe_ administer gridengine [08:49:52] s/adminster/puppetize/ [08:49:54] <_joe_> I think I once wrote a distributed-pickle for python that used bdb [08:50:19] this file probably gets one write every year, on average, if there's activity [08:51:02] godog: yeah, will do. [08:52:27] yuvipanda: Silly question: why not store the manifest in mysql? [08:52:37] it's two lines. [08:52:48] and the eventual pattern is, you just have it in your git repo [08:52:52] and that's it. [08:53:03] webservice and crontab -e modifying it are just shims for backwards compatibility [08:53:03] Ah, yes, git repo means a flat file is good. [08:53:14] _joe_: btw jobrunners looks like are straggling, tried to touch StartProfiler.php on e.g. mw1007 but didn't do much [08:53:28] <_joe_> godog: umh [08:53:34] yuvipanda: But I agree that given that use pattern, locking is heavy overkill. [08:53:38] yes [08:53:46] <_joe_> godog: maybe they lack the stat_cache setting [08:53:54] I think it's basically me going 'I wonder how one would do that' and then bam there's 4x more code than needs be [08:53:59] yuvipanda: I would do the tmp-file-rename though to ensure it is always /valid/ [08:54:01] <_joe_> godog: ahah, no, the reason for that are probably llong running jobs [08:54:19] yuvipanda: Always consistent > always correct [08:54:31] oh, hmm [08:54:40] so two processes opening it up and then just writing to it immediately? [08:54:48] and hence ending up with jumbled shit? [08:54:53] Right. [08:55:06] note that the current code that this is replacing basically goes 'hahaha' and just does a simple write :D [08:55:08] This way, you lose one of the changes but the end result is known to be valid. 
[08:55:14] right [08:55:27] It's not serious complexity, it adds exactly one operation. [08:55:40] yeah [08:55:43] You don't write to "foo", you just write to "foo~" then rename("foo~", "foo") [08:56:39] _joe_: hah, anything we can do? [08:57:06] <_joe_> godog: I don't know [08:57:21] <_joe_> if this is still causing graphite serious issues, I can look into it better [08:57:42] Coren: I was just going to mkstemp and then move it [08:58:08] _joe_: no not anything serious [08:58:25] <_joe_> ok then I'd pass for now [08:58:34] yuvipanda: That's not guaranteed to work at all; for rename() to work (or mv, atomically) you need to guarantee that both files are on the same filesystem - which is only certain if they are in the same directory. [08:58:40] oh bah [08:58:41] right [08:58:56] Hence the foo~ "convention" [08:59:00] * yuvipanda feels his programmer skills rusting [08:59:10] (03PS3) 10Faidon Liambotis: Fix completely broken SSH host key collection [puppet] - 10https://gerrit.wikimedia.org/r/210926 [08:59:14] (That is, it's not a convention but it's so common it might as well be) [09:00:17] PROBLEM - carbon-cache write error on graphite1001 is CRITICAL 22.22% of data above the critical threshold [8.0] [09:00:26] PROBLEM - HHVM queue size on mw1169 is CRITICAL 100.00% of data above the critical threshold [80.0] [09:00:26] _joe_: I take that back, it is causing problems [09:01:00] i'll restart the jobrunner on those [09:02:12] ori: what can we else clean up? hook_* and wf* ? [09:02:31] those are pretty damn important [09:02:34] Coren: hmm, I wonder if we're actually gaining anything with the service.manifest~ file [09:02:36] i mean, we can do without them [09:02:50] but they should be pretty low on the list of things we don't measure [09:02:52] hmm, I guess I should create it with O_CREAT | O_EXCL [09:02:59] yuvipanda: Yeah.
:-) [09:03:01] because otherwise then two files can just fuddle service.manifest~ [09:03:12] ori: I guess my question is, what other metrics did xhprof generated we can clean up now? [09:03:19] yuvipanda: Sorry, that was implicit in my mind. :-) [09:03:50] Coren: :) I have to very rarely write programs that need to care about all these so it's not in RAM... [09:03:55] Coren: let me finish it up so you can take a look [09:04:03] * Coren nodsnods [09:04:55] !log restarted hhvm / jobrunner on jobrunners to force them to pick up I6a516a0da ; re-cleared /var/lib/carbon/whisper/MediaWiki/query_* on graphite1001 and graphite2001 [09:05:00] Logged the message, Master [09:05:10] e.g MediaWiki.run_init ? [09:05:29] i'd like to re-enable those as soon as we can, to be honest [09:05:41] the initialization time of mediawiki is pretty important, right? [09:06:03] I agree but there's what 4h of data in there? [09:06:04] 184G MediaWiki [09:06:07] 151G cassandra [09:06:12] (03PS4) 10OliverKeyes: Add fluorine rsync connector [puppet] - 10https://gerrit.wikimedia.org/r/209684 [09:06:39] I mean sure, go ahead and nuke them [09:07:39] !log rm MediaWiki.run_init from graphite1001 / graphite2001 [09:07:44] Logged the message, Master [09:08:27] PROBLEM - HHVM busy threads on mw1169 is CRITICAL 100.00% of data above the critical threshold [115.2] [09:08:36] PROBLEM - carbon-cache write error on graphite1001 is CRITICAL 22.22% of data above the critical threshold [8.0] [09:08:37] PROBLEM - HHVM queue size on mw1169 is CRITICAL 100.00% of data above the critical threshold [80.0] [09:09:31] Coren: hmm, so in case of a conflict, we'll leave behind a services.manifest~ file [09:09:37] Coren: I guess that's a good thing, and the user should clean up [09:09:54] err [09:09:55] not a conflict [09:10:06] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 3.33% of data above the critical threshold [1000.0] [09:10:45] hmm, so service.manifest~ will be left behind when the open 
succeeds but the rename fails for some reason [09:10:55] and then further commands will fail until the service.manifest~ file is cleaned up [09:10:59] that seems ok [09:11:57] (03PS17) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [09:12:27] further *write* commands [09:14:59] (03PS18) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [09:15:04] <_joe_> yuvipanda: how is your wheel building? [09:15:19] (03CR) 10coren: "Looks sane." (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 (owner: 10Yuvipanda) [09:15:21] _joe_: wheel? [09:15:29] (03CR) 10coren: [C: 031] Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 (owner: 10Yuvipanda) [09:15:36] <_joe_> the one you're reinventing :P [09:15:46] !log restart hhvm on mw1018, straggling [09:15:49] yuvipanda: 'tilda'? First time I see it spelt that way. [09:15:52] Logged the message, Master [09:16:12] Coren: oh wow, I didn't know it was spelt with an 'e' [09:16:17] have always heard it pronounced 'tilda' [09:16:32] _joe_: no, I decided I didn't actually need a wheel but a notice saying 'UGA ONLY WALK NO ROLL' [09:16:51] <_joe_> ok fair enough :P [09:16:57] yuvipanda: The final vowel is a shwa - if you've never seen it written down it's ambiguous. The error is quite understandable. [09:16:57] <_joe_> I'm just trolling you anyways [09:17:19] <_joe_> Coren: not in any latin language [09:17:28] _joe_: :P I need to understand 'unix programming' better, but that involves writing more code that is at that level, and I don't get opportunity to do that much [09:18:47] _joe_: English often have vowels reduce to /ə/, especially final vowels. 
[09:18:55] (03PS19) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [09:19:02] <_joe_> Coren: I know, we use "e" for that [09:19:31] _joe_: French speaker, remember? :-) I know. :-) [09:19:36] <_joe_> ahah right [09:19:42] <_joe_> you're all yankees to me [09:19:46] <_joe_> yuvi is too now [09:20:15] <_joe_> (and yes, I'm trolling you :)) [09:20:21] _joe_: don't worry, I'll be spending two months in Glasgow. [09:20:26] err [09:20:27] (03CR) 10Jcrespo: [V: 032] --no-version-check to be used by default on pt-online-schema-change [software] - 10https://gerrit.wikimedia.org/r/210863 (owner: 10Jcrespo) [09:20:27] 1 month [09:20:34] by the time I'm done even I won't be able to understand my accent [09:20:40] <_joe_> ahah [09:20:53] likely because you'll be drunk 99% of the time [09:21:25] <_joe_> yuvipanda: https://www.youtube.com/watch?v=nwc6BisdTBI [09:23:17] (03PS20) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [09:24:14] is it normal for gerrit temporarilly showing me the wrong diff? I was going crazy [09:24:27] <_joe_> jynus: not really [09:24:52] maybe it was me, that I was on the wrong tab [09:25:55] _joe_: not sure if I should be happy or sad that that entire video made full sense to me, and I didn't find the accent that bad. [09:26:15] <_joe_> yuvipanda: that is intended to sound understandable, of course [09:26:22] yeah, I figured. [09:26:42] (03CR) 10Merlijn van Deen: [C: 04-1] "It might make more sense to generalize this. Currently all" (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211066 (owner: 10Yuvipanda) [09:28:01] valhallasw: I think you accidenta [09:28:32] (03CR) 10Merlijn van Deen: "Ignore cover message, acquire inline message." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211066 (owner: 10Yuvipanda) [09:28:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 8 below the confidence bounds [09:28:50] yuvipanda: 'we can genera... or we can wait for two more years until 16.04' [09:29:09] valhallasw: yeah, let's wait two years :) [09:29:09] yuvipanda: or for when we start doing jessie [09:29:20] I don't think we will be doing jessie while still on gridengine [09:29:24] *nod* [09:29:29] !log restart carbon on graphite1001 [09:29:29] why not? [09:29:30] valhallasw: also, even then, we should be using lighttpd-precise and lighttpd-jessie [09:29:36] Logged the message, Master [09:29:49] yuvipanda: yes, but how about the uwsgi/nodejs ones :P [09:29:56] paravoid: mostly because we haven't even finished precise migration because a lot of tools require 5.3 [09:30:02] oh, just add those and not use --release [09:30:03] valhallasw: well, I think we shouldn't mix 3 distros. [09:30:31] valhallasw: yup. there's already a lighttpd-precise - --release is documented and people have been using it for a while now, and so I kept it as 'backwards compat'. it is suppressed in argparse output [09:30:49] valhallasw: you are right, I should error some out tho. let me do that [09:30:55] yuvipanda: oh! but shouldn't we make sure to name the release in the other ones already, then? [09:30:58] err, error out the things that specify precise for things not lighttpd [09:31:01] yuvipanda: as in 'uwsgi-trusty'? [09:31:27] valhallasw: no, I think 'uwsgi' should be just the 'current default'. [09:31:55] valhallasw: and we can introduce variants and move them around when we have new ones. also, backwards compat again.
[09:31:56] yuvipanda: mm, right, and then force-upgrade people at some point [09:32:11] 6operations, 10Traffic: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1287413 (10faidon) 3NEW [09:32:16] yuvipanda: backwards compat is not an issue, you can just solve that with your --release param [09:32:21] valhallasw: basically, yeah. for uwsgi I think people should use virtualenvs as much as possible anyway... [09:32:22] no, actually [09:32:38] no, people use 'webservice uwsgi-python restart' so we shouldn't break that [09:32:55] yuvipanda: --release param, again. The same thing you're doing for lighttpd.... [09:33:18] yuvipanda: basically, consider this setup.... [09:33:39] go on [09:33:41] yuvipanda: all webservice types have a release in the name explicitly, e.g. lighttpd-precise, lighttpd-trusty, etc. [09:33:58] yuvipanda: webservice X restart will try webservice type X and X+--release [09:34:13] yuvipanda: and we switch the default release by changing what --release has as default [09:34:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:34:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:34:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:34:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:34:26] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:34:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:34:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] 
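valhallasw's lookup scheme above could be sketched like this. The registry contents and helper name are purely illustrative (this is not the actual tools-webservice code): every concrete type carries its release in its name, `webservice X` tries `X` verbatim and then `X-<release>`, and flipping the fleet default is a one-line change:

```python
# Hypothetical registry: every concrete type names its release explicitly.
WEBSERVICE_TYPES = {
    "lighttpd-precise",
    "lighttpd-trusty",
    "uwsgi-python-trusty",
}

DEFAULT_RELEASE = "trusty"  # changing the default release changes only this

def resolve_type(name, release=None):
    """Resolve `webservice <name> ...` to a concrete type: try the
    name verbatim first, then with the (default) release appended."""
    if name in WEBSERVICE_TYPES:
        return name
    candidate = "%s-%s" % (name, release or DEFAULT_RELEASE)
    if candidate in WEBSERVICE_TYPES:
        return candidate
    raise ValueError("unknown webservice type: %s" % name)
```

Under this scheme `webservice lighttpd restart` keeps working unchanged, while `--release precise` pins users who depend on a specific distro.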
[09:34:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:35:10] yuvipanda: also your manifest is anti-http://12factor.net/config :P [09:35:26] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:35:26] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:35:26] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:35:27] valhallasw: hmm, so this makes migration easy in the future, but also means I've to do a migration right now - service.manifest files all around have web: lighttpd in them [09:35:41] yuvipanda: hm, that's inconvenient as well. [09:35:42] valhallasw: that's such a strawman I'm not going to respond to that :P [09:35:52] valhallasw: yeah, so it's basically 'migrate now' vs 'migrate when we add new distro' [09:35:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:36:03] valhallasw: and I'm inclined to do the latter.
[09:36:11] yuvipanda: however, we *should* add an explicit trusty option [09:36:25] yuvipanda: for people who have compiled stuff, for instance, because they know they are linked to a specific distro [09:36:37] RECOVERY - carbon-cache write error on graphite1001 is OK Less than 1.00% above the threshold [1.0] [09:36:38] yuvipanda: which is, by the way, everyone with a virtualenv, because venvs break on distro upgrades [09:36:43] yeah [09:36:55] well, in general it shouldn't but IIRC trusty changed something in pip [09:37:02] (if you're using only pure python modules) [09:37:09] I'm pretty sure the issues were not in pip :P [09:37:22] venvs are not portable, never were supposed to be [09:38:01] I remember seeing an ubuntu bug about it, but maybe it was a poor soul who assumed otherwise. [09:38:15] anyway, this discussion should be in a separate patch / bug and not on that patch, I think [09:38:36] since that one is specifically just about making sure we don't break users' current commandline habits [09:38:40] *nod* [09:39:09] valhallasw: https://gerrit.wikimedia.org/r/#/c/211080/ is a bit more meaty, and currently suffering from a subtle bug [09:39:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [09:39:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [09:39:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [09:39:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [09:39:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [09:39:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [09:39:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0]
[09:39:56] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [09:39:57] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [09:40:07] RECOVERY - Disk space on graphite1001 is OK: DISK OK [09:40:09] yuvipanda: precise venv on trusty: ImportError: No module named datetime [09:40:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [09:40:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [09:40:34] I was seeing bugs like https://bugs.launchpad.net/ubuntu/+source/python-pip/+bug/1375357 but I guess that's just a symptom of 'everything breaks' [09:40:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [09:40:36] he [09:40:37] h [09:41:09] yuvipanda: that's a sudo pip one, so not a venv :-) [09:41:21] that's just one I found now [09:41:33] and I guess I should stop trying to write code at 2AM. 
[09:41:43] :D [09:43:39] (03PS21) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [09:45:21] (03PS1) 10Dereckson: Imported logo for Wikimedia User Group China [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211094 (https://phabricator.wikimedia.org/T98676) [09:46:34] (03PS2) 10Dereckson: Imported logo for Wikimedia User Group China [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211094 (https://phabricator.wikimedia.org/T98676) [09:49:22] (03PS22) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [09:49:24] (03PS3) 10Yuvipanda: Support --release param for backwards compatibility [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211066 [09:49:29] valhallasw: ^ [09:50:42] valhallasw: you're right about lighttpd should properly be lighttpd-trusty. I wonder if we can capture / specify this info in ways that isn't just a string, though. [09:51:22] (03CR) 10Merlijn van Deen: [C: 04-1] "needs debian python-yaml dependency" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 (owner: 10Yuvipanda) [09:51:28] yuvipanda: *nod* [09:51:48] yuvipanda: could you stop pushing a gazillion patchsets :P [09:52:01] valhallasw: that's how I test them :P push to gerrit, then pull [09:52:11] (03CR) 10Merlijn van Deen: Add support for service manifests (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 (owner: 10Yuvipanda) [09:52:14] valhallasw: also, no, it doesn't need python-yaml dependency. that's automatically figured out by pybuilder [09:52:16] > Depends: python (>= 2.7), python (<< 2.8), python:any (>= 2.7.1-0ubuntu2), python-yaml [09:52:22] on the package [09:52:40] I assume you mean dh_python? 
[09:55:19] yuvipanda: the deb package I generate certainly doesn't have that Depends: line [09:55:29] oh wait [09:55:38] that was the old 0.1-1 [09:55:42] (03PS23) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [09:55:48] (03CR) 10Yuvipanda: Add support for service manifests (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 (owner: 10Yuvipanda) [09:55:57] (03CR) 10jenkins-bot: [V: 04-1] Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 (owner: 10Yuvipanda) [09:56:31] (03PS24) 10Yuvipanda: Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 [09:57:08] valhallasw: ah, ok. [09:57:22] yuvipanda: ok, not sure why it works, but it works [09:57:32] yuvipanda: basically, the dh_python2 docs say 'we only do requires.txt' [09:57:59] hmm [09:58:06] > dh_python2 and dh_python3 will correctly fill in the installation dependencies (via ${python:Depends and ${python3:Depends respectively), but it cannot fill in the build dependencies. Take extra care in getting this right, and double check your build logs for illegal access to pypi.python.org. [09:58:09] from https://wiki.debian.org/Python/Pybuild [09:58:33] yuvipanda: yeah, but that doesn't tell you where it gets them from [09:58:40] yuvipanda: man dh_python2 is pretty explicit about that [09:58:59] ohhh [09:59:00] ohhhh! [09:59:07] requires.txt in egg-info [09:59:26] everything makes sense now [09:59:44] setup.py sdists makes the egg-info dir, which contains requires.txt, which is read by dh_python2 [09:59:46] good! [09:59:59] sweet :D [10:00:16] I definitely like pybuilder better than stdeb [10:00:19] err [10:00:20] pybuild [10:01:10] yuvipanda: as for the default webservice [10:01:17] isn't that supposed to be just for bigbrother? 
[10:01:42] valhallasw: no, it basically means if I did 'webservice nodejs start' once, next time I can just do 'webservice restart' and it'll just work [10:02:03] mmm. [10:02:03] and not try to start a lighttpd server [10:02:06] but not stop and start [10:02:16] no, it works for stop too [10:02:24] but not for start because you need to tell it what to start [10:02:27] no, you delete the web entry on stop [10:02:29] since stop removes the web: entry [10:02:33] yeah, that's what I mean [10:02:40] yeah, but you don't need webservice nodejs stop [10:02:43] webservice stop will do [10:02:59] mm, right [10:03:14] yuvipanda: also, probably add a comment to the manifest file that one should not edit manually? [10:03:47] valhallasw: hmm, I'm not so sure (yet) about that. also, I don't know how pyyaml is going to deal with roundtripping comments [10:03:51] probably not very well [10:03:57] yuvipanda: you just need to add it on write [10:04:24] maybe in another patch :) Also, I think if I add crontab -e functionality to service.manifest, it's ok for people to edit [10:04:30] no, it's not [10:04:35] because a) you'll kill their comments [10:04:41] and b) they might edit the web manifest [10:04:53] either it's human-editable, or computer-generated. not both. [10:05:04] it's human editable with a computer helper for backwards compat [10:05:06] either way [10:05:12] that's a discussion for when we come to doing crontab -e [10:05:13] and not for now [10:05:34] " it's human editable with a computer helper for backwards compat" ??? [10:05:42] and no, adding that warning comment /is/ something to do now [10:06:17] yuvipanda: you can just write "# this file is computer-generated; do not edit manually\n" to the file object, then let pyyaml write [10:06:37] valhallasw: please make a patch if you feel strongly about it. I don't. [10:06:42] I'll happily review and merge. [10:06:50] well, I don't at least right now. [10:07:38] it's a single line in an editor you already have open...
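The behavior yuvipanda describes above, where start records the web type so a bare `webservice restart`/`webservice stop` does the right thing, and stop deletes the entry so nothing resurrects the service, might look roughly like this. This is an illustrative sketch with invented function names, using a JSON file in place of the real YAML service.manifest:

```python
import json
import os
import tempfile

def start(manifest_path, webtype):
    # record what was started, so a later bare 'webservice restart'
    # needs no type argument
    with open(manifest_path, 'w') as f:
        json.dump({'web': webtype}, f)

def stop(manifest_path):
    # stop removes the web: entry, so the monitor won't restart the service
    if os.path.exists(manifest_path):
        os.remove(manifest_path)

def restart(manifest_path):
    with open(manifest_path) as f:
        return json.load(f)['web']  # e.g. 'nodejs', not the lighttpd default

demo = os.path.join(tempfile.gettempdir(), 'manifest_demo.json')
start(demo, 'nodejs')
print(restart(demo))            # -> nodejs
stop(demo)
print(os.path.exists(demo))     # -> False
```

The same record is what a webservicemonitor-style daemon would consult, which is exactly why deleting it on stop matters: stop must mean stopped.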
but fine. [10:07:54] valhallasw: actually, let me recant that. I think eventually service.manifest is something that people should edit manually [10:07:58] and feel ok with editing manually. [10:08:04] yuvipanda: then we shouldn't write to it [10:08:22] ok, so come up with a solution that makes webservice start stick and webservice stop not try to start it back up again. [10:08:33] s/come up with/we need an alternative solution that/ [10:08:35] yuvipanda: then we should just print 'to make this webservice default, please add this line to your manifest.' [10:08:42] as for stopping the right one... [10:09:00] valhallasw: and then you're back to bigbrother and why 'opt in' for 'keep my webservice running forever' is cause for heartburn [10:09:26] that's true [10:09:27] valhallasw: it's not for stopping the right one - bigbrother was impossible to *opt out* of. [10:09:46] valhallasw: restarting .bigbrotherrc had no effect, you had to poke an admin to restart bigbrother itself. [10:09:52] *nod* [10:09:57] so webservice stop should just stop the webservice [10:10:03] and not have something else starting it back up... [10:10:38] so you're totally right that as is, it isn't human editable, because that just causes problems, but we need to carefully think about them and then figure out what to do [10:10:41] yuvipanda: I see two options. A .webservice_status file which contains basically what you now have in the manifest [10:10:43] yuvipanda: orrrrr [10:11:13] yuvipanda: in the manifest, use something that doesn't munge comments, and add a line 'status: ' after the 'web: type: ....'
[10:11:24] another option would be [10:11:29] to have a 'manifest reload' [10:11:33] or something along those lines [10:11:41] that just interprets the service manifest and does the right thing [10:11:48] so if you changed 'web: lighttpd' to 'web: nodejs' by hand [10:11:53] you can do a 'manifest reload' [10:12:13] and then the webservice commandline client can just be implemented as 'mutate service.manifest and then call manifest reload' [10:12:15] meh, that's also complicated [10:12:27] yeah but it keeps service.manifest clean [10:12:37] I'd just go for a .webservice_status file instead of a manifest, which is used for restart/stopping the service [10:12:42] and a manifest for the default start action [10:12:45] not as clean :) [10:12:56] manifest is also used for tools-manifest, remember. [10:13:01] in fact that was its primary use case [10:13:05] tools-manifest is a terrible name [10:13:08] webservicemonitor [10:13:12] is the actual service. [10:13:14] yeah, so? [10:13:24] that one should just check .webservice_status [10:13:26] shit, it's way past 3AM and I'm still in the office. [10:13:31] but it's confusing either way [10:13:39] I'd just make the manifest non-human-editable for now [10:13:50] no, 'temporary things tend to stick around permanently' [10:13:54] so I am going to disagree on that [10:14:03] and let it be. [10:14:15] and let me go home as well :) [10:14:29] yuvipanda: .... [10:14:56] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Service_manifests needs expansion [10:15:08] and then we'll mention it there that you shouldn't hand edit them for now. [10:15:27] I really don't see the issue of adding a comment to the file [10:15:36] valhallasw: have you seen http://etherpad.wikimedia.org/toollabs-ideal-getting-started [10:15:40] ...which will be removed the next webservice run anyway once it's removed from code. [10:16:02] valhallasw: sigh. fine.
[10:16:15] err [10:16:16] actually [10:16:20] doing it in save_manifest [10:16:26] won't affect all the files that currently exist [10:16:30] which is quite a few hundred of them [10:16:35] so some files will have a comment and some won't [10:16:56] so what? [10:17:29] and if you feel strongly about it, submit a patch. I'm going home. [10:17:35] good night [10:17:49] (no, really, I was going to leave at 3AM and I feel like we'll argue until it's 5AM) [10:18:36] and make sure it works for pre-existing files that aren't going to be written to as well [10:18:50] ok, so before I go - some more context about that etherpad link [10:18:53] 'it works'? [10:19:02] 'it works'? [10:19:14] !log bounce statsite and uwsgi on graphite1001 [10:19:20] Logged the message, Master [10:19:24] you shouldn't end up with some service.manifest files with a comment [10:19:26] and most without... [10:19:29] since that's just confusing [10:20:57] (03PS1) 10Merlijn van Deen: Add warning comment to manifest file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211098 [10:21:26] No, I'm not going to edit 100s of files to add the comment you should have added in the first place :P [10:22:02] .... [10:22:38] (03CR) 10Yuvipanda: [C: 04-1] "This won't affect all the service.manifest files that already exist, and hence will be fairly confusing." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211098 (owner: 10Merlijn van Deen) [10:33:21] (03PS2) 10Merlijn van Deen: Add warning comment to manifest file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211098 [10:33:57] (03CR) 10Merlijn van Deen: "If those need to be changed, that isn't part of this patchset. 
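The sticking point above is that adding the comment only at write time leaves the few hundred pre-existing service.manifest files bare. One hedged way to square that, sketched here with an invented helper name rather than the actual patch, is an idempotent prepend run whenever a manifest is touched, so old and new files converge on the same first line:

```python
import os
import tempfile

WARNING = '# This file is used by webservice; do not edit manually.\n'

def ensure_warning(path):
    """Prepend the warning comment unless it is already the first line,
    so manifests written before the change end up consistent with new ones."""
    with open(path) as f:
        contents = f.read()
    if not contents.startswith(WARNING):
        with open(path, 'w') as f:
            f.write(WARNING + contents)

demo = os.path.join(tempfile.gettempdir(), 'manifest_demo.yaml')
with open(demo, 'w') as f:
    f.write('web: lighttpd\n')
ensure_warning(demo)
ensure_warning(demo)  # idempotent: running twice adds only one comment
print(open(demo).read().count(WARNING))  # -> 1
```

Since PyYAML does not round-trip comments, the comment has to be written to the file object before the YAML dump, exactly as suggested earlier in the discussion.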
This will just make sure the comment is there at a subsequent webservice ru" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211098 (owner: 10Merlijn van Deen) [10:44:20] <_joe_> (btw, adding a line at the start of a file via unix commands is one of my traditional interview questions to ops, it usually gets interesting answers). [10:47:03] (03CR) 10Merlijn van Deen: [C: 04-1] Support --release param for backwards compatibility (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211066 (owner: 10Yuvipanda) [10:49:19] (03CR) 10Merlijn van Deen: [C: 032] Add support for service manifests [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/211080 (owner: 10Yuvipanda) [11:07:18] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [11:09:20] (03Draft4) 10Dereckson: WIP: cn.wikimedia.org initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211103 (https://phabricator.wikimedia.org/T98676) [11:20:37] (03CR) 10Glaisher: "Why is this separate?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211094 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [11:32:11] (03CR) 10Dereckson: "This commit imports a static resource from Wikimedia Commons, at the disposal of the repository." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211094 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [11:34:56] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1287522 (10Glaisher) [11:45:02] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1287532 (10Dereckson) # We need to prepare the initial wiki configuration. A first draft to review with sensible options is done, and op... 
[11:49:08] 6operations, 7Graphite: limit the impact of many new metrics being pushed to graphite - https://phabricator.wikimedia.org/T99233#1287541 (10fgiunchedi) 3NEW a:3fgiunchedi [11:50:14] 6operations, 7Graphite: improve graphite operational documentation - https://phabricator.wikimedia.org/T99234#1287549 (10fgiunchedi) 3NEW a:3fgiunchedi [11:50:50] (03PS1) 10Dereckson: Added cn.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/211109 [11:51:18] (03PS11) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) [11:51:31] (03PS2) 10Dereckson: Added cn.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/211109 (https://phabricator.wikimedia.org/T98676) [11:52:23] (03CR) 10coren: [C: 031] "This is ready to deploy once labstore1001 has been rejiggered to use raid10" [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [11:53:13] bblack: If you have a few minutes to take a peek at the rejoggered https://gerrit.wikimedia.org/r/#/c/209558/ , I'd appreciate it [11:57:03] (03CR) 10coren: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/211063 (owner: 10Andrew Bogott) [11:58:14] (03PS1) 10Dereckson: Added cn.wikimedia.org in Apache vhosts configuration [puppet] - 10https://gerrit.wikimedia.org/r/211112 (https://phabricator.wikimedia.org/T98676) [11:59:38] (03CR) 10coren: [C: 031] "I'm not fond of having the mapping in puppet (as I wasn't of the comparable scheme with dnsmasq), but short of having actual split-horizon" [puppet] - 10https://gerrit.wikimedia.org/r/211059 (owner: 10Andrew Bogott) [12:02:16] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1287574 (10Dereckson) [12:02:44] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User 
Group China - https://phabricator.wikimedia.org/T98676#1287575 (10AddisWang) >>! In T98676#1287532, @Dereckson wrote: > # We need to prepare the initial wiki configuration. A first draft to r... [12:21:04] !log elastic1020 es-tool restart-fast [12:21:09] Logged the message, Master [12:24:25] (03PS5) 10Dereckson: cn.wikimedia.org initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211103 (https://phabricator.wikimedia.org/T98676) [12:26:07] (03CR) 10Dereckson: "PS5: dual license, php-1.26wmf5 added to wikiversions.json (valid until next Tuesday)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211103 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [12:35:26] 6operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1287607 (10Reedy) We also use it for fixing some erroneous results in the Google Search; such as when they're marked as being hacked or similar by Google [12:44:01] (03CR) 10coren: [C: 032] Add wb_changes_subscription and wbc_entity_usage tables [software] - 10https://gerrit.wikimedia.org/r/210057 (https://phabricator.wikimedia.org/T98748) (owner: 10Aude) [12:46:45] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1287639 (10Dereckson) @greg Could we have a window to create this wiki? Configuration changes (Apache, DNS, config) are ready (I have no... [12:47:34] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1287648 (10Dereckson) [12:49:25] thanks Coren [12:50:08] aude: Is it expected that not all databases have a wbc_entity_usage table? 
[12:50:33] (03CR) 10Glaisher: Added cn.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/211109 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [12:51:15] Coren: i think yes, expected [12:51:26] * Coren nods. [12:51:40] something like wikispecies doesn't have wikibase at all [12:51:41] yet [12:51:56] I didn't worry overmuch because of the obvious patterns (wikiversities for instance) [12:52:05] yeah, they don't have wikibase yet also [12:52:29] do i need to poke you when they do get it? [12:52:45] or would it be automatic that the table gets replicated when we create it [12:54:51] (03PS2) 10Merlijn van Deen: phabricator: Add priority keywords/labels for !priority email command [puppet] - 10https://gerrit.wikimedia.org/r/209445 (https://phabricator.wikimedia.org/T98356) [12:54:53] aude: It's semi automatic. They'll get it the next time maintain-replicas is run but that's relatively infrequent. If you poke me I can run it immediately to avoid the delay. [12:55:10] ok [13:02:47] PROBLEM - puppet last run on cp4013 is CRITICAL puppet fail [13:03:41] 6operations, 10Hackathon-Lyon-2015, 10Wikimedia-Site-requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1287684 (10jcrespo) >>! In T21986#1283130, @Reedy wrote: > Can't you just do CREATE DATABASE newdb; foreach table in olddb { RENAME TABLE oldd... [13:04:30] (03CR) 10Merlijn van Deen: [C: 031] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [13:04:46] (03PS3) 10Merlijn van Deen: Tools: Puppetize database aliases as host resources [puppet] - 10https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: 10Tim Landscheidt) [13:11:01] (03CR) 10Merlijn van Deen: [C: 04-1] "what's the advantage of using hiera() instead of passing it as variable in the class definition? 
If you want to fall back to $::ssh_hba, y" [puppet] - 10https://gerrit.wikimedia.org/r/209993 (https://phabricator.wikimedia.org/T98714) (owner: 10Yuvipanda) [13:11:18] (03PS1) 10KartikMistry: CX: Enable 'cxstats' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211116 [13:18:58] RECOVERY - puppet last run on cp4013 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:43:39] jynus: I’m here now. What was in that extra 87Gb? [13:44:04] andrewbogott, talk to you in private [13:44:09] 'k [13:47:48] jynus: welcome! [13:47:51] hi andrewbogott :) [13:48:01] * andrewbogott waves hello [13:48:12] matanya, hi, thank you! [13:48:56] I wish you best of luck [13:49:34] I will need it! Lots of things to learn and do [13:50:12] springle will be very useful and helpful, don't worry :D [13:50:23] He has already been! [13:51:11] but one feels stupid with basic things, hopefully only for some days [13:51:36] PROBLEM - Host tellurium is DOWN: PING CRITICAL - Packet loss = 100% [13:51:54] yes, our DB things are complicated, i am sure you will be just fine shortly [13:52:48] why schedule maintenance if it's still going to alarm ugh! re: tellurium [13:52:54] ah it was you then [13:53:05] I just got the page [13:53:45] yeah, it's me...i schedule 2 hour window on icinga [13:55:01] (03PS1) 10Andrew Bogott: Use m5-master.eqiad.wmnet for the openstack/labs db server [puppet] - 10https://gerrit.wikimedia.org/r/211120 [13:57:43] (03CR) 10Andrew Bogott: [C: 032] Adding master host CNAME for new mariadb shard on labs (m5) This is required for MySQL scripts to work properly. 
[dns] - 10https://gerrit.wikimedia.org/r/210859 (https://phabricator.wikimedia.org/T92693) (owner: 10Jcrespo) [14:00:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL puppet fail [14:03:26] RECOVERY - Host tellurium is UP: PING OK - Packet loss = 0%, RTA = 1.28 ms [14:05:07] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL puppet fail [14:05:16] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL puppet fail [14:05:16] PROBLEM - check_puppetrun on payments1001 is CRITICAL puppet fail [14:05:21] (03CR) 10Andrew Bogott: [C: 032] Use m5-master.eqiad.wmnet for the openstack/labs db server [puppet] - 10https://gerrit.wikimedia.org/r/211120 (owner: 10Andrew Bogott) [14:05:27] PROBLEM - check_puppetrun on bismuth is CRITICAL puppet fail [14:10:08] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL puppet fail [14:10:08] PROBLEM - check_puppetrun on lutetium is CRITICAL puppet fail [14:10:08] PROBLEM - check_puppetrun on db1008 is CRITICAL puppet fail [14:10:09] PROBLEM - nova-compute process on virt1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:10:16] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL puppet fail [14:10:16] PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail [14:10:16] PROBLEM - check_puppetrun on payments1003 is CRITICAL puppet fail [14:10:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL puppet fail [14:10:17] PROBLEM - check_puppetrun on payments1004 is CRITICAL puppet fail [14:10:36] PROBLEM - check_puppetrun on bismuth is CRITICAL puppet fail [14:10:46] PROBLEM - nova-compute process on virt1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:11:06] PROBLEM - nova-scheduler process on virt1000 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler [14:11:27] PROBLEM - Host tellurium is DOWN: PING CRITICAL - Packet loss = 100% [14:11:47] PROBLEM - nova-compute process 
on virt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:12:15] <_joe_> Jeff_Green: that you? [14:12:41] <_joe_> also andrewbogott what's un with the nova alarms? [14:12:49] <_joe_> *up [14:12:54] _joe_: that’s me, sorry [14:13:05] _joe_ that's me [14:13:06] <_joe_> andrewbogott: is it expected I mean? [14:13:07] PROBLEM - nova-compute process on virt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:13:12] andrewbogott, I will schedule downtime for you [14:13:18] cmjohnson1: payments etc is you? [14:13:30] _joe_: yeah, sort of. I’m running into trouble with dns cache [14:13:37] the puppet fails is me [14:13:43] ok thanks [14:13:49] <_joe_> cmjohnson1: oh right [14:14:56] PROBLEM - check_puppetrun on backup4001 is CRITICAL puppet fail [14:15:07] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL puppet fail [14:15:07] PROBLEM - check_puppetrun on db1008 is CRITICAL puppet fail [14:15:07] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 44 failures [14:15:08] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [14:15:16] PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail [14:15:16] PROBLEM - check_puppetrun on indium is CRITICAL puppet fail [14:15:16] PROBLEM - check_puppetrun on lutetium is CRITICAL puppet fail [14:15:17] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL puppet fail [14:15:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL puppet fail [14:15:17] PROBLEM - check_puppetrun on db1025 is CRITICAL puppet fail [14:15:17] PROBLEM - check_puppetrun on payments1003 is CRITICAL puppet fail [14:15:17] PROBLEM - check_puppetrun on silicon is CRITICAL puppet fail [14:15:18] PROBLEM - check_puppetrun on payments1004 is CRITICAL puppet fail [14:15:20] (03PS1) 10Andrew Bogott: Reverting just for a moment, until dns sorts itself out. 
[puppet] - 10https://gerrit.wikimedia.org/r/211122 [14:15:27] PROBLEM - check_puppetrun on bismuth is CRITICAL puppet fail [14:16:09] _joe_: I’m moving db access to m5-master.eqiad.wmnet, which, that name resolved when I checked, but now that I’ve checked 10 times it turns out to resolve only 1 time out of 3. [14:16:35] (03CR) 10Andrew Bogott: [C: 032] Reverting just for a moment, until dns sorts itself out. [puppet] - 10https://gerrit.wikimedia.org/r/211122 (owner: 10Andrew Bogott) [14:16:52] <_joe_> andrewbogott: you got to account for a) dns caches b) persistent connections c) broken clients that will cache dns indefinitely [14:17:06] yeah. Should’ve waited longer after the dns update. [14:17:46] PROBLEM - nova-compute process on labvirt1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:18:24] (03PS16) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [14:18:24] <_joe_> andrewbogott: what's the impact of this? [14:18:27] <_joe_> is labs down? [14:18:34] no [14:18:48] only instance creation/deletion. And possibly new wikitech logins. [14:18:56] Anyway, I’m reverting, will be fixed in a few seconds. 
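A quick way to observe the kind of intermittent resolution andrewbogott describes is to repeat the lookup and measure the success ratio; a record still propagating through resolvers succeeds only some of the time. A small sketch (the hostname in the demo call is just a placeholder, not the m5-master record itself):

```python
import socket

def resolve_ratio(host, attempts=5):
    """Return the fraction of lookups for `host` that succeed.
    Anything below 1.0 suggests the record resolves only intermittently."""
    ok = 0
    for _ in range(attempts):
        try:
            socket.gethostbyname(host)
            ok += 1
        except socket.gaierror:
            pass  # lookup failed this time
    return ok / float(attempts)

# A healthy, locally known name should resolve every time:
print(resolve_ratio('localhost', attempts=3))  # -> 1.0
```

Note the caveats _joe_ raises still apply on top of resolver health: client-side caches, persistent connections, and clients that never re-resolve can all keep stale answers alive well past the record's TTL.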
[14:18:58] PROBLEM - nova-compute process on labvirt1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:19:16] PROBLEM - nova-compute process on virt1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:19:36] RECOVERY - nova-scheduler process on virt1000 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-scheduler [14:19:53] <_joe_> andrewbogott: I'm not sure this has to do with DNS caching btw [14:19:56] PROBLEM - check_puppetrun on backup4001 is CRITICAL puppet fail [14:20:06] (03PS1) 10Ottomata: Rsync CirrusSearchRequests.log from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/211123 (https://phabricator.wikimedia.org/T98383) [14:20:09] PROBLEM - nova-compute process on labvirt1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:20:09] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL puppet fail [14:20:09] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [14:20:09] PROBLEM - check_puppetrun on db1008 is CRITICAL puppet fail [14:20:10] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 44 failures [14:20:10] RECOVERY - check_puppetrun on lutetium is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:20:16] RECOVERY - check_puppetrun on indium is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:20:16] PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail [14:20:16] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL puppet fail [14:20:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL puppet fail [14:20:17] PROBLEM - check_puppetrun on db1025 is CRITICAL puppet fail [14:20:17] RECOVERY - check_puppetrun on payments1003 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:20:17] PROBLEM - check_puppetrun on silicon is CRITICAL puppet fail [14:20:18] PROBLEM 
- check_puppetrun on payments1004 is CRITICAL puppet fail [14:20:27] PROBLEM - check_puppetrun on bismuth is CRITICAL puppet fail [14:20:32] <_joe_> damn FR [14:20:46] _joe_: how do you mean? [14:20:48] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:20:49] (03CR) 10jenkins-bot: [V: 04-1] Rsync CirrusSearchRequests.log from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/211123 (https://phabricator.wikimedia.org/T98383) (owner: 10Ottomata) [14:21:16] <_joe_> m5-master.eqiad.wmnet just doesn't resolve [14:21:20] (03PS1) 10Springle: repool db1053 in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211124 [14:21:37] PROBLEM - nova-compute process on virt1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:21:42] (03PS2) 10Ottomata: Rsync CirrusSearchRequests.log from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/211123 (https://phabricator.wikimedia.org/T98383) [14:21:57] PROBLEM - nova-compute process on labvirt1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:22:27] RECOVERY - nova-compute process on labvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:22:27] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:22:37] RECOVERY - nova-compute process on virt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:22:38] RECOVERY - nova-compute process on virt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:22:52] <_joe_> andrewbogott: I mean how can it be a dns caching problem? 
[14:22:57] RECOVERY - nova-compute process on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:23:06] <_joe_> m5-master.eqiad.wmnet has a TTL of 300s [14:23:16] RECOVERY - Host tellurium is UP: PING OK - Packet loss = 0%, RTA = 3.41 ms [14:23:22] RECOVERY - nova-compute process on virt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:23:23] I don’t know. I just created that entry a moment ago. It resolves intermittently for me [14:23:24] <_joe_> so how can that be the problem? [14:23:26] RECOVERY - nova-compute process on virt1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:23:35] <_joe_> andrewbogott: from where exactly? [14:23:40] <_joe_> "intermittently"? [14:23:42] virt1000 [14:23:47] RECOVERY - nova-compute process on virt1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:23:53] <_joe_> that would mean our dns setup is screwed up bad [14:24:02] have a look :) [14:24:06] I think it has not been added to the labs dns? [14:24:12] If I dig it resolves sometimes but not always. [14:24:13] <_joe_> andrewbogott: where are nova logs located? [14:24:18] which ones? [14:24:27] <_joe_> andrewbogott: from virt1000?
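"Resolves sometimes but not always" is easy to misjudge from a couple of manual digs; a short loop turns it into an actual failure rate. A minimal sketch (the `resolve` parameter is injectable purely so the helper can be exercised without the WMF resolvers; the default uses the system resolver via `socket.getaddrinfo`):

```python
import socket

def resolution_failure_rate(name, attempts=20, resolve=socket.getaddrinfo):
    """Resolve `name` repeatedly and return the fraction of failed attempts.

    A result near 0.0 or 1.0 points at a stale/missing record on one
    authoritative source; something in between suggests only some of the
    resolvers behind the host's config have the record.
    """
    failures = 0
    for _ in range(attempts):
        try:
            resolve(name, None)
        except OSError:
            failures += 1
    return failures / attempts
```

Run against `m5-master.eqiad.wmnet` from virt1000, a rate around 0.5 would match the "intermittent" behaviour described above and point at one of two load-balanced resolvers lacking the new record.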
[14:24:37] <_joe_> andrewbogott: nova-compute [14:24:38] /var/log/nova [14:24:49] nova-compute runs on labvirt100* [14:24:56] PROBLEM - check_puppetrun on backup4001 is CRITICAL puppet fail [14:25:01] so in that case the logs are in /var/log/nova on those boxes [14:25:01] (03CR) 10Springle: [C: 032] repool db1053 in s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211124 (owner: 10Springle) [14:25:07] RECOVERY - check_puppetrun on pay-lvs1002 is OK Puppet is currently enabled, last run 140 seconds ago with 0 failures [14:25:07] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 44 failures [14:25:07] RECOVERY - check_puppetrun on db1008 is OK Puppet is currently enabled, last run 132 seconds ago with 0 failures [14:25:09] (03Abandoned) 10OliverKeyes: Add fluorine rsync connector [puppet] - 10https://gerrit.wikimedia.org/r/209684 (owner: 10OliverKeyes) [14:25:16] RECOVERY - check_puppetrun on samarium is OK Puppet is currently enabled, last run 133 seconds ago with 0 failures [14:25:16] RECOVERY - check_puppetrun on barium is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:25:16] RECOVERY - check_puppetrun on pay-lvs1001 is OK Puppet is currently enabled, last run 146 seconds ago with 0 failures [14:25:17] RECOVERY - check_puppetrun on payments1001 is OK Puppet is currently enabled, last run 85 seconds ago with 0 failures [14:25:17] RECOVERY - check_puppetrun on db1025 is OK Puppet is currently enabled, last run 142 seconds ago with 0 failures [14:25:17] RECOVERY - check_puppetrun on silicon is OK Puppet is currently enabled, last run 145 seconds ago with 0 failures [14:25:17] RECOVERY - check_puppetrun on payments1004 is OK Puppet is currently enabled, last run 143 seconds ago with 0 failures [14:25:27] RECOVERY - check_puppetrun on bismuth is OK Puppet is currently enabled, last run 140 seconds ago with 0 failures [14:26:24] !log springle Synchronized wmf-config/db-eqiad.php: repool db1053 in s1, warm up (duration: 00m 
13s) [14:26:33] Logged the message, Master [14:26:54] _joe_: anyway, services should be restored for now, back on the old db [14:26:57] PROBLEM - nova-compute process on labvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:27:28] PROBLEM - nova-compute process on labvirt1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:27:28] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:27:58] PROBLEM - nova-compute process on labvirt1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:28:17] PROBLEM - nova-compute process on labvirt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:28:38] RECOVERY - nova-compute process on virt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:29:57] PROBLEM - check_puppetrun on backup4001 is CRITICAL puppet fail [14:30:07] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 80 seconds ago with 0 failures [14:34:07] RECOVERY - nova-compute process on labvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:34:37] RECOVERY - nova-compute process on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:34:56] (03PS3) 10Ottomata: Rsync CirrusSearchRequests.log from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/211123 (https://phabricator.wikimedia.org/T98383) [14:34:57] RECOVERY - check_puppetrun on backup4001 is OK Puppet is currently enabled, last run 126 seconds ago with 0 failures [14:34:57] RECOVERY - nova-compute process on labvirt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:35:07] RECOVERY - nova-compute process on 
labvirt1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:35:08] RECOVERY - nova-compute process on labvirt1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:35:08] (03CR) 10Ottomata: [C: 032 V: 032] Rsync CirrusSearchRequests.log from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/211123 (https://phabricator.wikimedia.org/T98383) (owner: 10Ottomata) [14:35:16] RECOVERY - nova-compute process on labvirt1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:35:47] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:35:51] hey godog, yt? [14:36:14] ok, here’s a dumb question: How do I schedule downtime for a host in icinga? I know how to do it for a /service/ but I can’t get the web ui to display up/down [14:36:38] andrewbogott: i think if you go to the host icinga page [14:36:42] you can do schedule downtime for host [14:36:48] andrewbogott: there should be a 'schedule downtime' button on the host page [14:37:20] !log elastic1021 es-tool restart-fast [14:37:26] Logged the message, Master [14:37:26] ok, I see it. Tedious! [14:38:03] andrewbogott: tedious?
Neon should have a script for it as well iirc [14:38:10] Not sure though :) [14:40:17] PROBLEM - puppet last run on fluorine is CRITICAL puppet fail [14:40:18] (03PS1) 10Ottomata: Fix for stat1002 CirrusSearchRequests.log rsync job [puppet] - 10https://gerrit.wikimedia.org/r/211126 (https://phabricator.wikimedia.org/T98383) [14:40:23] (03CR) 10jenkins-bot: [V: 04-1] Fix for stat1002 CirrusSearchRequests.log rsync job [puppet] - 10https://gerrit.wikimedia.org/r/211126 (https://phabricator.wikimedia.org/T98383) (owner: 10Ottomata) [14:40:29] (03PS2) 10Ottomata: Fix for stat1002 CirrusSearchRequests.log rsync job [puppet] - 10https://gerrit.wikimedia.org/r/211126 (https://phabricator.wikimedia.org/T98383) [14:40:58] PROBLEM - puppet last run on stat1002 is CRITICAL puppet fail [14:42:06] (03CR) 10Ottomata: [C: 032] Fix for stat1002 CirrusSearchRequests.log rsync job [puppet] - 10https://gerrit.wikimedia.org/r/211126 (https://phabricator.wikimedia.org/T98383) (owner: 10Ottomata) [14:44:17] RECOVERY - puppet last run on stat1002 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [14:44:25] (03PS1) 10Andrew Bogott: Use m5-master.eqiad.wmnet for the openstack/labs db server [puppet] - 10https://gerrit.wikimedia.org/r/211127 [14:44:45] (03PS8) 10Mobrovac: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [14:45:10] (03PS9) 10Mobrovac: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [14:45:16] RECOVERY - puppet last run on fluorine is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:46:03] ok, trying again — sorry in advance if I missed downtiming yet more nova services [14:48:47] (03PS10) 10Mobrovac: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) 
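The "tedious" web-UI clicking above has a scriptable alternative: Icinga (like Nagios) accepts external commands written to its command file, including a documented `SCHEDULE_HOST_DOWNTIME` command. A hedged sketch of building that command line; the command-file path is an assumption about a typical Icinga 1.x install, not something verified against neon:

```python
import time

# Assumed default Icinga 1.x command-file path; adjust for the local install.
CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"

def host_downtime_cmd(host, minutes, author, comment, now=None):
    """Build a SCHEDULE_HOST_DOWNTIME external command line.

    Documented format:
    [ts] SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
    """
    now = int(now if now is not None else time.time())
    end = now + minutes * 60
    # fixed=1: downtime runs exactly from <start> to <end>; trigger_id=0: none
    return "[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;%s;%s\n" % (
        now, host, now, end, minutes * 60, author, comment)

# Appending the line to CMD_FILE (as a user with write access) schedules
# the downtime without the web UI:
# with open(CMD_FILE, "a") as f:
#     f.write(host_downtime_cmd("labvirt1001", 120, "andrew", "qemu upgrade"))
```

Wrapped in a small CLI, this is presumably the kind of script on neon that godog half-remembers.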
[14:48:47] akosiaris: yt? [14:49:46] !log migrating mariadb service from virt1000 to m5-master [14:49:51] Logged the message, Master [14:50:47] ottomata: an1012 + an1022 alerts [14:51:10] 6operations: Allow rsync traffic between analytics VLAN and fluorine - https://phabricator.wikimedia.org/T99245#1287982 (10Ottomata) 3NEW [14:51:19] 1012! i know about 1022. [14:51:23] looking [14:51:24] 6operations, 10Mathoid, 6Services, 5Patch-For-Review: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1287989 (10mobrovac) a:3mobrovac [14:51:49] (03CR) 10Andrew Bogott: [C: 032] Use m5-master.eqiad.wmnet for the openstack/labs db server [puppet] - 10https://gerrit.wikimedia.org/r/211127 (owner: 10Andrew Bogott) [14:51:56] paravoid: , 1012? [14:52:08] it disappeared now [14:52:16] it was a kafka check, the message was "(null)" [14:52:49] hm, strange, ok [14:53:08] paravoid: an22 is this: https://phabricator.wikimedia.org/T99105 [14:53:23] i looked a bit into it, somehow 2 upload partitions are on a single disk. [14:53:37] 6operations, 10Mathoid, 6Services, 5Patch-For-Review: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1287992 (10mobrovac) @akosiaris , `mediawiki/services/mathoid` and `mediawiki/services/mathoid/deploy` are now set up as needed. I also took the liberty to improve the [mathoid... [14:53:45] when we get new hardware next quarter, i might investigate using raid as a single mount for kafka data, rather than letting the broker balance [14:53:46] not sure. [14:54:57] PROBLEM - puppet last run on mw1021 is CRITICAL Puppet last ran 4 days ago [14:56:06] <_joe_> I'm taking care of this ^^ [14:56:37] RECOVERY - puppet last run on mw1021 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:57:55] paravoid: can you help with this? [14:57:56] https://phabricator.wikimedia.org/T99245 [14:58:02] oh i have another request too, filing... [14:58:37] uh, what? [14:58:44] mediawiki logs to stat?
how come [15:00:07] paravoid: https://phabricator.wikimedia.org/T98383 for CirrusSearchRequests.log [15:02:03] (Cannot access the database server: Can't connect to MySQL server on '10.64.16.29' (4) (10.64.16.29)) [15:04:27] ottomata: so why can't stat1002 just subscribe to the udp2log stream and log directly? [15:04:34] why do we need to rsync logs around? [15:04:36] 6operations, 10Analytics-EventLogging: Allow eventlogging ZeroMQ traffic to be consumed inside of the Analytics VLAN. - https://phabricator.wikimedia.org/T99246#1288016 (10Ottomata) 3NEW [15:04:44] paravoid: udp2log is not a pub/sub :) [15:05:29] is it not multicast? [15:05:32] no [15:05:38] we could set up a relay [15:05:41] like we did for webrequest [15:05:42] but uhhh [15:05:46] do we really want to do that? [15:05:49] why not rsync? [15:06:01] really, mw should just put this stuff in kafka! [15:06:07] i want eventlogging to be able to do that too [15:06:12] one day i'll implement a MW Kafka logger [15:06:15] that will be fun :) [15:06:17] why? [15:06:24] why kafka? [15:06:30] because we are trying to do away with udp2log! [15:06:33] why would we transfer blobs around with rsync for a log stream [15:07:03] well mediawiki logs synchronously and you really don't want a kafka outage to result in a site outage -- may I remind you the whole logstash-related/c4 outage :) [15:07:10] ??? [15:07:17] site outage? [15:07:17] PROBLEM - puppet last run on virt1000 is CRITICAL puppet fail [15:07:28] oh [15:07:43] mw logs synchronously to udp2log?
[15:07:43] anyway, gotta go, my flight is leaving soon [15:07:45] it is udp [15:07:49] not sure how it does that :) [15:07:58] * paravoid waves [15:08:01] ooooook [15:08:08] laterrrrrs [15:08:19] mayyybe akosiaris can help meeeeeeeeeeeee [15:08:19] :) [15:09:36] PROBLEM - High load average on labstore1001 is CRITICAL 87.50% of data above the critical threshold [24.0] [15:11:04] ottomata: your scenario might be another use-case for https://phabricator.wikimedia.org/T84923 [15:13:57] RECOVERY - puppet last run on virt1000 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:27] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [15:14:40] mobrovac: cool! [15:15:03] mobrovac: are you aware of http://confluent.io/docs/current/kafka-rest/docs/index.html? [15:16:01] ottomata: that's really useful to know! cheers [15:16:55] mobrovac: the platform confluent is building and recommending is really really good. if wmf can do something like that for many services, i think we would be in a much better place [15:17:00] big problem is: it all feels very java-y [15:17:18] eh [15:17:21] json isn't well supported, that is, if you want to use their schema-registry [15:17:28] which is really nice. [15:19:11] it uses zookeeper [15:19:12] hm [15:19:17] ottomata: thanks [15:20:01] 6operations, 10Analytics, 10MediaWiki-General-or-Unknown, 6Services, and 4 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1288059 (10GWicke) [15:20:34] damn, gwicke was faster to update that [15:20:36] :P [15:20:42] 6operations, 10Analytics, 10MediaWiki-General-or-Unknown, 6Services, and 4 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#933968 (10GWicke) [15:21:06] bblack or _joe_ can you help out jynus with a firewall issue? [15:21:10] (03CR) 10Physikerwelt: [C: 04-1] "Merge with care."
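The "it is udp / not sure how it does that" exchange is the crux of paravoid's objection: a UDP send neither blocks nor fails visibly when the collector is down, so a logging outage cannot stall the request path, whereas a naively configured Kafka producer can. A minimal sketch of udp2log-style fire-and-forget emission (the port here is an arbitrary example, not WMF's udp2log port):

```python
import socket

def udp_log(line, host="127.0.0.1", port=8420):
    """Fire-and-forget log emission, udp2log style.

    sendto() on a datagram socket returns as soon as the kernel accepts
    the packet; whether anything is listening on the other end is
    invisible to the caller, which is exactly the safety property being
    argued for above.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("utf-8"), (host, port))
    finally:
        sock.close()
```

A MW Kafka logger would need to be configured fully asynchronous (no acks, drop on full buffer) to match this guarantee; that trade-off is what the later "reliable event bus" ticket (T84923) is about.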
[puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [15:21:13] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288061 (10Krenair) >>! In T98676#1287485, @Dereckson wrote: > * Extensions enabled: Securepoll Note that SecurePoll is default on all... [15:21:19] We’re trying to give labnet1001 mysql access to db1009. [15:21:39] 6operations, 10Analytics, 10MediaWiki-General-or-Unknown, 6Services, and 4 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1288062 (10mobrovac) [15:22:08] gwicke: i am totally biased because I am a regular at the Church of Kafka [15:22:20] but, this is so good: [15:22:21] http://confluent.io/docs/current/platform.html [15:22:32] everything should be built on it! :) [15:22:59] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288064 (10Krenair) >>! In T98676#1287521, @AddisWang wrote: > This is my first time using phabricator, would you mind telling me how do... [15:23:01] model your services as events, enforce a schema, put everything in kafka. [15:23:02] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288065 (10Krenair) >>! In T98676#1287521, @AddisWang wrote: > This is my first time using phabricator, would you mind telling me how do... [15:23:23] ottomata: I generally like Kafka, but have some doubts around long-term event storage and cross-DB replication [15:23:35] make that cross-DC [15:24:09] mirrormaker style? 
but it might make sense to separate long-term storage from limited buffering in Kafka, as the requirements might be difficult to satisfy at the same time [15:24:35] mark or paravoid, a little help with network filtering? [15:25:30] (03CR) 10Alex Monk: "This should really be part of the dependent change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211094 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [15:25:31] ottomata: mirrormaker sounds fine for master-slave [15:25:50] but I'm not sure how manageable that would be with active-active and more than two DCs [15:25:53] I guess every single person who knows about network config is gone :( [15:26:18] aye, gwicke not so much master slave, but production clusters -> aggregate/analytics cluster [15:26:33] bblack: ? [15:26:35] but, linkedin does it at a MUCH higher volume than we do. [15:26:41] andrewbogott: ? [15:26:42] but i don't have any experience with MirrorMaker yet [15:26:52] oh! [15:26:53] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288066 (10Krenair) >>! In T98676#1287639, @Dereckson wrote: > @greg Could we have a window to create this wiki? Configuration changes (... [15:27:03] bblack: we’re having trouble getting traffic to db1009 from labnet1001 [15:27:12] ah yes, I seem to recall this from yesterday! [15:27:14] bblack: probably a filter someplace [15:27:16] but gwicke, ja if we were to start using Kafka as a production job queue (and other things) we would want it to be in a separate kafka cluster than the analytics one [15:27:17] ottomata: my main concern is more complexity; I think for the job and low-volume event use case Kafka is more than adequate [15:27:27] for sure. [15:27:38] but, i would like to see Kafka used for more production things.
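The MirrorMaker pattern being debated (production clusters feeding an aggregate/analytics cluster) is structurally just "consume from source, re-produce to target". MirrorMaker itself is a JVM tool, so this is only an illustrative sketch of the data flow, with the source consumer and target producer injected as plain callables rather than real Kafka clients:

```python
def mirror(source, produce, transform=None):
    """Relay messages from a source cluster's consumer to a target producer.

    `source` is any iterable of (topic, payload) pairs (a Kafka consumer
    in the real case); `produce` is the target cluster's send function.
    Returns the number of messages relayed.
    """
    relayed = 0
    for topic, payload in source:
        if transform is not None:
            # e.g. strip PII before the message lands in the analytics cluster
            payload = transform(payload)
        produce(topic, payload)
        relayed += 1
    return relayed
```

gwicke's active-active worry shows up precisely here: with more than two DCs each pair of clusters needs its own relay and its own loop-prevention story, which one-directional prod-to-analytics mirroring sidesteps.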
[15:27:39] andrewbogott: but I think that's intentional in the filters, and no I still don't quite have time to look at it, either. [15:27:56] bblack: ah, there’s an outage happening right now as a result. I guess I can roll back though [15:28:01] (or really know much about it) [15:28:05] ok [15:28:43] probably rolling back would be best, yes [15:28:50] (03PS1) 10Andrew Bogott: Dang. [puppet] - 10https://gerrit.wikimedia.org/r/211129 [15:28:55] ottomata: we were just talking about how we need to start figuring this out soon; will keep you in the loop [15:29:19] ja, cool, thanks [15:29:23] going to add myself as subscriber [15:29:30] bblack: that may cause data integrity issues though [15:29:48] heh [15:29:48] !log elastic1022 es-tool restart-fast [15:29:57] nobody tested that access would work first? [15:30:15] !log springle Synchronized wmf-config/db-eqiad.php: raise db1053 load (duration: 00m 12s) [15:32:05] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288075 (10greg) >>! In T98676#1274501, @Slaporte wrote: > I'll check the policy and get back to you here. Just confirming: @Slaporte:... [15:32:52] (03CR) 10Andrew Bogott: [C: 032] Dang. [puppet] - 10https://gerrit.wikimedia.org/r/211129 (owner: 10Andrew Bogott) [15:34:22] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288084 (10Krenair) See his next comment... [15:36:19] (03CR) 10GWicke: "Once this patch is merged, we can disable Parsoid cache update jobs. 
That will reduce the load on the Parsoid cluster by about half, and a" [puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: 10GWicke) [15:36:56] 7Blocked-on-Operations, 10RESTBase, 5Patch-For-Review: Deploy RESTBase to group1 wikis - https://phabricator.wikimedia.org/T93452#1288086 (10GWicke) [15:37:53] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288089 (10Slaporte) >>! In T98676#1288075, @greg wrote: >>>! In T98676#1274501, @Slaporte wrote: >> I'll check the policy and get back... [15:47:41] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288140 (10AddisWang) >>! In T98676#1288066, @Krenair wrote: >>>! In T98676#1287639, @Dereckson wrote: >> @greg Could we have a window t... [15:47:59] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288141 (10AddisWang) >>! In T98676#1288089, @Slaporte wrote: >>>! In T98676#1288075, @greg wrote: >>>>! In T98676#1274501, @Slaporte wr... [15:49:18] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [15:49:27] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288142 (10Krenair) >>! In T98676#1288140, @AddisWang wrote: >>>! In T98676#1288066, @Krenair wrote: >>>>! In T98676#1287639, @Dereckson... [15:54:09] springle: so if this task is (now? always was?) blocked on faidon, probably best to find something else for Jaime in the meantime. 
[15:55:53] 7Blocked-on-Operations, 10RESTBase, 5Patch-For-Review: Deploy RESTBase to group1 wikis - https://phabricator.wikimedia.org/T93452#1288156 (10Joe) @GWicke is this scheduled in the deployments calendar? I think it's the kind of thing that should fit in there in general. [15:55:58] andrewbogott, we can check the iptables conf [15:55:59] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288157 (10AddisWang) >>! In T98676#1288142, @Krenair wrote: >>>! In T98676#1288140, @AddisWang wrote: >>>>! In T98676#1288066, @Krenair... [15:56:28] jynus: I think before we go forward we need to establish some consensus about the premise of this task altogether. [15:56:40] Maybe we need to use a db cluster in the labs-support vlan or something... [15:56:54] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288166 (10Krenair) What's confusing? [15:57:07] But I don’t want to rely on a db server that most of the Ops think shouldn’t exist. [15:58:57] andrewbogott: I don't see the complexity? We've approved putting labs services on a dedicated prod DB (and previously started doing it for pdns/designate on m1). In T92693, it was noted that Faidon needed ot be involved for questions on vlans rules; which is still the case. [15:59:16] Ah, so you mean he needs to be involved in implementing, not in approving/designing? [15:59:26] I misunderstood [15:59:28] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288174 (10AddisWang) >>! In T98676#1288166, @Krenair wrote: > What's confusing? Well I guess it need to go through some process. I'm... [15:59:44] andrewbogott: correct. 
"Check if we need any special network/vlan rules" [16:00:07] springle: ok, that’s less discouraging :) [16:00:46] andrewbogott: policy-wise, this is happening. just the switchover seems to have jumped the gun somewhere along the line. No real problems, hopefully :) [16:02:26] springle: yeah, mostly it was my misunderstanding of nova design — all db access is /supposed/ to be marshalled through a service running on virt1000, but apparently they don’t actually follow that rule as much as I thought. [16:02:46] springle: but, getting pushback from brandon about ‘is this even safe’ caught me off guard. [16:03:25] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288202 (10Krenair) >>! In T98676#1288174, @AddisWang wrote: >>>! In T98676#1288166, @Krenair wrote: >> What's confusing? > > Well I gu... [16:08:09] 7Blocked-on-Operations, 10RESTBase, 5Patch-For-Review: Deploy RESTBase to group1 wikis - https://phabricator.wikimedia.org/T93452#1288216 (10GWicke) @Joe, it was at some point (this ticket is from March), but the patch wasn't merged then. I re-added it for this week. [16:09:43] 7Blocked-on-Operations, 10RESTBase, 5Patch-For-Review: Enable group 1 wikis in RESTBase - https://phabricator.wikimedia.org/T93452#1288220 (10GWicke) [16:16:53] !log bounce statsdlb on graphite1001 [16:17:01] Logged the message, Master [16:23:42] !log elastic1023 and elastic1024 (skipped one log) es-tool restart-fast [16:23:49] Logged the message, Master [16:28:33] am I allowed to add mobrovac to the group mediawiki-services-mathoid (i.e. 
https://gerrit.wikimedia.org/r/#/admin/groups/697,members) or is there a predefined process for adding people there [16:29:46] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [16:29:53] !log bounce carbon on graphite1001 [16:29:59] Logged the message, Master [16:30:15] I don't think that this would change any effective rights but would give an overview of who has knowledge about the contents of that repository [16:31:42] please let me know if that's against some rule [16:33:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:33:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:33:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:34:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:34:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:34:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:34:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:34:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:34:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:35:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:37:16] physikerwelt: probably no big deal [16:37:25] 6operations, 10Wikimedia-DNS,
10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288261 (10Dereckson) (You know, [[ https://wikitech.wikimedia.org/wiki/Add_a_wiki | the manual current formulation ]] lets think DNS/Ap... [16:38:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [16:38:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [16:38:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [16:38:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [16:38:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [16:38:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [16:38:52] ^ that's graphite recovering [16:39:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [16:39:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [16:39:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [16:39:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [16:40:12] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288264 (10Glaisher) I updated that documentation a few months back and it still needs lots of love to be perfect (see T87588). :) Feel... 
[17:01:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [17:05:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [17:07:27] !log still seeing metrics from xhprof creating, looking for source [17:07:34] Logged the message, Master [17:07:37] ori: ^ [17:09:35] I think the nutcracker on mw1081 is sick. Lots and lots of memcached errors from that host like `Memcached error for key "commonswiki:revisiontext:textid:126134301" on server "/var/run/nutcracker/nutcracker.sock:0": SYSTEM ERROR` [17:10:28] lots & lots == 167K in the last 15 minutes [17:10:44] (03CR) 10Physikerwelt: "see Ia7a8b3062493a071460994157f405796d076f5fa" [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [17:10:48] 6operations, 6Search-Team, 7Elasticsearch: Setup backups of elasticsearch indicies - https://phabricator.wikimedia.org/T91404#1288360 (10Manybubbles) [17:11:52] Nemo_bis: Thank you [17:13:54] Can some root restart nutcracker on mw1081? There are 13M memcached errors in logstash from that host in the last 24 hours. [17:14:03] godog: ^ [17:15:07] bd808: done [17:15:28] ottomata: yt ? [17:15:34] akosiaris: thanks. I'll keep watching to see if that helped [17:16:34] akosiaris: looks like that fixed it! :) no new errors in >60s [17:16:41] :-) [17:17:12] 'Have you tried turning it off and on again' [17:17:16] someday I'll figure out how to actually get us some monitoring on these kinds of things [17:17:48] logging and monitoring team? 
:) [17:18:04] LMM (logging, monitoring and metrics) [17:18:04] bd808: happy to talk about it at the hackathon btw [17:19:51] akosiaris: ya, bout to head out [17:19:57] !log bounce hhvm on mw1017 [17:20:05] Logged the message, Master [17:20:07] PROBLEM - check_ipn_redir on barium is CRITICAL: Connection refused [17:20:08] was going to ping you about these: [17:20:08] https://phabricator.wikimedia.org/T99246 [17:20:11] https://phabricator.wikimedia.org/T99246 [17:20:14] oops [17:20:24] https://phabricator.wikimedia.org/T99245 [17:20:25] ottomata: were you looking for me earlier too? [17:20:36] that second one is more important [17:20:46] godog: ja, was going to brain bounce w you about reqstats [17:20:55] but will have to do that later [17:21:12] ottomata: ok! [17:21:34] akosiaris: gotta go, thanks, laters! [17:25:16] PROBLEM - check_ipn_redir on barium is CRITICAL: Connection refused [17:29:18] !log clean up remaining xhprof metrics from graphite1001 [17:29:26] Logged the message, Master [17:30:07] RECOVERY - check_ipn_redir on barium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 506 bytes in 0.013 second response time [17:33:36] !log updating qemu binaries on labvirt1001 [17:33:41] Logged the message, Master [17:41:01] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, and 3 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1288457 (10aaron) 5Open>3Resolved Base i/o is down 33% and I have yet to see any new spikes at http://gan... [17:41:06] 6operations: Allow rsync traffic between analytics VLAN and fluorine - https://phabricator.wikimedia.org/T99245#1288460 (10akosiaris) hole opened on both cr1-eqiad and cr2-eqiad. [17:43:59] manybubbles ^d how's the restart? [17:44:28] !log rolling restart almost done on elastic1025 - 1026 is next! 
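bd808's "someday I'll figure out how to actually get us some monitoring on these kinds of things" (13M errors from one nutcracker before anyone noticed) is, at its simplest, a counter plus a threshold over log lines. A hypothetical batch sketch; the line shape is modeled on the mw1081 errors quoted above, and the regex is an assumption about that format, not a parser for real logstash output:

```python
import re
from collections import Counter

# Assumed shape of the memcached error lines quoted in-channel.
ERROR_RE = re.compile(r'Memcached error for key "[^"]+" on server "([^"]+)"')

def noisy_servers(lines, threshold):
    """Count memcached errors per nutcracker socket; flag sockets at or
    above `threshold`.

    Running this over a fixed window of log lines (say, 15 minutes) and
    alerting on a non-empty result would catch a sick nutcracker long
    before the error count reaches the millions.
    """
    counts = Counter()
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return {srv: n for srv, n in counts.items() if n >= threshold}
```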
[17:44:36] Logged the message, Master [17:44:36] godog: I suspect I'll finish them this afternoon [17:44:50] I'm just keeping the window open in the background [17:45:01] <^d> godog: I've just been idly watching [17:45:02] <^d> :) [17:45:06] the timeout only happened to me twice [17:45:51] while enabling replication back again? yeah saw that a couple of times yesterday [17:46:15] godog: bleh. [17:46:29] I've filed a phab ticket for it. maybe someone on the search and discovery team will grab it [17:46:53] indeed, looks good otherwise (?) [17:46:55] I suspect it's caused by elasticsearch getting super busy on admin work [17:47:16] godog: yeah - if that didn't happen I think we could totally automate it. and the actual rollout seems to be doing good things [17:47:54] manybubbles: are you bumping the elasticsearch version or something else? [17:48:02] nice! [17:48:03] bd808: bumping some plugin versions [17:48:07] * godog has to go [17:48:09] *nod* [17:48:12] godog: bye [17:48:20] bd808: I'm waiting for 1.6 to bump the es version [17:48:36] I just don't feel like changing a bunch of things at once [17:48:38] bye! [17:48:59] if we get 1.6 then we get sealing which, if we do the legwork, can make the rolling restarts much faster [17:49:05] I'm hoping for an hour [17:49:34] 6operations, 10Analytics-EventLogging: Allow eventlogging ZeroMQ traffic to be consumed inside of the Analytics VLAN. - https://phabricator.wikimedia.org/T99246#1288492 (10akosiaris) 5Open>3Resolved a:3akosiaris done. I just punched holes for the mentioned ports. Principle of least privilege and all. Ch... [17:50:25] manybubbles: yeah.
sealing seems like it will be a huge win for the logstash cluster too [17:50:40] 6operations: Allow rsync traffic between analytics VLAN and fluorine - https://phabricator.wikimedia.org/T99245#1288495 (10akosiaris) Tested this and packets do pass the ACL successfully, not resolving since some ferm changes need to be done as well [17:53:04] (03PS2) 10Milimetric: [WIP] Add parallel kafka pipeline [puppet] - 10https://gerrit.wikimedia.org/r/210765 (https://phabricator.wikimedia.org/T98779) [17:55:42] !log migration of db service from virt1000 to m5-master aborted, service continues on virt1000 [17:55:50] Logged the message, Master [17:57:15] jynus: did you find anything interesting on those database connection failures? Should I just try to make a patch to raise the HHVM side's connect timeout? [17:57:42] bd808, I only searched on the database side [17:57:53] and aside from high load, I found nothing suspicious [17:57:59] ok. [17:58:11] talking with sean he agreed that it is probably what you're describing now [17:58:23] *nod* [17:59:02] I will close the ticket, we have at least the info if we have to reopen [18:01:14] bd808, I think there was an HHVM ticket? I want to link it [18:01:33] only if you have it handy [18:01:44] I'm not sure if there was one or not... I can look [18:02:29] searching hhvm gives too many results :-) [18:03:17] there is https://phabricator.wikimedia.org/T98998 [18:03:37] and https://phabricator.wikimedia.org/T98489 [18:04:09] thank you! [18:04:46] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [18:05:36] (Unable to access the database server: Can't connect to MySQL server on '10.64.48.26' (4) (10.64.48.26)) [18:06:36] 6operations, 7database: Database connection failure issues on s7 shard - https://phabricator.wikimedia.org/T98998#1288525 (10jcrespo) 5Open>3Resolved A database inspection didn't reveal anything suspicious, aside from "higher load".
There is no confirmation, but the main suspect is T98489. Reopen if it hap... [18:08:34] 6operations, 7HHVM: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1288536 (10Nemo_bis) This happened to me at least twice today when saving an edit, on multiple wikis (e.g. ruwiki, dewiki). In one case the edit got saved despite the error, in the other not. [18:28:29] !log bounce hhvm on mw1118 [18:28:34] Logged the message, Master [18:29:03] !log elastic1026 es-tool restart-fast [18:29:08] Logged the message, Master [18:31:22] 6operations, 10Analytics-Cluster, 3Fundraising Sprint Kraftwerk: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1288620 (10AndyRussG) [18:38:21] 6operations, 10Analytics-Cluster, 3Fundraising Sprint Kraftwerk: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1288632 (10AndyRussG) a:3AndyRussG [18:41:56] (03PS2) 10Dzahn: apt: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211069 [18:42:41] (03PS3) 10Dzahn: apt: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211069 [18:43:32] (03CR) 10Dzahn: [C: 032] "this and related ones are all for Bug:T93645 but didn't want to spam phab that much" [puppet] - 10https://gerrit.wikimedia.org/r/211069 (owner: 10Dzahn) [18:43:53] 6operations, 5Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#1288645 (10Dzahn) 5Open>3Resolved [18:44:04] (03PS2) 10Dzahn: chromium: indentation fix [puppet] - 10https://gerrit.wikimedia.org/r/211070 [18:44:18] (03PS3) 10Dzahn: chromium: indentation fix [puppet] - 10https://gerrit.wikimedia.org/r/211070 [18:44:22] (03PS3) 10Krinkle: contint: Use device=none in tmpfs [puppet] - 10https://gerrit.wikimedia.org/r/204542 [18:45:14] (03CR) 10Dzahn: [C: 032] "T93645" [puppet] - 10https://gerrit.wikimedia.org/r/211070 (owner: 10Dzahn) [18:45:33] 6operations, 5Patch-For-Review: align
puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#1142580 (10Dzahn) [18:45:43] (03PS2) 10Dzahn: datasets: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211071 [18:46:44] 6operations, 5Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#1288653 (10Dzahn) 5Resolved>3Open "Dzahn closed this task as "Resolved" by committing " was not what i wanted to happen. I just mentioned it in one of many patches. [18:47:57] (03PS3) 10Dzahn: datasets: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211071 [18:48:18] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:49:06] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:49:44] (03CR) 10Dzahn: [C: 032] datasets: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211071 (owner: 10Dzahn) [18:49:58] 6operations, 5Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#1288675 (10Dzahn) 5Open>3Resolved [18:50:21] grmbl, i don't want it to close the ticket [18:51:08] 6operations, 5Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#1288679 (10Dzahn) 5Resolved>3Open [18:51:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [18:51:52] stops spamming now [18:52:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [18:52:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [19:02:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [19:09:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is 
CRITICAL 11.11% of data above the critical threshold [20000.0] [19:10:20] (03PS3) 10Dzahn: add cn.wikimedia.org and cn.m.wikmedia.org [dns] - 10https://gerrit.wikimedia.org/r/211109 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [19:10:45] (03PS4) 10Dzahn: add cn.wikimedia.org and cn.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/211109 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [19:11:39] (03CR) 10Dzahn: [C: 032] "AffCom approved: https://phabricator.wikimedia.org/T98676#1287461" [dns] - 10https://gerrit.wikimedia.org/r/211109 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [19:12:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [19:13:38] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288755 (10Dzahn) created in DNS: https://cn.wikimedia.org/ https://cn.m.wikimedia.org/ [19:15:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [19:19:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [19:20:57] (03PS1) 10BryanDavis: Set HHVM mysql connection timeout to 3s [puppet] - 10https://gerrit.wikimedia.org/r/211155 (https://phabricator.wikimedia.org/T98489) [19:24:49] (03CR) 10Dzahn: [C: 031] "easy change and +1 but still needs deployment steps after merge" [puppet] - 10https://gerrit.wikimedia.org/r/211112 (https://phabricator.wikimedia.org/T98676) (owner: 10Dereckson) [19:27:09] (03CR) 10Dzahn: [C: 031] "i just feel it will probably cause some unexpected messages for the deployer who runs this when a new host is being added the very first t" [puppet] - 10https://gerrit.wikimedia.org/r/210938 (https://phabricator.wikimedia.org/T95436) (owner: 10John F. 
Lewis) [19:27:57] (03PS2) 10Dzahn: Add mira to mediawiki-installation dsh [puppet] - 10https://gerrit.wikimedia.org/r/210938 (https://phabricator.wikimedia.org/T95436) (owner: 10John F. Lewis) [19:30:53] 6operations, 10Wikimedia-DNS, 10Wikimedia-Site-requests, 5Patch-For-Review: Create fishbowl wiki for Wikimedia User Group China - https://phabricator.wikimedia.org/T98676#1288816 (10AddisWang) These explanations make a lot of sense to me! Thank you all! [19:32:58] (03PS1) 10Dzahn: zuul: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211165 [19:40:24] 6operations, 10Wikimedia-General-or-Unknown, 7Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1288867 (10RobH) a:5RobH>3None This is only valid still if T86655 is not valid anymore. As it appears still valid, this would be rejected. [19:41:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [19:43:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [19:44:13] !log elastic1027 es-tool restart-fast [19:44:19] Logged the message, Master [19:46:23] 6operations, 10Wikimedia-General-or-Unknown, 7Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1288890 (10RobH) Addition: Only valid for long term, short term I suppose its annoying for folks still using SVN. However, are we still really wanting to suppor... 
[19:46:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [19:48:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [19:52:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [19:55:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [19:57:48] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [20:11:28] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 1 below the confidence bounds [20:18:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 1 below the confidence bounds [20:51:14] 6operations, 10Wikimedia-DNS, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1289056 (10Legoktm) [20:51:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [20:54:20] 6operations, 10Wikimedia-DNS, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1289068 (10Bawolff) After reading what our privacy policy actually says. I withdraw my previous comment. However, the video server probably n... 
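[Editor's note] The Icinga alerts throughout this log report figures like "11.11% of data above the critical threshold [20000.0]": the check fetches a window of graphite datapoints and alerts when the fraction above the limit exceeds a configured percentage. The following is a minimal reimplementation of that percent-over-threshold logic, not the actual check code.

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above `threshold`."""
    points = [p for p in datapoints if p is not None]
    if not points:
        return 0.0
    over = sum(1 for p in points if p > threshold)
    return 100.0 * over / len(points)

# One sample out of nine non-null points over the limit reproduces
# the familiar 11.11% from the Varnishkafka alerts:
series = [1500, 900, 25000, 1200, None, 800, 400, 600, 1100, 300]
print(round(percent_above(series, 20000), 2))  # 11.11
```

This also explains the flapping PROBLEM/RECOVERY pairs above: a single delivery-error spike in the window trips the check, and it recovers as soon as that spike ages out.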
[20:56:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [21:09:28] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 7 below the confidence bounds [21:15:41] 6operations, 10Hackathon-Lyon-2015, 10Wikimedia-Site-requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1289110 (10Verdy_p) Isn't there a support in the SQL engine to declare a 2nd database as a mirror, (and then let the SQL engine perform the sy... [21:17:07] PROBLEM - puppet last run on achernar is CRITICAL puppet fail [21:35:08] RECOVERY - puppet last run on achernar is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [21:57:07] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [22:27:45] 6operations: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1289430 (10aaron) >>! In T89400#1036074, @Joe wrote: > For the record, we had a failover a couple of months ago and we had no big issues (apart from the jobqueue being briefly down). > > @aaron can you ple... [22:34:10] 6operations, 6Phabricator: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1289465 (10ksmith) Note that a static snapshot would be fine. [22:42:24] dead gerrit bot? 
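[Editor's note] The "HTTP error ratio anomaly detection" alerts above report how many datapoints fall outside computed confidence bounds ("10 data above and 1 below"). The banding algorithm itself isn't shown in the log; the sketch below only illustrates the final counting step, assuming the upper and lower bounds have already been computed elsewhere.

```python
def bounds_violations(series, lower, upper):
    """Count datapoints above the upper and below the lower confidence bound."""
    above = sum(1 for v, hi in zip(series, upper) if v > hi)
    below = sum(1 for v, lo in zip(series, lower) if v < lo)
    return above, below

vals  = [5, 12, 3, 14, 9]
upper = [10, 10, 10, 10, 10]
lower = [4, 4, 4, 4, 4]
print(bounds_violations(vals, lower, upper))  # (2, 1)
```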
[22:47:49] (03CR) 10Alex Monk: "(restarted grrrit-wm, test)" [puppet] - 10https://gerrit.wikimedia.org/r/211305 (owner: 10ArielGlenn) [22:47:52] apergos, ^ [22:48:06] yay [22:48:27] at this hour I just whine instead of fixing things, even if I'm on clinic duty [22:48:36] 2 am friday night, heh [23:16:30] (03PS2) 10Dzahn: zuul: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211165 [23:20:29] (03CR) 10Dzahn: [C: 032] zuul: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211165 (owner: 10Dzahn) [23:20:37] (03PS1) 10Dzahn: install-server: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211312 [23:24:01] (03PS1) 10Dzahn: wikistats,planet: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211315 [23:24:19] (03PS2) 10Dzahn: install-server: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211312 [23:25:09] (03CR) 10Dzahn: [C: 032] install-server: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211312 (owner: 10Dzahn) [23:25:38] (03PS2) 10Dzahn: wikistats,planet: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211315 [23:26:38] for any opsen that are around on this shift: it's likely that in some hours snapshot1004 will go into swap, don't worry if it whines. just let it do its thing [23:26:54] I'll be looking at it tomorrow morning my time and will clean up anything as needed. [23:27:00] thanks much! [23:27:06] apergos: thanks for the heads up, ok [23:27:14] apergos: Problem is that that can break scap :/ [23:27:24] hoo, it can but only for the one host [23:27:52] apergos: thanks for the warning [23:27:57] yep [23:28:19] it shouldn't swap horribly but there will likely be a slowdown [23:28:44] someday we will start that project to rewrite dumps...
someday [23:28:50] in a few more days I hope to have a fix ready to deploy [23:28:56] and that will take care of that [23:29:07] bd808: yep, someday :-) [23:29:43] I was excited that it might happen and then reorg pretty much squashed that hope [23:29:51] (03CR) 10Dzahn: [C: 032] wikistats,planet: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211315 (owner: 10Dzahn) [23:29:56] (03PS1) 10Dzahn: publichtml,ircyall: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211316 [23:30:04] uh yeah I don't really get the plan with the reorg I guess [23:30:26] but at this hour it's not likely I'm going to comprehend much :-D [23:30:48] maybe we can chat about it early next week when we're both awake [23:31:00] (03PS2) 10Dzahn: publichtml,ircyall: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211316 [23:31:31] all right, giving up on snapshot watch and going to bed. see yas! [23:31:37] o/ [23:34:08] (03CR) 10Dzahn: [C: 032] publichtml,ircyall: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211316 (owner: 10Dzahn) [23:34:33] (03PS1) 10Dzahn: icinga ircbot, tor: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211317 [23:34:49] (03PS2) 10Dzahn: icinga ircbot, tor: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211317 [23:37:12] (03PS1) 10Dzahn: ipython, java: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211318 [23:37:34] (03CR) 10Dzahn: [C: 032] icinga ircbot, tor: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211317 (owner: 10Dzahn) [23:38:05] (03PS2) 10Dzahn: ipython, java: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211318 [23:40:22] (03PS1) 10Dzahn: ganeti, rancid: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211319 [23:40:38] (03CR) 10Dzahn: [C: 032] ipython, java: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211318 (owner: 10Dzahn) [23:41:11] (03PS2) 10Dzahn: ganeti, rancid: indentation fixes [puppet] - 
10https://gerrit.wikimedia.org/r/211319 [23:43:42] (03PS1) 10Dzahn: statsite, wdq-mm: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211321 [23:44:11] (03CR) 10Ori.livneh: [C: 031] statsite, wdq-mm: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211321 (owner: 10Dzahn) [23:44:30] (03CR) 10Dzahn: [C: 032] ganeti, rancid: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211319 (owner: 10Dzahn) [23:44:54] (03PS2) 10Dzahn: statsite, wdq-mm: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211321 [23:45:54] (03CR) 10Dzahn: [C: 032] statsite, wdq-mm: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211321 (owner: 10Dzahn) [23:50:35] (03PS1) 10Dzahn: puppet,puppet_compiler: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211322 [23:50:50] (03PS2) 10Dzahn: puppet,puppet_compiler: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211322 [23:53:31] (03PS1) 10Dzahn: kibana, labs_vmbuilder: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211323 [23:55:16] (03PS1) 10Dzahn: logstash: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211324 [23:55:38] (03PS2) 10Dzahn: kibana, labs_vmbuilder: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211323 [23:55:45] (03PS2) 10Dzahn: logstash: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211324 [23:56:40] (03CR) 10BryanDavis: [C: 031] logstash: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211324 (owner: 10Dzahn) [23:57:38] (03PS1) 10Dzahn: mediawiki: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211325 [23:57:55] (03PS2) 10Dzahn: mediawiki: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211325
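[Editor's note] The stream of "indentation fixes" commits above is mechanical cleanup of the kind puppet-lint's soft-tab checks flag (T93645, aligning the lint config with coding style). A toy normalizer in that spirit, expanding leading tabs to spaces: the 4-space width is an assumption for illustration, not a claim about the repo's actual style rules.

```python
def detab(line, width=4):
    """Replace each leading tab with `width` spaces; leave the rest intact."""
    stripped = line.lstrip("\t")
    tabs = len(line) - len(stripped)
    return " " * (width * tabs) + stripped

print(detab("\t\tensure => present,"))  # 8 leading spaces, then the attribute
```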