[00:00:00] yup. I missed that woo
[00:00:03] *too
[00:00:10] revert or fix and merge?
[00:00:18] fix and merge I think
[00:00:31] you on the patch?
[00:01:32] bd808|deploy, I don't have a solution yet
[00:02:09] then we'd better revert and try again tomorrow
[00:02:24] ok
[00:02:53] (PS1) BryanDavis: Revert "Add AffCom user group application contact page on meta" [mediawiki-config] - https://gerrit.wikimedia.org/r/207328
[00:02:55] (CR) jenkins-bot: [V: -1] Revert "Add AffCom user group application contact page on meta" [mediawiki-config] - https://gerrit.wikimedia.org/r/207328 (owner: BryanDavis)
[00:03:30] I guess you could pull mMessagePrefix with reflection and check that it's contactpage-affcomusergroup
[00:03:47] (PS2) BryanDavis: Revert "Add AffCom user group application contact page on meta" [mediawiki-config] - https://gerrit.wikimedia.org/r/207328
[00:04:14] (CR) BryanDavis: [C: 2] Revert "Add AffCom user group application contact page on meta" [mediawiki-config] - https://gerrit.wikimedia.org/r/207328 (owner: BryanDavis)
[00:04:20] (Merged) jenkins-bot: Revert "Add AffCom user group application contact page on meta" [mediawiki-config] - https://gerrit.wikimedia.org/r/207328 (owner: BryanDavis)
[00:05:59] !log bd808 Synchronized wmf-config/CommonSettings.php: Revert of AffCom contact form {{gerrit|207328}} (duration: 00m 19s)
[00:06:05] Logged the message, Master
[00:06:43] Or you could do some hack involving adding a hidden field with a specific name (a random hash perhaps) and checking that it's set in mFieldData
[00:07:24] hmmm... hidden field is easy to hack
[00:07:39] we need the page name or something similar
[00:07:39] !log bd808 Synchronized docroot/noc/createTxtFileSymlinks.sh: Revert of AffCom contact form {{gerrit|207328}} (duration: 00m 35s)
[00:07:45] Logged the message, Master
[00:09:14] bd808|deploy, I don't think it would depend on what the user sends
[00:09:45] oh because it's on the php side?
[00:09:48] right
[00:09:52] *nod*
[00:10:15] it would only happen to get sent to the user because we'd have to send it via the AdditionalFields
[00:10:21] which is annoying
[00:10:43] I can look at it tomorrow
[00:10:51] but the alternative is more reflection
[00:11:10] * bd808 declared swat {{done}} for tonight.
[00:11:19] iptables: instead of whining like "omg, FATAL: Error inserting ip_tables.. iptables table `filter': Table does not exist (do you need to insmod?) .. Perhaps iptables or your kernel needs to be upgraded.
[00:11:20] thanks bd808 :)
[00:11:36] instead you could just say "got root?" to remind me
[00:11:38] gwicke: I didn't do your patch because ... reasons... like you didn't answer the pings
[00:11:59] greg-g: yw
[00:12:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[00:13:18] (CR) Dzahn: [C: 2] backup: remove fileset for contacts again [puppet] - https://gerrit.wikimedia.org/r/207281 (https://phabricator.wikimedia.org/T90679) (owner: Dzahn)
[00:15:38] bd808: pong
[00:16:09] heh
[00:16:18] tin's open if you want to do your config patch
[00:16:26] (PS1) Alex Monk: [WIP] Revert "Revert "Add AffCom user group application contact page on meta"" [mediawiki-config] - https://gerrit.wikimedia.org/r/207332
[00:16:29] * bd808 is late for dinner
[00:16:40] bd808: kk, was afk for a bit
[00:16:44] (PS1) Yuvipanda: labs_lvm: remove quoting around -t param to mkfs [puppet] - https://gerrit.wikimedia.org/r/207333
[00:16:56] will do, unless greg-g has objections
[00:17:10] gwicke: no worries. I was plenty busy
[00:17:47] bd808: kk, thx!
[00:18:07] (CR) GWicke: [C: 2] Load HTML directly from RESTBase for enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/206319 (https://phabricator.wikimedia.org/T95229) (owner: GWicke)
[00:18:13] (Merged) jenkins-bot: Load HTML directly from RESTBase for enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/206319 (https://phabricator.wikimedia.org/T95229) (owner: GWicke)
[00:20:38] !log gwicke Synchronized wmf-config/InitialiseSettings.php: VE: Load HTML directly from RESTBase for enwiki (duration: 00m 22s)
[00:20:47] Logged the message, Master
[00:22:34] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[00:23:34] gwicke: Thanks!
[00:23:38] * greg-g looks in, looks away ;)
[00:24:24] greg-g: I’m getting a consistent npm failure on the build for this patch (https://gerrit.wikimedia.org/r/#/c/207318/). Any idea who I should bug about that?
[00:24:47] (PS2) Alex Monk: Revert "Revert "Add AffCom user group application contact page on meta"" [mediawiki-config] - https://gerrit.wikimedia.org/r/207332
[00:24:48] The error message is totally mysterious to me
[00:25:11] npm ERR! EEXIST, symlink '../esprima-harmony-jscs/bin/esparse.js'
[00:25:49] kaldari: Known issue. It recurs always?
[00:25:59] (PS3) Alex Monk: Revert "Revert "Add AffCom user group application contact page on meta"" [mediawiki-config] - https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789)
[00:26:11] James_F: well, twice so far. Should I just keep trying?
[00:26:33] how about this part " Previous build's commit, 9cc54158753e6011d16b8f176e34f496f5aac251, does not exist in the current repository."
[00:26:36] is that normal?
[00:26:47] kaldari: It normally fixes itself after one try; it's a race condition in upstream (esprima) for which a fix is being written, co-ordinated by Krinkle.
[00:27:09] James_F: looking good
[00:27:12] kaldari: generally, -releng ;)
[00:30:41] kaldari: Being based on top of an abandoned outdated WIP patch probably doesn't help, though. :-)
[00:31:16] James_F: I thought it wouldn’t matter, but maybe so :P
[00:31:28] kaldari: It shouldn't, I agree.
[00:31:31] kaldari: However. ;-)
[00:31:41] James_F: Finally! 3rd times a charm!
[00:31:47] :-)
[00:32:13] (CR) Tim Landscheidt: "I'll test a new iteration tomorrow." (6 comments) [puppet] - https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: Merlijn van Deen)
[00:32:59] anyone have a sense if interWiki links follow $wgExternalLinkTarget ?
[00:33:06] (PS2) Yuvipanda: labs_lvm: remove quoting around -t param to mkfs [puppet] - https://gerrit.wikimedia.org/r/207333
[00:33:14] (CR) Yuvipanda: [C: 2 V: 2] labs_lvm: remove quoting around -t param to mkfs [puppet] - https://gerrit.wikimedia.org/r/207333 (owner: Yuvipanda)
[00:34:23] (PS1) Dzahn: add codf db servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/207336 (https://phabricator.wikimedia.org/T96383)
[00:36:49] (CR) Dzahn: "rm self, been a long time without responses" [puppet] - https://gerrit.wikimedia.org/r/158023 (owner: Reedy)
[00:37:55] !log xtrabackup clone db2028 to db2046
[00:38:11] Logged the message, Master
[00:38:30] (CR) Dzahn: "i wouldn't recommend it but it's been in my queue so long i'll abstain" [puppet] - https://gerrit.wikimedia.org/r/159167 (owner: RobH)
[00:38:44] !log xtrabackup clone db2029 to db2047
[00:38:48] Logged the message, Master
[00:39:31] (PS1) Aude: Enable Wikibase usage tracking on frwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/207337 (https://phabricator.wikimedia.org/T96683)
[00:39:33] (CR) Dzahn: "i guess so, but it's a releng thing" [mediawiki-config] - https://gerrit.wikimedia.org/r/187278 (https://phabricator.wikimedia.org/T369) (owner: Spage)
[00:40:54] (PS1) 
Ori.livneh: Introduce apache::static_site [puppet] - https://gerrit.wikimedia.org/r/207338
[00:41:20] (CR) Dzahn: "adding gage and godog because it touches ipsec and swift" [puppet] - https://gerrit.wikimedia.org/r/195023 (owner: KartikMistry)
[00:41:53] (PS1) Yuvipanda: tools: Temporarily disable exec_environ [puppet] - https://gerrit.wikimedia.org/r/207339
[00:42:05] (PS2) Yuvipanda: tools: Temporarily disable exec_environ [puppet] - https://gerrit.wikimedia.org/r/207339
[00:42:43] (CR) Yuvipanda: [C: -2] "Cherry-pick testing on toolsbeta, don't actually merge" [puppet] - https://gerrit.wikimedia.org/r/207339 (owner: Yuvipanda)
[00:46:20] mutante: "Previous build's commit" is normal yes. It's part of a component in Jenkins that can't be disabled. That component makes the classic assumption from the svn days that every commit being submitted to CI is based on the previous. However in our set up more often than not proposed commits come in out of order and don't share the same previous parent at all. E.g. rebase a pending commit fr
[00:46:21] om last week right after merging a fresh one.
[00:47:19] Krinkle: ok, got it. thanks for explaining
[00:48:05] mutante: sure. Oh, and this effect is even more the case when different project re-use the same job (e.g. the "npm" job or the "phpunit" jobs are used by ve, mw, oojs, cdb etc. 
don't share any history)
[00:48:09] (CR) Dereckson: [C: 1] Enable GeoData at cawikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/199930 (https://phabricator.wikimedia.org/T93637) (owner: Gerardduenas)
[00:49:39] (CR) Dzahn: "adding more reviewers, rebase" [puppet] - https://gerrit.wikimedia.org/r/184637 (owner: Anomie)
[00:49:45] (PS3) Dzahn: Revert "Revert of Iab860b8a5: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php" [puppet] - https://gerrit.wikimedia.org/r/184637 (owner: Anomie)
[00:50:39] AaronSchulz, what's a good way for me to trouble-shoot a possible master/slave/lag-related issue.
[00:51:01] It fails in production, but works locally (even with the fake master/slave setup, but haven't tried fakeSlaveLag yet) and works on Beta Cluster.
[00:51:23] (CR) Yuvipanda: "Puppet is still disabled on all the exec hosts - this was primarily because we're futzing around with /tmp as well, and replacing /tmp in " [puppet] - https://gerrit.wikimedia.org/r/207304 (owner: Yuvipanda)
[00:51:46] I'm not sure that it's caused by master/slave, but want to investigate.
[00:52:05] (PS4) Alex Monk: Revert "Revert "Add AffCom user group application contact page on meta"" [mediawiki-config] - https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789)
[00:52:29] (PS1) GWicke: Revert "Add dummy "/preconnect" URL endpoint on restbase varnishes" [puppet] - https://gerrit.wikimedia.org/r/207341 (https://phabricator.wikimedia.org/T97500)
[00:52:37] (PS2) Dzahn: purge webrequest logs after 90 days [puppet] - https://gerrit.wikimedia.org/r/197081 (owner: ArielGlenn)
[00:57:25] (CR) Dzahn: "looked good but it's been last year, so dunno" [puppet] - https://gerrit.wikimedia.org/r/177427 (https://phabricator.wikimedia.org/T71604) (owner: Yuvipanda)
[00:57:51] (Abandoned) Yuvipanda: base: Make number of days acct logs are kept customizable [puppet] - https://gerrit.wikimedia.org/r/177427 (https://phabricator.wikimedia.org/T71604) (owner: Yuvipanda)
[01:00:35] (CR) Dzahn: [C: -1] "could you do this minus the changes to the mediawiki module? thanks" [puppet] - https://gerrit.wikimedia.org/r/204626 (https://phabricator.wikimedia.org/T96431) (owner: Alex Monk)
[01:01:26] (CR) Dzahn: "+1 as well" [puppet] - https://gerrit.wikimedia.org/r/195444 (https://phabricator.wikimedia.org/T40516) (owner: Chmarkine)
[01:02:38] (CR) Dzahn: "i have a feeling this will wait until zirconium is replaced with ganeti VMs" [puppet] - https://gerrit.wikimedia.org/r/192827 (https://phabricator.wikimedia.org/T90676) (owner: John F. 
Lewis)
[01:03:28] (CR) Dzahn: "what Tim said, except, it's not merged yet" [puppet] - https://gerrit.wikimedia.org/r/111387 (owner: Jeremyb)
[01:04:25] (PS1) Ori.livneh: apt::init: use os_version instead of $::lsbdistid [puppet] - https://gerrit.wikimedia.org/r/207346
[01:04:42] (CR) Dzahn: "i think we'll replace it with virtual machines instead" [dns] - https://gerrit.wikimedia.org/r/192828 (https://phabricator.wikimedia.org/T90676) (owner: John F. Lewis)
[01:08:53] (CR) Dzahn: "hey ariel, so why not just merge this and work on it from there? it's not like it's going to be used by jenkins yet, right" [puppet] - https://gerrit.wikimedia.org/r/175442 (owner: ArielGlenn)
[01:09:16] (PS2) Ori.livneh: apt::init: use os_version instead of $::lsbdistid [puppet] - https://gerrit.wikimedia.org/r/207346
[01:09:54] (CR) Dzahn: "does module/shinken (alex) conflict with module/shinken (yuvi)? how to merge them into one?" [puppet] - https://gerrit.wikimedia.org/r/124861 (owner: Alexandros Kosiaris)
[01:10:45] (CR) Dzahn: "yet another bump" [puppet] - https://gerrit.wikimedia.org/r/145018 (owner: ArielGlenn)
[01:10:50] (CR) Tim Landscheidt: "Oh, yeah, don't know why I thought the other change has been merged." 
[puppet] - https://gerrit.wikimedia.org/r/111387 (owner: Jeremyb)
[01:12:20] (PS5) Alex Monk: Change BZ references to Phabricator tickets [puppet] - https://gerrit.wikimedia.org/r/204626 (https://phabricator.wikimedia.org/T96431)
[01:13:52] (PS3) Ori.livneh: apt::init: avoid using case statement [puppet] - https://gerrit.wikimedia.org/r/207346
[01:15:46] (PS2) GWicke: Revert "Add dummy "/preconnect" URL endpoint on restbase varnishes" [puppet] - https://gerrit.wikimedia.org/r/207341 (https://phabricator.wikimedia.org/T97500)
[01:16:10] (CR) Ori.livneh: [C: 1] Revert "Add dummy "/preconnect" URL endpoint on restbase varnishes" [puppet] - https://gerrit.wikimedia.org/r/207341 (https://phabricator.wikimedia.org/T97500) (owner: GWicke)
[01:16:59] (PS6) Dzahn: Change BZ references to Phabricator tickets [puppet] - https://gerrit.wikimedia.org/r/204626 (https://phabricator.wikimedia.org/T96431) (owner: Alex Monk)
[01:18:42] (PS4) Ori.livneh: apt::init: avoid using case statement [puppet] - https://gerrit.wikimedia.org/r/207346
[01:18:51] (CR) Ori.livneh: [C: 2 V: 2] apt::init: avoid using case statement [puppet] - https://gerrit.wikimedia.org/r/207346 (owner: Ori.livneh)
[01:19:17] greg-g: i am ready to deploy https://gerrit.wikimedia.org/r/#/c/207337/ for frwikisource
[01:19:29] unless there is some reason not to right now
[01:19:46] (CR) Dzahn: [C: 2] "thank you" [puppet] - https://gerrit.wikimedia.org/r/204626 (https://phabricator.wikimedia.org/T96431) (owner: Alex Monk)
[01:22:21] probably will be ready for nlwiki soon also
[01:22:54] (PS2) Dzahn: add codfw db servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/207336 (https://phabricator.wikimedia.org/T96383)
[01:28:42] (PS1) Yuvipanda: tools: Create consistent 3x RAM swap on all exec nodes [puppet] - https://gerrit.wikimedia.org/r/207354
[01:29:44] (PS1) Alex Monk: Change BZ references to Phabricator tickets in MediaWiki 
module [puppet] - https://gerrit.wikimedia.org/r/207355 (https://phabricator.wikimedia.org/T96431)
[01:29:47] (PS2) Yuvipanda: tools: Create consistent 3x RAM swap on all exec nodes [puppet] - https://gerrit.wikimedia.org/r/207354
[01:32:14] (PS3) Yuvipanda: tools: Create consistent 3x RAM swap on all exec nodes [puppet] - https://gerrit.wikimedia.org/r/207354
[01:34:28] (PS4) Yuvipanda: tools: Create consistent 3x RAM swap on all exec nodes [puppet] - https://gerrit.wikimedia.org/r/207354
[01:35:51] (CR) Springle: [C: 1] add codfw db servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/207336 (https://phabricator.wikimedia.org/T96383) (owner: Dzahn)
[01:37:34] (PS5) Yuvipanda: tools: Create consistent 3x RAM swap on all exec nodes [puppet] - https://gerrit.wikimedia.org/r/207354
[01:40:05] (CR) Dzahn: [C: 1] Add ebernhardson to researchers group [puppet] - https://gerrit.wikimedia.org/r/207133 (https://phabricator.wikimedia.org/T97332) (owner: John F. Lewis)
[01:41:49] (CR) Dzahn: "thanks @ottomata for confirming. added bblack for being on duty guy this week" [puppet] - https://gerrit.wikimedia.org/r/207133 (https://phabricator.wikimedia.org/T97332) (owner: John F. Lewis)
[01:47:11] wikibugs: as usual?
[01:47:52] (PS2) Aude: Enable Wikibase usage tracking on frwikisource and nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/207337 (https://phabricator.wikimedia.org/T96683)
[01:48:13] (PS1) Ori.livneh: Fix .travis.yml test invocation command [debs/pybal] - https://gerrit.wikimedia.org/r/207357
[01:48:27] (CR) Ori.livneh: [C: 2 V: 2] Fix .travis.yml test invocation command [debs/pybal] - https://gerrit.wikimedia.org/r/207357 (owner: Ori.livneh)
[01:54:45] (PS6) Yuvipanda: tools: Create consistent 3x RAM swap on all exec nodes [puppet] - https://gerrit.wikimedia.org/r/207354
[01:57:39] (PS1) Ori.livneh: Don't omit 'test' [debs/pybal] - https://gerrit.wikimedia.org/r/207359
[01:57:47] (CR) Ori.livneh: [C: 2 V: 2] Don't omit 'test' [debs/pybal] - https://gerrit.wikimedia.org/r/207359 (owner: Ori.livneh)
[02:09:14] PROBLEM - puppet last run on mw1068 is CRITICAL Puppet has 1 failures
[02:10:51] (CR) Ori.livneh: [C: 1] tools: Create consistent 3x RAM swap on all exec nodes [puppet] - https://gerrit.wikimedia.org/r/207354 (owner: Yuvipanda)
[02:14:01] (PS7) Yuvipanda: tools: Create consistent 3x RAM swap on all exec nodes [puppet] - https://gerrit.wikimedia.org/r/207354
[02:15:23] (PS8) Yuvipanda: tools: Create consistent 3x RAM swap on all exec nodes [puppet] - https://gerrit.wikimedia.org/r/207354 (https://phabricator.wikimedia.org/T95979)
[02:15:38] (CR) Yuvipanda: [C: 2 V: 2] tools: Create consistent 3x RAM swap on all exec nodes [puppet] - https://gerrit.wikimedia.org/r/207354 (https://phabricator.wikimedia.org/T95979) (owner: Yuvipanda)
[02:17:48] (Abandoned) Yuvipanda: WIP: Introducing Shinken module [puppet] - https://gerrit.wikimedia.org/r/124861 (owner: Alexandros Kosiaris)
[02:18:57] (Abandoned) Yuvipanda: tools: Temporarily disable exec_environ [puppet] - https://gerrit.wikimedia.org/r/207339 (owner: Yuvipanda)
[02:25:53] RECOVERY - puppet last run on 
mw1068 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[02:35:19] * aude is ready to deploy https://gerrit.wikimedia.org/r/#/c/207337/ :)
[02:35:26] nlwiki + frwikisource
[02:37:03] (CR) Aude: [C: 2] Enable Wikibase usage tracking on frwikisource and nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/207337 (https://phabricator.wikimedia.org/T96683) (owner: Aude)
[02:37:59] (Merged) jenkins-bot: Enable Wikibase usage tracking on frwikisource and nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/207337 (https://phabricator.wikimedia.org/T96683) (owner: Aude)
[02:39:50] bah... l10nupdate
[02:40:15] !log l10nupdate Synchronized php-1.26wmf2/cache/l10n: (no message) (duration: 25m 50s)
[02:40:28] Logged the message, Master
[02:40:36] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable Wikibase usage tracking on nlwiki and frwikisource (duration: 00m 12s)
[02:40:42] Logged the message, Master
[02:41:44] PROBLEM - MySQL Idle Transactions on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 605 seconds
[02:42:03] PROBLEM - MySQL InnoDB on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 618 seconds
[02:44:47] (CR) Ottomata: [C: -1] "please wait on this, or sync up with analytics :)" [puppet] - https://gerrit.wikimedia.org/r/197081 (owner: ArielGlenn)
[02:44:57] !log LocalisationUpdate completed (1.26wmf2) at 2015-04-29 02:43:54+00:00
[02:45:03] Logged the message, Master
[02:46:53] RECOVERY - MySQL InnoDB on db1040 is OK longest blocking idle transaction sleeps for 0 seconds
[02:48:14] RECOVERY - MySQL Idle Transactions on db1040 is OK longest blocking idle transaction sleeps for 0 seconds
[02:48:16] !log killed eight stalled commonswiki.transcode transactions on db1040
[02:48:24] Logged the message, Master
[02:57:48] (PS1) BryanDavis: bd808: Add second ssh public key for personal laptop [puppet] - https://gerrit.wikimedia.org/r/207369
[02:59:07] (PS2) BryanDavis: bd808: Add second ssh public key for personal laptop [puppet] - https://gerrit.wikimedia.org/r/207369
[03:16:39] (CR) Yuvipanda: "I'll admit to not seeing this and writing https://gerrit.wikimedia.org/r/#/c/207354/ instead. sorry :|" [puppet] - https://gerrit.wikimedia.org/r/207306 (owner: coren)
[03:23:13] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 4789.66068228
[03:28:33] (PS1) Yuvipanda: labs_lvm: Mount swap partition so it sticks [puppet] - https://gerrit.wikimedia.org/r/207374
[03:32:32] (PS2) Yuvipanda: labs_lvm: Mount swap partition so it sticks [puppet] - https://gerrit.wikimedia.org/r/207374
[03:36:27] (CR) Yuvipanda: "I stole from this for https://gerrit.wikimedia.org/r/#/c/207354/ though. I did not know that I had to 'mount' and thus the swap was going " [puppet] - https://gerrit.wikimedia.org/r/207306 (owner: coren)
[03:36:55] (CR) Yuvipanda: "I meant I stole it for https://gerrit.wikimedia.org/r/#/c/207374/" [puppet] - https://gerrit.wikimedia.org/r/207306 (owner: coren)
[03:37:24] (PS3) Yuvipanda: labs_lvm: Mount swap partition so it sticks [puppet] - https://gerrit.wikimedia.org/r/207374 (https://phabricator.wikimedia.org/T95979)
[03:38:11] (CR) Yuvipanda: [C: 2 V: 2] labs_lvm: Mount swap partition so it sticks [puppet] - https://gerrit.wikimedia.org/r/207374 (https://phabricator.wikimedia.org/T95979) (owner: Yuvipanda)
[03:40:13] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 39m 55s)
[03:40:25] Logged the message, Master
[03:47:08] !log LocalisationUpdate completed (1.26wmf3) at 2015-04-29 03:46:05+00:00
[03:47:15] Logged the message, Master
[04:00:16] (PS1) Ori.livneh: apt: Use $::operatingsystem fact rather than $::lsbdistid [puppet] - https://gerrit.wikimedia.org/r/207376
[04:01:22] (CR) Ori.livneh: [C: 2] apt: Use 
$::operatingsystem fact rather than $::lsbdistid [puppet] - https://gerrit.wikimedia.org/r/207376 (owner: Ori.livneh)
[04:03:49] (PS2) Ori.livneh: Introduce apache::static_site [puppet] - https://gerrit.wikimedia.org/r/207338
[04:06:44] (PS1) Chad: WIP: Elastic: move merge_threads to hiera [puppet] - https://gerrit.wikimedia.org/r/207377
[04:08:42] <^d> That's the last of the ::elasticsearch class params out of the role::elasticsearch::server class and in hiera \o/
[04:10:29] <^d> Can you call a puppet class sexy?
[04:10:43] <^d> Because role::elasticsearch::server is lean and sexy now
[04:20:54] (PS1) KartikMistry: CX: Add REST API support [puppet] - https://gerrit.wikimedia.org/r/207378
[04:21:37] (CR) jenkins-bot: [V: -1] CX: Add REST API support [puppet] - https://gerrit.wikimedia.org/r/207378 (owner: KartikMistry)
[04:30:38] (PS1) Yuvipanda: tools: Make webservice read default server service manifest [puppet] - https://gerrit.wikimedia.org/r/207380 (https://phabricator.wikimedia.org/T94788)
[04:35:31] (PS2) KartikMistry: CX: Add REST API support [puppet] - https://gerrit.wikimedia.org/r/207378
[04:50:27] (PS3) KartikMistry: CX: Add REST API support [puppet] - https://gerrit.wikimedia.org/r/207378
[04:58:25] (PS3) Springle: add codfw db servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/207336 (https://phabricator.wikimedia.org/T96383) (owner: Dzahn)
[04:59:13] (CR) Springle: [C: 2] add codfw db servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/207336 (https://phabricator.wikimedia.org/T96383) (owner: Dzahn)
[05:04:32] (PS1) Springle: deploy db2043 s3, db2044 s4, db2045, s5, db2046 s6, db2047 s7 [puppet] - https://gerrit.wikimedia.org/r/207386
[05:05:28] (CR) Springle: [C: 2] deploy db2043 s3, db2044 s4, db2045, s5, db2046 s6, db2047 s7 [puppet] - https://gerrit.wikimedia.org/r/207386 (owner: Springle)
[05:28:23] !log tstarling Synchronized 
php-1.26wmf3/extensions/SecurePoll: (no message) (duration: 00m 13s)
[05:28:31] Logged the message, Master
[05:53:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]
[06:13:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[06:17:44] PROBLEM - puppet last run on cp3021 is CRITICAL Puppet has 1 failures
[06:23:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[06:29:54] PROBLEM - puppet last run on db1042 is CRITICAL Puppet has 1 failures
[06:29:55] PROBLEM - puppet last run on lvs2004 is CRITICAL Puppet has 1 failures
[06:30:05] PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 1 failures
[06:30:06] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures
[06:30:58] <_joe_> morning
[06:31:16] PROBLEM - puppet last run on mw1092 is CRITICAL Puppet has 1 failures
[06:31:25] PROBLEM - puppet last run on cp4001 is CRITICAL puppet fail
[06:31:34] RECOVERY - Router interfaces on cr2-eqiad is OK host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0
[06:33:04] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures
[06:34:45] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 2 failures
[06:35:35] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 2 failures
[06:35:55] RECOVERY - puppet last run on cp3021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:36:44] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures
[06:37:05] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:46:15] RECOVERY - puppet last run on mw1092 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:46:25] RECOVERY - puppet last run on cp4008 is OK 
Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:25] RECOVERY - puppet last run on db1042 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:46:34] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:35] RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:45] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:46:45] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:46] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:05] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:15] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:45] (PS2) Giuseppe Lavagetto: hiera/nuyaml: remove dynamic lookups [puppet] - https://gerrit.wikimedia.org/r/207127
[06:48:24] (CR) Giuseppe Lavagetto: [C: 2] hiera/nuyaml: remove dynamic lookups [puppet] - https://gerrit.wikimedia.org/r/207127 (owner: Giuseppe Lavagetto)
[06:49:44] RECOVERY - puppet last run on cp4001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:53:52] (CR) Giuseppe Lavagetto: [C: -1] "As stated elsewhere, I don't really like the idea of using the role keyword on single machines." 
[puppet] - https://gerrit.wikimedia.org/r/206025 (owner: Dzahn)
[06:56:34] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]
[07:12:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Apr 29 07:11:38 UTC 2015 (duration 11m 37s)
[07:12:48] Logged the message, Master
[07:14:54] RECOVERY - Router interfaces on cr2-eqiad is OK host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0
[07:20:14] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[07:30:14] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[07:43:35] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[07:53:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[08:38:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[08:47:45] (CR) Filippo Giunchedi: [C: 1] Syntax: Fixed yaml [puppet] - https://gerrit.wikimedia.org/r/195023 (owner: KartikMistry)
[08:48:45] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[09:04:54] (PS3) Filippo Giunchedi: statsite: improve restart [puppet] - https://gerrit.wikimedia.org/r/206819
[09:05:01] (CR) Filippo Giunchedi: [C: 2 V: 2] statsite: improve restart [puppet] - https://gerrit.wikimedia.org/r/206819 (owner: Filippo Giunchedi)
[09:36:45] PROBLEM - puppet last run on mw1121 is CRITICAL Puppet has 1 failures
[09:44:06] (PS2) Filippo Giunchedi: statsite: enable extended counters by default [puppet] - https://gerrit.wikimedia.org/r/206781 (https://phabricator.wikimedia.org/T95703)
[09:44:15] (CR) Filippo Giunchedi: [C: 
2 V: 2] statsite: enable extended counters by default [puppet] - https://gerrit.wikimedia.org/r/206781 (https://phabricator.wikimedia.org/T95703) (owner: Filippo Giunchedi)
[09:45:28] (CR) Alexandros Kosiaris: [C: 1] CX: Add REST API support [puppet] - https://gerrit.wikimedia.org/r/207378 (owner: KartikMistry)
[09:45:48] (PS2) Giuseppe Lavagetto: hiera: Add a proxy backend [puppet] - https://gerrit.wikimedia.org/r/207128
[09:49:28] (CR) Mobrovac: [C: -1] "Little URL issue in-lined." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/207378 (owner: KartikMistry)
[09:53:25] RECOVERY - puppet last run on mw1121 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[09:54:04] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.67% of data above the critical threshold [1000.0]
[09:54:41] <_joe_> too many creates?
[09:54:55] <_joe_> godog: what is this ^^ and how can I look at it properly?
[09:56:51] _joe_: expected after https://gerrit.wikimedia.org/r/206781 but it'd be /var/log/upstart/carbon_cache
[09:57:07] <_joe_> godog: ok
[09:57:12] <_joe_> thanks :)
[09:58:00] np, forgot to silence afterwards
[09:58:26] ACKNOWLEDGEMENT - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] Filippo Giunchedi effects from https://gerrit.wikimedia.org/r/206781
[10:04:15] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[10:06:48] (CR) Mobrovac: [C: -1] "LGTM, minor issues only." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/207128 (owner: Giuseppe Lavagetto)
[10:07:35] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0]
[10:12:05] (PS2) Filippo Giunchedi: Revert "eventlogging: adjust counters thresholds" [puppet] - https://gerrit.wikimedia.org/r/206797 (https://phabricator.wikimedia.org/T95703)
[10:12:13] (CR) Filippo Giunchedi: [C: 2 V: 2] Revert "eventlogging: adjust counters thresholds" [puppet] - https://gerrit.wikimedia.org/r/206797 (https://phabricator.wikimedia.org/T95703) (owner: Filippo Giunchedi)
[10:21:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[10:30:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[10:33:28] akosiaris, hi, could you add me as an admin to github's wikimedia pls
[10:34:27] akosiaris, I need to migrate https://github.com/nyurik/osm2pgsql-osm-bright.tm2source
[10:39:08] yurik: me ? I think I don't have the rights for this I think
[10:39:39] akosiaris, mobrovac said you are an admin there
[10:40:11] https://github.com/wikimedia
[10:40:21] I am ?
[10:40:49] that's what mobrovac said :D
[10:41:21] yup, you're an owner akosiaris
[10:42:05] *akosiaris* realizes he is the king of the world... and wasn't told about it... :)
[10:42:43] can't say I've ever used that right
[10:42:45] <_joe_> I don't think adding people as admins should be done like this
[10:43:05] i don't need "adminship", just the right to accept pull reqs
[10:43:39] make sure MaxSem is also added
[10:43:45] <_joe_> on that specific repo?
[10:43:55] <_joe_> that's a matter for a phab ticket I guess [10:44:33] yurik, I see most repos say "please go via gerrit" on their READMEs [10:44:58] well, descriptions but you get the diea [10:44:59] idea* [10:45:38] akosiaris, we are going same as services path - will experiment and share with the community, and later migrate to gerrit once we have something stable [10:45:52] (same as other services projects) [10:46:04] hasn't this been discussed to the death 3 years ago ? [10:46:13] i wasn't part of that :D [10:46:17] neither was I [10:46:20] talk to gwicke [10:46:36] haha [10:46:39] I'd rather avoid a philosophical debate [10:46:42] <_joe_> I think we should talk to the whole engineering /community/ [10:46:51] <_joe_> about all this [10:47:00] for something that has been discussed to the death before I was here [10:47:03] ooooh boy [10:47:37] <_joe_> relying on a commercial, closed source platform. Yes we must. [10:47:45] <_joe_> I already said so to gabriel btw [10:47:45] afaik we should switch to Diffusion [10:47:56] _joe_, afaik, github is oos, isn't it? [10:48:00] oss [10:48:03] <_joe_> yurik: wat? [10:48:06] <_joe_> no it isn't [10:48:07] nope, it's not [10:48:15] gitlab is, github is not [10:48:24] <_joe_> if you want a github like flow, you use gitlab or gogs [10:49:17] the reason for github in services was travis providing support for cassandra [10:49:27] iirc [10:50:29] ? [10:50:45] I am not sure I follow [10:51:01] ok, sounds like i should remove the request for now :) Never mind, not worth it at the moment :) [10:51:20] akosiaris, i do want to talk to you about postgis db [10:51:26] is it back up? [10:52:14] (and we should migrate our stuff to gerrit, while cloning it to github for easy access/contribution for other devs) [10:52:38] yurik: IIRC putting in gerrit will create a copy in github anyway [10:54:25] yurik: as far as the postgres db goes, it's still in the same state. 
I need to sync up with yuvi and get account creation working [10:54:30] YuviPanda: ping me when online [10:54:57] MaxSem, ^ [10:55:12] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds [10:55:20] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [10:56:51] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [11:38:31] (03Abandoned) 10Alexandros Kosiaris: WIP: Move ulsfo to ganglia_new module [puppet] - 10https://gerrit.wikimedia.org/r/185937 (owner: 10Alexandros Kosiaris) [11:45:06] (03PS1) 10Alexandros Kosiaris: Add the LVS blocks to url_downloader [puppet] - 10https://gerrit.wikimedia.org/r/207419 [12:04:22] (03CR) 10Giuseppe Lavagetto: "thanks for the suggestions." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/207128 (owner: 10Giuseppe Lavagetto) [12:18:32] (03PS3) 10Giuseppe Lavagetto: hiera: Add a proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/207128 (https://phabricator.wikimedia.org/T93776) [12:29:33] oh joy, phab labs spam [12:31:07] (03PS1) 10Faidon Liambotis: smokeping: remove mr1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/207427 [12:31:30] (03CR) 10Faidon Liambotis: [C: 032 V: 032] smokeping: remove mr1-codfw [puppet] - 10https://gerrit.wikimedia.org/r/207427 (owner: 10Faidon Liambotis) [12:40:58] paravoid: Sorry, should be done now: https://phabricator.wikimedia.org/T89270 [12:41:15] just mass-filter [12:41:56] :) [12:42:10] chasemp had a way of doing those with db queries that resulted in no notifications, iirc [12:42:23] that kind of volume is manageable, though, so I don't mind :) [12:42:29] directly on the SQL DB, yeah :D [12:42:56] I guess I should get shell access and more comfortable screwing up things with my limited SQL skills. 
:P [13:03:32] !log disabling netflows on cr1/2-ulsfo [13:03:39] Logged the message, Master [13:06:28] (03CR) 10Hashar: [C: 031] contint: Use device=none in tmpfs [puppet] - 10https://gerrit.wikimedia.org/r/204542 (owner: 10Krinkle) [13:08:59] (03PS1) 10KartikMistry: Beta: CX: Add languages for Deployment on 20150430 [puppet] - 10https://gerrit.wikimedia.org/r/207433 [13:12:01] PROBLEM - puppet last run on cp3003 is CRITICAL puppet fail [13:19:42] andre__: can you create a tag "Global renaming" in phab, for easy tracking of gr related issues/bugs. [13:21:42] Steinsplitter, https://www.mediawiki.org/wiki/Phabricator/Creating_and_renaming_projects [13:23:15] thanks, will look at the link [13:29:00] RECOVERY - puppet last run on cp3003 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [13:37:24] (03PS2) 10Andrew Bogott: Add a couple of settings to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/205979 [13:37:26] (03PS1) 10Andrew Bogott: Depool labvirt1005 because it's throwing memory errors [puppet] - 10https://gerrit.wikimedia.org/r/207437 [13:42:47] (03CR) 10Andrew Bogott: [C: 032] Depool labvirt1005 because it's throwing memory errors [puppet] - 10https://gerrit.wikimedia.org/r/207437 (owner: 10Andrew Bogott) [13:43:30] Coren: fyi ^ [13:43:46] * Coren nods. [13:57:35] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: CX: Add languages for Deployment on 20150430 [puppet] - 10https://gerrit.wikimedia.org/r/207433 (owner: 10KartikMistry) [14:00:04] chasemp: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150429T1400). Please do the needful. [14:00:33] phabbot gone again? 
[14:06:45] bblack: some labs instance has an issue and it happens to host instances used by the irc bots [14:23:05] (03PS8) 10Chad: Hiera-ize the mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/204331 [14:24:04] (03PS2) 10Chad: Elastic: move auto_create_index into hiera instead of role [puppet] - 10https://gerrit.wikimedia.org/r/207140 [14:34:53] hashar: ah, I guess labs issues do raise their ugly heads once every year or three :) [14:35:11] bblack: it is more like once per month :] [14:35:26] week? [14:35:33] (03PS4) 10Alexandros Kosiaris: WIP: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [14:35:37] * bblack hides from the wrath of Coren [14:35:45] but given the amount of instances it supports, all the hardware and the fairly complicated stack, I am surprised we barely had any wholesale outage of doom yet [14:36:42] bblack: this one is a hardware memory failure. [14:36:50] It’s uncanny [14:38:32] overall I am very happy with labs infra :] [14:39:11] my biggest kudos goes to coren who replaced GlusterFS with a central NFS server. That made it way more reliable [14:45:51] can somebody merge this: https://gerrit.wikimedia.org/r/206083 [14:46:10] (03CR) 10Alexandros Kosiaris: [C: 032] Graphoid: Varnish configuration [puppet] - 10https://gerrit.wikimedia.org/r/206108 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [14:46:46] hashar: bblack: yeah, sometimes it looks like Labs has a curse - we do get a surprising amount of freak fails and hardware issues; and we are hit surprisingly often by the odd kernel bug. [14:47:00] Virtualization does tend to tickle kernel bugs often, admittedly. 
[14:48:00] But also we got freak things happening, like an effing *cable* failure on DAS - which is in the list of "things you gotta check when you have an issue but nobody *ever* expects is the problem for real because that never happens" [14:50:15] mobrovac: https://graphoid.wikimedia.org/_info [14:50:22] graphoid is live [14:50:28] gwicke, aude: Ping for SWAT in 10 minutes [14:50:34] anomie: pong [14:51:12] akosiaris: yay! [14:51:19] akosiaris: try this: https://graphoid.wikimedia.org/_info/home :) [14:51:21] yey [14:52:07] FYI I'm running through some unassigned high/ubn ops tasks and giving them owners. If you don't feel like you should be owning whatever it is, either ping me or pass the hot potato to someone else that's more appropriate! [14:52:31] Coren: when I used to work for an ISP noc, the first thing we did was to check the equipment hard power, cables connected with green/orange whatever lights on or blinking. [14:52:35] Coren: just to be sure :] [14:53:51] (03PS5) 10Alexandros Kosiaris: WIP: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [14:55:52] anyone up for a task related to setting up some kind of no-reply bouncer address for [[Special:EmailUser]] ? https://phabricator.wikimedia.org/T66795 [14:56:07] I know, it's email and therefore it's beneath the bottom priority of everyone [14:57:03] (03PS1) 10Mobrovac: service::node: Add the list of domains for which not to use the proxy [puppet] - 10https://gerrit.wikimedia.org/r/207454 (https://phabricator.wikimedia.org/T97530) [14:59:16] akosiaris, _joe_: if you guys could take a look at https://gerrit.wikimedia.org/r/#/c/207454/ i'd appreciate it :) [15:00:00] phuedx: Going to prepare the submodule update patches for your SWAT extension updates? 
[15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, gwicke, aude, anomie: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150429T1500). Please do the needful. [15:00:11] * anomie begins SWAT [15:00:14] anomie: on it now [15:00:16] gwicke: You're first [15:00:17] * aude here [15:00:21] will update when i've got 'em [15:00:34] (03PS4) 10Anomie: Load HTML directly from RESTBase on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206320 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [15:00:41] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206320 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [15:00:44] anomie: ok [15:00:46] yay [15:01:07] sorry to bother but can someone take a look at patch https://gerrit.wikimedia.org/r/206083 and https://gerrit.wikimedia.org/r/207430 these has to be merged before any other deploys on site creation. [15:02:22] (03PS1) 10Faidon Liambotis: autoinstall: switch to Debian stable for udebs [puppet] - 10https://gerrit.wikimedia.org/r/207457 [15:03:46] this is a swat case also: https://gerrit.wikimedia.org/r/206080 [15:04:48] Mjbmr: If you want something SWATted, add it to the list on https://wikitech.wikimedia.org/wiki/Deployments [15:05:10] (03Merged) 10jenkins-bot: Load HTML directly from RESTBase on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206320 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke) [15:05:29] sorry [15:05:56] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Load HTML directly from RESTBase on all wikipedias [[gerrit:206320]] (duration: 00m 17s) [15:05:56] gwicke: ^ Test please [15:06:01] it was not described on [[SWAT deploys]] tho. 
[15:06:03] Logged the message, Master [15:06:41] (03CR) 10Faidon Liambotis: [C: 032] autoinstall: switch to Debian stable for udebs [puppet] - 10https://gerrit.wikimedia.org/r/207457 (owner: 10Faidon Liambotis) [15:06:49] anomie: looks good, thanks! [15:07:11] anomie: You're next [15:07:20] (for the config change) [15:07:25] (03PS2) 10Anomie: Remove sampling of api.log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206865 (https://phabricator.wikimedia.org/T88393) [15:07:31] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206865 (https://phabricator.wikimedia.org/T88393) (owner: 10Anomie) [15:07:37] (03Merged) 10jenkins-bot: Remove sampling of api.log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206865 (https://phabricator.wikimedia.org/T88393) (owner: 10Anomie) [15:07:51] PROBLEM - configured eth on ganeti1001 is CRITICAL: tap0 reporting no carrier. [15:08:27] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove sampling of api.log [[gerrit:206865]] (duration: 00m 29s) [15:08:28] anomie: ^ Test please [15:08:33] Logged the message, Master [15:08:48] anomie: By the way `tail api.log` is moving, looks like it worked [15:08:56] aude: You're next [15:09:01] ok [15:09:37] (03CR) 10Anomie: [C: 04-1] Add abusefilter-modify-restricted right to sysop user group for idwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206080 (https://phabricator.wikimedia.org/T96542) (owner: 10Mjbmr) [15:10:03] anomie: gerrit links are updated to related submodule updates [15:10:12] phuedx: Thanks [15:10:30] anomie: The RESTbase change looks good to me. [15:11:14] James_F: Thanks. gwicke said it looked good too [15:11:26] * James_F nods. Thank you again. 
:-) [15:12:05] * gwicke reloads the graphs [15:12:19] http://grafana.wikimedia.org/#/dashboard/db/visualeditor-load-save [15:14:04] Mjbmr: In case you didn't notice, I saw a problem with your https://gerrit.wikimedia.org/r/#/c/206080/ [15:14:20] no I did, thanks [15:14:52] (03PS2) 10Mjbmr: Add abusefilter-modify-restricted right to sysop user group for idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206080 (https://phabricator.wikimedia.org/T96542) [15:14:56] done [15:15:31] (03CR) 10Anomie: [C: 031] Add abusefilter-modify-restricted right to sysop user group for idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206080 (https://phabricator.wikimedia.org/T96542) (owner: 10Mjbmr) [15:19:50] (03PS1) 10Ottomata: Mirror cdh5.4.0 in apt [puppet] - 10https://gerrit.wikimedia.org/r/207462 (https://phabricator.wikimedia.org/T97453) [15:20:22] (03CR) 10Ottomata: [C: 032 V: 032] Mirror cdh5.4.0 in apt [puppet] - 10https://gerrit.wikimedia.org/r/207462 (https://phabricator.wikimedia.org/T97453) (owner: 10Ottomata) [15:20:47] !log anomie Synchronized php-1.26wmf3/extensions/Wikidata: SWAT: Update Wikidata - fix change subscriptions script [[gerrit:207448]] (duration: 00m 53s) [15:20:48] aude: ^ Test please [15:20:51] ok [15:20:57] Logged the message, Master [15:20:57] akosiaris: puppet disabled on carbon? [15:21:06] anomie: You're next for your core change [15:21:24] ottomata: yes, me [15:21:28] seems good [15:21:41] ok, don't mind really, i'll manually apply the change i just merged, just 3 characters in updates :) [15:21:51] ottomata: debugging https://phabricator.wikimedia.org/T97481 [15:22:13] oh, good to know, thank you, i was going to try to install oxygen with jessie today [15:22:54] ottomata: you will fail then ;-) [15:23:00] ha, ok [15:23:03] i will wait! 
[15:24:01] PROBLEM - Apache HTTP on mw1232 is CRITICAL - Socket timeout after 10 seconds [15:24:30] PROBLEM - HHVM rendering on mw1232 is CRITICAL - Socket timeout after 10 seconds [15:24:36] (03CR) 10Chad: "Well snap. I'm trying the compiler but it's failing due to rack detection." [puppet] - 10https://gerrit.wikimedia.org/r/207140 (owner: 10Chad) [15:24:40] PROBLEM - HHVM rendering on mw1132 is CRITICAL - Socket timeout after 10 seconds [15:24:50] PROBLEM - Apache HTTP on mw1132 is CRITICAL - Socket timeout after 10 seconds [15:24:51] * ^d stabs puppet [15:24:55] <^d> *STAB STAB STAB* [15:26:50] PROBLEM - HHVM queue size on mw1232 is CRITICAL 66.67% of data above the critical threshold [80.0] [15:27:20] PROBLEM - HHVM busy threads on mw1132 is CRITICAL 75.00% of data above the critical threshold [86.4] [15:27:21] PROBLEM - HHVM busy threads on mw1232 is CRITICAL 75.00% of data above the critical threshold [115.2] [15:27:38] ^ number of new problems directly related to number of stabs [15:28:01] PROBLEM - HHVM queue size on mw1132 is CRITICAL 44.44% of data above the critical threshold [80.0] [15:29:52] ^ what's going on here? [15:30:11] stabs puppet <-> HHVM queue issues? [15:30:30] !log anomie Synchronized php-1.26wmf3/includes/api/ApiResult.php: SWAT: API: ApiResult must validate even when using numeric auto-indexes [[gerrit:207456]] (duration: 00m 26s) [15:30:31] anomie: ^ Test please [15:30:36] Logged the message, Master [15:30:38] anomie: Works! [15:30:43] phuedx: You're next [15:31:06] bueller? [15:31:15] anomie: cool [15:32:18] anyone with a possible correlation between SWAT + HHVM alerts? [15:32:37] <^d> bblack: My stabbing was unrelated afaik. [15:32:47] <^d> s/afaik// [15:33:34] the ApiResult one was deployed after the alerts [15:33:46] bblack: The change merged before that was aude's Wikidata change, but that was 4 minutes earlier. 
[15:33:58] s/merged/deployed/ [15:34:25] 15:05 < grrrit-wm> (Merged) jenkins-bot: Load HTML directly from RESTBase on all wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/206320 (https://phabricator.wikimedia.org/T95229) (owner: GWicke) [15:34:29] that perhaps? [15:34:58] gwicke, James_F: ^ ? [15:35:05] HHVM load? [15:35:12] If anything, the RB change should lower HHVM load. [15:35:18] *nod* [15:35:24] https://gerrit.wikimedia.org/r/#/c/206320/4/wmf-config/InitialiseSettings.php [15:35:31] (Quite a lot as a %age of VE's impact, in fact.) [15:35:38] it shifts traffic away from the PHP API [15:36:00] but those load rates are really low in any case [15:36:08] < 2/s [15:36:12] then why are they critical alerts? :) [15:36:41] oh you mean VE traffic -> API rates in general I think [15:36:43] the cause might be a different one [15:37:18] I'm just looking for an obvious correlation. HHVM queue/threads alert during/near code deploys, one would think wouldn't be a coincidence [15:37:53] What else changed? [15:39:08] <_joe_> bblack: the cause is probably something like some locking during stat_cache() [15:39:11] <_joe_> lemme look [15:39:55] SWATs before the HHVM messages: Restbase, api.log at 1:1 instead of 1:1000, wikidata. [15:40:48] <_joe_> !log restarting HHVM on mw1232, stuck on __lll_lock_wait from HPHP::StatCache::refresh () [15:40:56] Logged the message, Master [15:41:12] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 66241 bytes in 0.245 second response time [15:41:22] so just general buggy interaction with filesystem updates of php code files? [15:41:24] <_joe_> it's like the APC contention bug, the HHVM version [15:41:24] just a HHVM bug triggered by code changing underneath HHVM? 
[15:41:45] <_joe_> they implemented the runtime bugs as well [15:41:50] <_joe_> "parity" FTW [15:41:54] !log anomie Synchronized php-1.26wmf3/extensions/MobileFrontend: SWAT: MobileFrontend: API: "editable" is a legacy boolean, don't convert it [[gerrit:207403]] (duration: 00m 37s) [15:41:55] phuedx: ^ Test please [15:41:56] bug-for-bug [15:41:59] Logged the message, Master [15:42:31] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.179 second response time [15:43:14] <_joe_> !log restarting HHVM on mw1132 too, same reason. [15:43:20] Logged the message, Master [15:43:35] anomie: lgtm [15:43:39] FlorianSW: are you testing too? [15:43:58] phuedx: one minute :) [15:44:08] Deskana: yt? [15:44:15] !log anomie Synchronized php-1.26wmf2/extensions/MobileFrontend: SWAT: MobileFrontend: API: "editable" is a legacy boolean, don't convert it [[gerrit:207403]] (duration: 00m 23s) [15:44:16] phuedx: ^ Test to make sure it didn't break wmf2, please [15:44:20] Logged the message, Master [15:44:23] Mjbmr: You're next [15:44:30] (03PS2) 10Mobrovac: service::node: Add the list of domains for which not to use the proxy [puppet] - 10https://gerrit.wikimedia.org/r/207454 (https://phabricator.wikimedia.org/T97530) [15:44:41] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 66249 bytes in 0.448 second response time [15:44:51] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.093 second response time [15:45:04] phuedx: Request: GET http://en.wikipedia.org/w/api.php?action=mobileview&page=Manchester&prop=editable, from 10.20.0.112 via cp1067 cp1067 ([10.64.0.104]:3128), Varnish XID 1965371274 :/ WMF Error page [15:45:12] ok, what should I do? 
[15:45:24] ^ anomie: verified -- wmf2 is broken with that change [15:45:25] Mjbmr: Be ready to test it after I deploy it [15:45:35] ok [15:45:50] phuedx, anomie testwiki works fine [15:45:51] PROBLEM - Disk space on dbstore1002 is CRITICAL: DISK CRITICAL - free space: / 1325 MB (3% inode=96%) [15:46:03] anomie: as FlorianSW says, testwiki is fine [15:46:11] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [15:47:20] !log anomie Synchronized php-1.26wmf2/extensions/MobileFrontend: SWAT: Revert previous, broke stuff on wmf2 (duration: 00m 39s) [15:47:21] phuedx: Reverted. Working again on wmf2? [15:47:25] Logged the message, Master [15:47:40] RECOVERY - HHVM busy threads on mw1232 is OK Less than 30.00% above the threshold [76.8] [15:47:56] phuedx: You rang? [15:48:10] anomie: checking now [15:48:31] wait what [15:48:35] why was it backported to wmf2? [15:48:36] phuedx, anomie nope, maybe a varnish problem? [15:48:41] RECOVERY - HHVM queue size on mw1232 is OK Less than 30.00% above the threshold [10.0] [15:48:43] it only needs to be on wmf3 [15:48:50] it'll fatal on wmf2 [15:48:54] i thought it needed to be wmf3 and 2? [15:49:22] anomie, legoktm: that was my bad -- i'll revert the changes to wmf2 [15:49:27] Ooh, unit test missed the fatal ;) [15:49:47] <_joe_> we had a slight increase of load on the API cluster [15:49:49] legoktm: it really only needed to be wmf3? [15:49:55] yes [15:50:01] because the new ApiResult code is in wmf3 [15:50:06] which is why the app isn't broken right now [15:50:16] <_joe_> http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1430321917&g=network_report&z=large [15:50:17] but would be whenever the train rolls on to wikipedias [15:50:33] <_joe_> it seems to be going down though [15:50:49] gotcha [15:50:54] Oh dear. The app is totally broken right now. 
[15:50:55] in that case it was my misunderstanding [15:51:10] Deskana: ? it was reverted [15:51:31] RECOVERY - HHVM queue size on mw1132 is OK Less than 30.00% above the threshold [10.0] [15:51:37] !log anomie Synchronized php-1.26wmf2/extensions/MobileFrontend: SWAT: Resync? (duration: 00m 36s) [15:51:40] ottomata: fixed, puppet on carbon enabled and ran. [15:51:44] Logged the message, Master [15:51:46] legoktm, phuedx, anomie, Deskana still error page for api.php on enwiki [15:52:30] RECOVERY - HHVM busy threads on mw1132 is OK Less than 30.00% above the threshold [57.6] [15:53:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 0 below the confidence bounds [15:53:10] PROBLEM - configured eth on ganeti1001 is CRITICAL: tap0 reporting no carrier. [15:53:25] !log anomie Synchronized php-1.26wmf2/extensions/MobileFrontend: SWAT: Ah, git rebasing was rebasing the reverted commits on top of the revert... (duration: 00m 21s) [15:53:31] Logged the message, Master [15:53:39] :| [15:53:42] FlorianSW, phuedx, etc: Should work now [15:53:45] okay, https://en.wikipedia.org/w/api.php?action=mobileview&page=Manchester&prop=editable is good now [15:53:47] Deskana: ^ [15:53:58] anomie: yap, thanks :) [15:54:05] (03PS3) 10Anomie: Add abusefilter-modify-restricted right to sysop user group for idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206080 (https://phabricator.wikimedia.org/T96542) (owner: 10Mjbmr) [15:54:11] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206080 (https://phabricator.wikimedia.org/T96542) (owner: 10Mjbmr) [15:54:54] (03Merged) 10jenkins-bot: Add abusefilter-modify-restricted right to sysop user group for idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206080 (https://phabricator.wikimedia.org/T96542) (owner: 10Mjbmr) [15:55:31] Thank you, all. 
[15:55:35] !log anomie Synchronized wmf-config/abusefilter.php: SWAT: Add abusefilter-modify-restricted right to sysop user group for idwiki [[gerrit:206080]] (duration: 00m 25s) [15:55:37] Mjbmr: ^ Test please [15:55:40] Logged the message, Master [15:56:08] anomie: works [15:56:10] so, API load went up a bunch at 15:44 also [15:56:38] Mjbmr: You snuck another one in there? [15:56:50] yeah [15:57:11] (03PS1) 10Alexandros Kosiaris: Ignore tap interfaces in check_eth [puppet] - 10https://gerrit.wikimedia.org/r/207469 [15:57:33] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206093 (https://phabricator.wikimedia.org/T96824) (owner: 10Mjbmr) [15:57:39] (03Merged) 10jenkins-bot: Enable assigning 'accountcreator' for newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206093 (https://phabricator.wikimedia.org/T96824) (owner: 10Mjbmr) [15:58:38] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable assigning "accountcreator" for newiki [[gerrit:206093]] (duration: 00m 30s) [15:58:40] Mjbmr: ^ Test please [15:58:45] Logged the message, Master [15:58:54] hey we're still looking at https://gdash.wikimedia.org/dashboards/reqerror/ [15:59:08] I'm assuming the 5xx megaspike was from the wmf2/wmf3 confusion/breakage? 
[15:59:08] <_joe_> I find worrying both the 5xx spike [15:59:10] RECOVERY - Disk space on dbstore1002 is OK: DISK OK [15:59:11] yeah [15:59:14] <_joe_> and the 404 spike too [15:59:18] <_joe_> which is ongoing [15:59:18] 223529 error: Couldn't find constant ApiResult::META_BC_BOOLS in /srv/mediawiki/php-1.26wmf2/extensions/MobileFrontend/includes/api/ApiMobileView.php on line 233 [15:59:22] anomie: works [15:59:36] <_joe_> since ~ 15:10 UTC [15:59:41] bblack: If it was only on wikipedias, probably [15:59:43] (03PS1) 10KartikMistry: Enable Content Translation for Deployment 20150430 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207472 (https://phabricator.wikimedia.org/T97540) [15:59:59] <_joe_> what may be causing this 404 shower? [16:00:28] Re 5xx: And only for requests to api.php with action=mobileview, probably [16:00:30] (03PS1) 10KartikMistry: CX: Add languages for Deployment on 20150430 [puppet] - 10https://gerrit.wikimedia.org/r/207473 (https://phabricator.wikimedia.org/T97540) [16:00:37] * anomie is done with SWAT [16:01:08] (03CR) 10KartikMistry: [C: 04-1] "Only merge after, https://gerrit.wikimedia.org/r/207472 tomorrow! :)" [puppet] - 10https://gerrit.wikimedia.org/r/207473 (https://phabricator.wikimedia.org/T97540) (owner: 10KartikMistry) [16:01:25] <_joe_> anomie: and the 404s? [16:01:35] <_joe_> https://graphite.wikimedia.org/render/?title=HTTP%204xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.4xx,%224xxx%20resp/min%22)),%22blue%22) [16:01:54] <_joe_> I'll take a look in varnish [16:02:06] _joe_: No idea about 404s [16:03:37] _joe_: In the timeframe of the 404 spike, Restbase at 15:05 and api.log to 1:1 instead of 1:1000 at 15:08. [16:04:16] <_joe_> anomie: uhm "POST /_preconnect HTTP/1.1" [16:04:29] <_joe_> what the heck is this? [16:04:56] gwicke, James_F: ^ Could Restbase have caused the 404 spike, or that POST? 
[16:05:18] <_joe_> I see this going to the appservers [16:05:29] Yes, RB emits 404s for non-extant pages per the spec. [16:05:39] heh [16:05:51] so our _preconnect perf hack is intentionally generating 404s? :) [16:05:51] <_joe_> James_F: by POSTing to an inexistent url? [16:06:05] _joe_: Shouldn't? [16:06:25] https://phabricator.wikimedia.org/T97500 [16:06:31] preconnect was an anti-latency hack from before the $domain entrypoint [16:06:46] (to have the browser connect up to rest.wm.o ahead of user need) [16:06:50] <_joe_> bblack: anti-latency for... what? [16:06:56] <_joe_> oh, I see [16:07:07] <_joe_> but why is that going to the appservers now? [16:07:15] I have no idea [16:07:25] gwicke: ^ [16:07:27] <_joe_> I see that on the HHVM servers gwicke [16:07:28] it's hardcoded in VE to hit /_preconnect at whatever the RB host is [16:07:35] see the patches [16:07:36] <_joe_> oh, my [16:07:37] then fix it? [16:07:41] or revert that change [16:07:55] https://phabricator.wikimedia.org/T97500 [16:08:23] the VE patch to remove this is https://gerrit.wikimedia.org/r/#/c/207345/ [16:08:38] <_joe_> gwicke: well that should probably have been released before this? [16:08:44] <_joe_> dunno just suggesting :) [16:09:16] a deploy just went out and is causing a graph to explode, either hotfix it or revert that change [16:09:25] <_joe_> 60K more reqs/min to the appservers for an inexistent url seems really bad [16:09:42] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:10:08] I would be totally on board with merging & deploying this quickly [16:10:23] but ultimately it's up to the VE team [16:10:55] as a temporary measure, we could add a hack that returns quickly for /_preconnect [16:11:00] anomie: please revert [16:11:01] <_joe_> ok, so let's ask them and revert if they're not ok with that, no? 
[16:11:09] basically https://gerrit.wikimedia.org/r/#/c/207341/ [16:11:11] in reverse [16:11:33] let me prepare that [16:11:42] paravoid: Error, no context. [16:11:43] heh [16:12:00] yes, please put back the parsoid-varnish /preconnect thingy [16:12:14] if that's going to stop this, that's the obvious fix. It shouldn't have been pulled before the related VE code died [16:12:52] oh, it seems to still be there [16:12:58] am I lost in a sea of reverts? [16:13:15] bblack: it never was part of *text-lb* [16:13:16] <_joe_> bblack: no one reverted the restbase-related change [16:13:27] <_joe_> oh you mean the hack [16:13:28] <_joe_> right [16:13:34] <_joe_> it's not on text-lb [16:13:51] yeah but this all still goes RB->parsoid->API right? [16:13:58] <_joe_> and between rolling back and hot-patching text-lb, I know what I'd do [16:13:58] no [16:14:00] oh, these go direct, right [16:14:06] 19:07 < gwicke> it's hardcoded in VE to hit /_preconnect at whatever the RB host is [16:14:10] <_joe_> bblack: the client requests /_preconnect to en.wikipedia.org [16:14:17] folks, this is unacceptable [16:14:21] <_joe_> and not to restbase directly [16:14:54] are there any VE folks around? [16:15:03] James_F, RoanKattouw_away ? [16:15:08] edsanders? [16:15:08] <_joe_> anomie: the change to revert is I think https://gerrit.wikimedia.org/r/#/c/206320 [16:15:15] gwicke: how is "revert" so hard to understand? 
[16:15:19] hi [16:15:35] <_joe_> when/if that's decided [16:15:37] I'm working on the varnish workaround in parallel in any case [16:16:20] (03PS1) 10BBlack: block off the _preconnect spam from VE [puppet] - 10https://gerrit.wikimedia.org/r/207476 [16:16:24] (03PS1) 10Faidon Liambotis: Revert "Load HTML directly from RESTBase for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207477 [16:16:26] (03PS1) 10GWicke: Temporary hack to return VE's /_preconnect requests quickly [puppet] - 10https://gerrit.wikimedia.org/r/207478 [16:16:29] (03CR) 10jenkins-bot: [V: 04-1] Revert "Load HTML directly from RESTBase for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207477 (owner: 10Faidon Liambotis) [16:16:43] * anomie sees three different patches trying to fix it and decides to stay out of it [16:17:03] gee thanks [16:17:23] <_joe_> guys, please don't mess with text-varnish [16:17:54] varnish is not the best place to solve this AFAICT [16:17:54] <_joe_> srsly, no one will die if we wait for VE to be ready when we deploy this config change [16:18:20] <_joe_> and honestly, changing the vcl of text is always full of potential side effects [16:18:20] I'm not trying to solve it, I'm just wanting to stop the flow of the spam in the present [16:18:30] since other fixes/reverts do not seem to be making forward progress still heh [16:18:31] I'm fine with rolling back the config change & waiting for the VE folks to fix the _preconnect thing [16:19:41] seems like a lot of passivity going on here, who's going to step up to fix it? [16:19:49] I am. [16:19:58] thanks paravoid [16:20:01] <_joe_> paravoid: are you rebasing that change? [16:20:06] I'm trying to revert/rebase properly [16:20:13] <_joe_> ok [16:20:29] paravoid: it's a single line, might be easier to create a new patch [16:20:31] want me to? [16:20:44] paravoid: Looks like you reverted the wrong change there. 
[16:20:49] I think the most-recent change was actually enwiki => wikipedias [16:20:53] yup [16:21:04] (03PS1) 10Anomie: Revert "Load HTML directly from RESTBase on all wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207484 [16:21:10] <_joe_> greg-g: I see people trying to find out the best way to solve a non-outage-like problem, tbh [16:21:12] (03PS2) 10Anomie: Revert "Load HTML directly from RESTBase on all wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207484 [16:21:16] <_joe_> not passivity :) [16:21:18] (03CR) 10Anomie: [C: 032] Revert "Load HTML directly from RESTBase on all wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207484 (owner: 10Anomie) [16:21:24] (03Merged) 10jenkins-bot: Revert "Load HTML directly from RESTBase on all wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207484 (owner: 10Anomie) [16:21:37] (03PS2) 10Faidon Liambotis: Revert "Load HTML directly from RESTBase on all wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207477 [16:21:40] (03PS7) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) [16:21:46] oh too late? [16:21:50] <_joe_> yep [16:21:52] by a hair [16:21:54] _joe_: I might have misunderstood you/faidon's directness as "fix this now", it seemed like all response were "meh" [16:22:04] gwicke, not my area of expertise, maybe Krenair / Krinkle / RoanKattouw_away will know [16:22:06] !log anomie Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 30s) [16:22:06] There. [16:22:10] thanks. [16:22:13] Logged the message, Master [16:22:13] ? 
[16:22:22] (03CR) 10coren: [C: 031] "This version debugged and tested to work on labstore2001" [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [16:22:32] Krenair: https://gerrit.wikimedia.org/r/#/c/207345/ [16:22:48] VE is causing a lot of 404s with that _preconnect code [16:22:57] <_joe_> greg-g: eheh no it was to fix now, but I figured everyone was trying to find the best way to do it [16:23:02] (03Abandoned) 10Faidon Liambotis: Revert "Load HTML directly from RESTBase on all wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207477 (owner: 10Faidon Liambotis) [16:23:24] anomie: thanks [16:23:40] anomie: thx! [16:23:44] I suspect we're still having 404s from mediawikiwiki, testwikis, and enwiki due to this, but we should at least be returning to the baseline level from just before 15:05 today. [16:23:51] yeah [16:23:55] why was this deployed first on enwiki? [16:23:56] <_joe_> anomie: yes [16:24:00] there was a lot of confusion involved in this whole conversation in general :) [16:24:09] gwicke, approved [16:24:11] bblack: I think yeah :) [16:24:12] enwiki is a small wiki in terms of VE use [16:24:20] paravoid: it used to be all, then they pulled back to just enwiki on some theory about latencies [16:24:30] (that most enwiki users were closer to the eqiad-only RB service) [16:24:34] still, why aren't we going by the regular train group? [16:24:44] <_joe_> of course the 404s are still there [16:24:51] <_joe_> until people reload VE at least [16:24:54] <_joe_> right? [16:24:54] from enwiki, and were before the most-recent deploy [16:25:04] _joe_: it'll be five minutes until the VE cache times out [16:25:04] we just got more of it with this [16:25:13] <_joe_> gwicke: nod [16:25:43] paravoid: it's a gradual roll-out; group 0 is enabled as well [16:26:17] <_joe_> I'm off for now, I'll be back later probably [16:26:24] (am I right about this? 
did this at one point get deployed everywhere, then a graph looked ugly, then dropped back again, then back out again after the new entry point, etc?) [16:26:28] cya _joe_ [16:26:33] bblack: yes [16:26:55] but enwiki stayed enabled from the first time, or was the first to be enabled in round 2? [16:27:07] first to be enabled in round two [16:27:13] oh [16:27:17] gwicke: I think the question is why not follow the same ordering of group wikis as the train, which I think is based on the different distribution of VE usage? [16:27:23] group 0 always remained enabled [16:27:25] I see the 404 graph dropping already [16:28:00] <_joe_> yes [16:28:01] greg-g: yes, that's my understanding [16:28:01] <_joe_> :) [16:28:23] gwicke: who made the decision (it sounds like not you from that response) [16:28:41] greg-g: James_F [16:28:57] gotcha, ty [16:28:57] it's common for VE deploys, afaik [16:29:03] * greg-g nods [16:40:03] * James_F returns from meetings. [16:43:10] (03CR) 10Alexandros Kosiaris: [C: 032] Ignore tap interfaces in check_eth [puppet] - 10https://gerrit.wikimedia.org/r/207469 (owner: 10Alexandros Kosiaris) [16:43:31] (03Abandoned) 10BBlack: block off the _preconnect spam from VE [puppet] - 10https://gerrit.wikimedia.org/r/207476 (owner: 10BBlack) [16:43:47] (03PS1) 10Alexandros Kosiaris: Revert "Add the LVS blocks to url_downloader" [puppet] - 10https://gerrit.wikimedia.org/r/207489 [16:43:55] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL - elasticsearch http://10.64.0.162:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:44:00] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1245981 (10RobH) I've also created a blocked task T97545 for the reinstallation/update of the logstash1001-1003 hosts. 
[16:44:34] PROBLEM - puppet last run on logstash1005 is CRITICAL Puppet has 1 failures [16:44:44] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "Add the LVS blocks to url_downloader" [puppet] - 10https://gerrit.wikimedia.org/r/207489 (owner: 10Alexandros Kosiaris) [16:45:24] PROBLEM - ElasticSearch health check for shards on logstash1006 is CRITICAL - elasticsearch http://10.64.48.109:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:46:14] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL - elasticsearch http://10.64.16.185:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:46:35] PROBLEM - puppet last run on logstash1006 is CRITICAL Puppet has 1 failures [16:48:15] RECOVERY - puppet last run on logstash1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:49:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Using that list is error-prone is bound to be out of date. An alternative proposal is to rely on network::constants::all_networks which co" [puppet] - 10https://gerrit.wikimedia.org/r/207454 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac) [16:50:58] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Add the LVS blocks to url_downloader"" [puppet] - 10https://gerrit.wikimedia.org/r/207490 [16:52:25] RECOVERY - configured eth on ganeti1001 is OK - interfaces up [16:52:45] RECOVERY - puppet last run on logstash1005 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:52:46] (03PS1) 10Dereckson: Logo configuration on ur.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207491 (https://phabricator.wikimedia.org/T97510) [16:54:59] hi [16:55:00] https://commons.wikimedia.org/wiki/File:Mus%C3%A9e-Arles-Mosa%C3%AFques-Orph%C3%A9e.jpg [16:55:05] ??? 
[16:55:14] 0 × 0 pixels, file size: 3.72 MB [16:56:26] lol [17:01:43] (03CR) 10Mjbmr: [C: 031] Create Wikipedia Konkani [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [17:03:09] (03CR) 10Dereckson: [C: 04-1] "Community-consensus-needed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207491 (https://phabricator.wikimedia.org/T97510) (owner: 10Dereckson) [17:11:11] (03CR) 10Gage: [C: 031] Syntax: Fixed yaml [puppet] - 10https://gerrit.wikimedia.org/r/195023 (owner: 10KartikMistry) [17:18:54] I have a question about i18n, what´s the better approach nowadays? Setup separate databases per language, or use Extension, or both? [17:28:56] 6operations, 6Scrum-of-Scrums, 10incident-20150410-flowdataloss, 7database: Better backup coverage for X1 database cluster - https://phabricator.wikimedia.org/T95835#1246144 (10Mattflaschen) [17:29:35] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 10incident-20150410-flowdataloss, 7database: Better backup coverage for X1 database cluster - https://phabricator.wikimedia.org/T95835#1201504 (10Mattflaschen) [17:30:47] 6operations, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1246149 (10Cmjohnson) [17:30:48] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1246147 (10Cmjohnson) 5Open>3Resolved Added these to racktables...resolving [17:31:25] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [17:32:16] 6operations, 10Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1246150 (10RobH) Does anyone currently know the procedure for doing this these days? How much advance notice do we need to give, it will back up mailing list delivery for a short time? Wha... 
[17:33:48] 6operations, 10Wikimedia-Mailing-lists: scrub non-free PDF from list archives - https://phabricator.wikimedia.org/T95195#1246153 (10RobH) I'll also email the ops list regarding this, and see what my fellow opsen suggest. [17:36:02] robh: ^ would it be possible to just use .htaccess to http/500 that specific file? [17:37:10] sounds messier than just deleting it [17:37:15] i may be able to do without downtime [17:37:28] and then we still have the copyrighted material on the server, just blocked [17:37:39] which is still an issue (i think) [17:38:04] valhallasw`cloud: Was wikibugs sleeping last night? I created a new Phab ticket for wm-bot and it didn't post in #wm-bot [17:38:20] T13|mobile: redis woes, I think. I rebooted it and it's working now. [17:38:46] :) [17:39:03] though it annoys me to realize that bug wasn't set to private [17:39:08] as i don't want that link published, fixed task. [17:41:09] robh: hm, I guess that makes sense. Still risky with mailman and renumbering messages (although I think *clearing* messages is OK, it's just *removing* messages that breaks stuff) [17:41:21] it's a file attachment, not a message [17:41:28] so i can just rm it in archive, which i'm about to do [17:41:35] but honestly.. i don't think it will break, but i'm not sure. [17:41:43] it should just give a broken link to the file. [17:42:30] the attachments are not in the mbox file? interesting [17:42:46] heh, seems to have worked [17:42:59] file no longer exists, but archives still load... [17:43:40] 6operations, 6Security, 10Wikimedia-General-or-Unknown, 7Mail: DMARC: Users cannot send emails via a wiki's [[Special:EmailUser]] - https://phabricator.wikimedia.org/T66795#1246186 (10ori) I added a no-reply@wikimedia.org EXIM alias, routed to `:blackhole:`.
[17:43:45] ^ bblack [17:46:46] renoirb: #mediawiki-i18n [17:47:01] thx ori [17:47:23] (03CR) 10Dereckson: Allow to backup globalimagelinks table, T87571 (031 comment) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/200313 (owner: 10Kelson) [17:52:57] 7Puppet, 6Multimedia, 6Reading-Infrastructure-Team, 6Release-Engineering, and 4 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1246211 (10bd808) [17:59:43] robh: mailman is awkward with attachments. I always learned and used 'it will regen' but I'm unsure if it will in this version. I'm testing it on labs now to confirm either way [17:59:57] cool, thx for figuring it out! [18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150429T1800). Please do the needful. [18:00:12] I pulled the entire directory contents for the attachments directory for that date, but nothing in mbox [18:01:03] robh: okay. fyi mailman3 doesn't do this, is safer and more nice so :) [18:02:06] branching MediaWiki extensions for 1.26wmf4 [18:11:12] (03PS2) 10Ori.livneh: Use require_package for python-redis [puppet] - 10https://gerrit.wikimedia.org/r/202093 (owner: 10Gergő Tisza) [18:11:22] (03CR) 10Ori.livneh: [C: 032 V: 032] Use require_package for python-redis [puppet] - 10https://gerrit.wikimedia.org/r/202093 (owner: 10Gergő Tisza) [18:13:42] (03Abandoned) 10GWicke: Temporary hack to return VE's /_preconnect requests quickly [puppet] - 10https://gerrit.wikimedia.org/r/207478 (owner: 10GWicke) [18:14:10] 6operations, 10OCG-General-or-Unknown: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#1246320 (10cscott) I'll take a look, thanks for bringing it to my attention.
[18:22:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [18:24:09] 6operations, 10Citoid, 10Graphoid, 6Mobile-Apps, and 3 others: SCA services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#1246355 (10mobrovac) p:5Unbreak!>3Normal @akosiaris reverted the LVS IP block in https://gerrit.wikimedia.org/r/#/c/207489 until we come up with... [18:27:11] twentyafterfour: did you see https://gerrit.wikimedia.org/r/#/c/207459/ or do we need to update the submodule? [18:27:59] aude I did not see that .. [18:28:20] :/ [18:28:21] ok [18:28:25] 6operations, 10Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1246366 (10demon) wikimediavsnsa.org isn't all that short or easy to remember. Plus having a new domain has its own problems entirely (SSL certs, do you also buy the .coms, what about typo variants, etc?) Wha... [18:28:40] I can fix [18:28:48] ok, thanks [18:28:56] * aude needs to remember to add you as reviewer [18:34:03] (03CR) 10Ori.livneh: [C: 031] WIP: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [18:36:43] aude: for that change it'll be ok to just merge it yourself [18:36:53] I always do git pull before branching [18:36:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [18:37:16] (03PS2) 10Dzahn: Syntax: Fixed yaml [puppet] - 10https://gerrit.wikimedia.org/r/195023 (owner: 10KartikMistry) [18:37:19] twentyafterfour: ok [18:37:31] but we need a better process for this. 
I don't think we should be using this branching script at all, [18:37:32] * aude will merge it so it's there next week [18:37:34] PROBLEM - Disk space on dbstore1002 is CRITICAL: DISK CRITICAL - free space: / 1110 MB (3% inode=96%) [18:37:38] yeah :/ [18:38:06] (03CR) 10Dzahn: [C: 032] Syntax: Fixed yaml [puppet] - 10https://gerrit.wikimedia.org/r/195023 (owner: 10KartikMistry) [18:38:15] or rather - it should just be smart enough to pick up whatever is the newest branch at the time, and add that as a submodule [18:38:32] so that the script doesn't have the branch hard coded [18:38:42] would be nice :) [18:40:12] (03CR) 10Dzahn: "this patch is true wiki, it has my name but all i did was add the logo line and now it's the full config :) thanks all" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [18:40:48] (03PS2) 10Ottomata: Set up kafkatee instance on oxygen for ops webrequest log debugging [puppet] - 10https://gerrit.wikimedia.org/r/207166 (https://phabricator.wikimedia.org/T96616) [18:42:46] (03CR) 10Ottomata: [C: 032] Set up kafkatee instance on oxygen for ops webrequest log debugging [puppet] - 10https://gerrit.wikimedia.org/r/207166 (https://phabricator.wikimedia.org/T96616) (owner: 10Ottomata) [18:43:04] (03CR) 10Dzahn: "it wasn't about style, it was about enabling role-based hiera lookup for those classes per https://wikitech.wikimedia.org/wiki/Puppet_Hier" [puppet] - 10https://gerrit.wikimedia.org/r/206036 (owner: 10Dzahn) [18:44:58] (03PS1) 10Ottomata: Can't fully qualify a resource default tag [puppet] - 10https://gerrit.wikimedia.org/r/207513 [18:45:21] (03CR) 10Ottomata: [C: 032 V: 032] Can't fully qualify a resource default tag [puppet] - 10https://gerrit.wikimedia.org/r/207513 (owner: 10Ottomata) [18:45:30] (03PS3) 10Dzahn: purge webrequest logs after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) 
[18:46:34] hey mutante, maybe you know, but what is the gmond service called in jessie puppet stuff? [18:46:35] (03PS8) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) [18:46:37] (03PS1) 10coren: Labs: reconfigure LDAP to be sane on labstores [puppet] - 10https://gerrit.wikimedia.org/r/207514 [18:46:38] ganglia-monitor? [18:46:42] Could not find dependent Service[gmond] [18:47:41] (03PS1) 10Catrope: Don't use /api endpoint for RESTbase in labs, it doesn't work (404s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207516 [18:47:47] PROBLEM - puppet last run on oxygen is CRITICAL puppet fail [18:47:55] aude: https://gerrit.wikimedia.org/r/#/c/207515/ is correct? [18:48:27] (03PS2) 10coren: Labs: reconfigure LDAP to be sane on labstores [puppet] - 10https://gerrit.wikimedia.org/r/207514 (https://phabricator.wikimedia.org/T95559) [18:48:34] ottomata: hmm, that just happened as a result of a change? [18:48:55] ottomata: is that on oxygen? [18:49:00] yes [18:49:09] first time applying kafkatee in jessie, brand new install [18:49:15] all it had before was include standard [18:49:25] i'm looking in puppet for where gmond service is defined [18:49:30] not sure atm which it uses [18:49:31] ganglia_new? [18:49:31] does it stay like this after another puppet run? [18:49:43] wherever that service is [18:49:46] it should probably do [18:49:48] alias => 'gmond' [18:49:50] ottomata: that depends if it is set in hiera to use old or new [18:49:51] and yes, it stays [18:50:00] let's check that .. hold on [18:51:04] (03PS1) 10Dzahn: oxygen -> ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/207517 [18:51:06] ottomata: ^ [18:51:33] let's do this, we have to use the host-based thing for now [18:51:40] hm, it'll still fail though, no? 
[18:51:49] it will make it use another module [18:51:55] the ganglia new module just calls the service 'ganglia-monitor' [18:51:56] not gmond [18:52:06] i can just change kafkatee to subscribe to Service ganglia-monitor [18:52:30] oo, actually,i shoudl probably remove that oxygen.yaml file ! :) don't need udp2log-users anymore [18:52:34] yea, i dont know about the difference in jessie.. [18:52:35] twentyafterfour: looks ok [18:52:52] mutante: it isnt' really a jessie thing, it just looks like our puppet used to call the service gmond [18:53:00] and now both places call it ganglia-monitro [18:53:17] PROBLEM - Disk space on dbstore1002 is CRITICAL: DISK CRITICAL - free space: / 1337 MB (3% inode=96%) [18:53:19] (03PS1) 10Catrope: Disable direct RESTbase access in labs for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207518 [18:53:32] gwicke`: https://gerrit.wikimedia.org/r/207516 https://gerrit.wikimedia.org/r/207518 [18:53:37] (03PS1) 10Ottomata: Notify ganglia-monitor for kafkatee monitoring [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/207519 [18:53:38] gwicke`: TLDR everything is broken :( [18:53:50] (03PS2) 10Ottomata: Notify ganglia-monitor for kafkatee monitoring [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/207519 [18:53:55] (03CR) 10Ottomata: [C: 032 V: 032] Notify ganglia-monitor for kafkatee monitoring [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/207519 (owner: 10Ottomata) [18:54:57] (03PS2) 10Catrope: Don't use /api endpoint for RESTbase in labs, it doesn't work (404s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207516 (https://phabricator.wikimedia.org/T97558) [18:54:59] (03PS1) 10Ottomata: Update kafkatee module with ganglia-monitor service name fix [puppet] - 10https://gerrit.wikimedia.org/r/207521 [18:55:01] (03PS2) 10Catrope: Disable direct RESTbase access in labs for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207518 [18:55:20] (03CR) 10Ottomata: [C: 032 V: 032] Update kafkatee module with 
ganglia-monitor service name fix [puppet] - 10https://gerrit.wikimedia.org/r/207521 (owner: 10Ottomata) [18:57:05] (03CR) 10jenkins-bot: [V: 04-1] Don't use /api endpoint for RESTbase in labs, it doesn't work (404s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207516 (https://phabricator.wikimedia.org/T97558) (owner: 10Catrope) [18:57:11] ottomata: that depends if it is set in hiera to use old or new [18:57:21] ottomata: wrong paste [18:57:27] ottomata: ok, wouldn't you want ganglia_new regardless of that [18:57:50] dunno, do I? [18:57:51] :) [18:58:01] 10Ops-Access-Requests, 6operations, 10CirrusSearch, 6Search-Team: James Douglas sudo on Elasticsearch machines - https://phabricator.wikimedia.org/T97559#1246473 (10Manybubbles) 3NEW [18:58:21] 10Ops-Access-Requests, 6operations, 10CirrusSearch, 6Search-Team: James Douglas sudo on Elasticsearch machines - https://phabricator.wikimedia.org/T97559#1246483 (10Manybubbles) [18:58:44] (03PS1) 10Ottomata: Fix another instance of gmond to ganglia-monitor [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/207524 [18:58:57] (03CR) 10Ottomata: [C: 032 V: 032] Fix another instance of gmond to ganglia-monitor [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/207524 (owner: 10Ottomata) [18:59:29] (03PS1) 10Ottomata: Update kafkatee module with another gmond/ganglia-monitor service name fix [puppet] - 10https://gerrit.wikimedia.org/r/207525 [18:59:37] (03CR) 10Ottomata: [C: 032 V: 032] Update kafkatee module with another gmond/ganglia-monitor service name fix [puppet] - 10https://gerrit.wikimedia.org/r/207525 (owner: 10Ottomata) [19:03:20] (03Abandoned) 10Dzahn: allow role-based hiera lookup on terbium [puppet] - 10https://gerrit.wikimedia.org/r/206025 (owner: 10Dzahn) [19:03:44] (03Abandoned) 10Dzahn: antimony: include both roles with role keyword [puppet] - 10https://gerrit.wikimedia.org/r/206027 (owner: 10Dzahn) [19:03:49] (03Abandoned) 10Dzahn: argon: use role keyword for hiera lookup [puppet] - 
10https://gerrit.wikimedia.org/r/206028 (owner: 10Dzahn) [19:04:22] (03Abandoned) 10Dzahn: bast4001: use role keyword for all roles [puppet] - 10https://gerrit.wikimedia.org/r/206029 (owner: 10Dzahn) [19:04:44] (03Abandoned) 10Dzahn: horizon: use role keyword on californium [puppet] - 10https://gerrit.wikimedia.org/r/206031 (owner: 10Dzahn) [19:04:48] (03Abandoned) 10Dzahn: carbon: use role keyword to include installserver [puppet] - 10https://gerrit.wikimedia.org/r/206032 (owner: 10Dzahn) [19:04:58] (03Abandoned) 10Dzahn: eventlog1001: include all roles with role keyword [puppet] - 10https://gerrit.wikimedia.org/r/206036 (owner: 10Dzahn) [19:05:14] (03Abandoned) 10Dzahn: gallium: use role keyword to include roles [puppet] - 10https://gerrit.wikimedia.org/r/206038 (owner: 10Dzahn) [19:05:18] (03Abandoned) 10Dzahn: helium: use role keyword to include roles [puppet] - 10https://gerrit.wikimedia.org/r/206045 (owner: 10Dzahn) [19:05:48] (03Abandoned) 10Dzahn: graphite nodes: use role keyword to include roles [puppet] - 10https://gerrit.wikimedia.org/r/206046 (owner: 10Dzahn) [19:06:33] (03CR) 10Catrope: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207516 (https://phabricator.wikimedia.org/T97558) (owner: 10Catrope) [19:08:17] PROBLEM - DPKG on oxygen is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:09:42] (03CR) 10Dzahn: [C: 031] Change BZ references to Phabricator tickets in MediaWiki module [puppet] - 10https://gerrit.wikimedia.org/r/207355 (https://phabricator.wikimedia.org/T96431) (owner: 10Alex Monk) [19:10:00] (03PS5) 10BryanDavis: Add AffCom user group application contact page on meta (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [19:15:53] robh: deleted/overwritten attachments *are* regenerated and I've also successfully just made an attachment go completely from my labs install [19:16:30] so, where you want to take things from here depends on 
what I do. If you want me to throw the instructions in the ticket for removing the attachment totally or not [19:16:36] JohnFLewis: cool, can you document on the task how to do it (i'll end up copying the instructions over to wikitech with surrounding documentation later) [19:16:37] RECOVERY - DPKG on oxygen is OK: All packages OK [19:16:38] heh [19:16:44] i think we're on the same page ;D [19:16:56] will do now then [19:17:04] thank you for figuring it out, its appreciated! [19:17:26] I'm doing a documentation update sprint on Friday (so I'll include this in that) [19:17:37] for partman and phabricator and a few other things i have notes on from the past week [19:22:50] bblack, can I not use upstart on jessie? [19:24:37] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [19:24:41] (03Abandoned) 10coren: Labs: support explicit labs_lvm::swap class [puppet] - 10https://gerrit.wikimedia.org/r/207306 (owner: 10coren) [19:26:14] (03CR) 10BryanDavis: [C: 031] "Needs SWAT or other deploy window." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [19:27:57] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [19:29:03] robh: replied [19:29:56] reviewed, that makes sense to me [19:30:10] I'll have to schedule the downtime for the list server as well [19:30:20] Yeah [19:30:45] this shouldn't affect message ids but may affect attachment ids but those are dynamically changed in the archives [19:31:51] so the file being gone without regen means it can be scheduled, since its no longer accessible until the regen [19:31:58] but srsly, glad you caught the regen thing cuz that would have sucked [19:34:38] as regens are so rare, a quick 'kill the file live' is a great emergency response but is not the solution [19:35:02] since a full removal is more complicated, the hack is ideal pending an actual fix (like most things) :) [19:36:23] chasemp: hey! have a moment? [19:37:13] (03PS1) 10Aude: Enable use of subscriptions table on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207535 [19:43:17] (03CR) 10Anomie: Add AffCom user group application contact page on meta (take 2) (037 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [19:44:21] (03PS1) 10Yuvipanda: tools: Enable hba automatically for exec hosts [puppet] - 10https://gerrit.wikimedia.org/r/207537 [19:44:22] Coren: ^ thoughts? [19:44:41] YuviPanda: Why not hiera? [19:45:02] Coren: because we don’t have a clean way to apply based on role in labs yet. [19:45:16] yannf: here [19:45:19] Ah, right, the role bit doesn't work there. :-( [19:45:20] Coren: so ideally we’d set it to be true for everything except for the ones in infrastructure role [19:45:30] Coren: but we can’t really do that. [19:45:31] YuviPanda: here :) [19:45:52] Well no, it's okay to have hba=yes in infrastructure. 
[19:46:10] Coren: don’t we want hba only for the exec nodes? [19:46:20] chasemp: hi! switching channels... [19:46:45] It's *necessary* for the exec nodes, but not having to forward keys to tools is useful even for us. [19:49:09] (03PS1) 10Dzahn: dumps: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/207542 [19:51:34] (03CR) 10Dzahn: [C: 032] dumps: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/207542 (owner: 10Dzahn) [19:54:47] (03PS1) 1020after4: Remove 1.25wmf22 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207544 [19:54:49] (03PS1) 1020after4: Add 1.26wmf4 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207545 [19:54:51] (03PS1) 1020after4: Wikipedias to 1.26wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207546 [19:54:53] (03PS1) 1020after4: Group0 to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207547 [19:56:10] (03CR) 1020after4: [C: 032] Remove 1.25wmf22 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207544 (owner: 1020after4) [19:56:15] (03CR) 1020after4: [C: 032] Add 1.26wmf4 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207545 (owner: 1020after4) [19:56:17] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[19:57:57] (03Merged) 10jenkins-bot: Remove 1.25wmf22 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207544 (owner: 1020after4) [19:57:59] (03Merged) 10jenkins-bot: Add 1.26wmf4 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207545 (owner: 1020after4) [19:58:03] (03CR) 10Tim Landscheidt: "I don't know anything about the NFS part, but for rsync the uids do not need to exist when --numeric-ids is used:" [puppet] - 10https://gerrit.wikimedia.org/r/207514 (https://phabricator.wikimedia.org/T95559) (owner: 10coren) [19:58:22] (03PS2) 10Ori.livneh: Handle 204s consistently across Varnish roles [puppet] - 10https://gerrit.wikimedia.org/r/206351 [19:59:15] ottomata: you cannot [19:59:42] bd808, you're planning to handle anomie's comments right? [19:59:48] aye, thanks, will have to figure out systemd for kafkatee now :) [20:00:10] Krenair: yeah. he and I will see what we can fix up [20:00:25] bblack: https://gerrit.wikimedia.org/r/#/c/206351/ [20:00:31] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150429T2000). [20:00:42] thanks [20:02:09] ottomata: use service_unit [20:02:21] in puppet? [20:02:48] i'm trying to install kafkatee in jessie, rebuilt the package for jessie, buuuuut, it has an upstart init file, so i need to make a systemd one, ja? [20:02:53] * twentyafterfour is still deploying, running long as usual [20:02:55] yes [20:02:58] aye, will do [20:03:03] bblack, got a good example for me to work from? [20:03:04] !log ori Synchronized README: testing deploy script (duration: 00m 25s) [20:03:08] something we already ahve written? 
[20:03:13] Logged the message, Master [20:03:32] yeah service_unit is a puppet thing to semi-abstract sysvinit/upstart/systemd, but not at the actual initscript/file level, just at the puppet level [20:03:48] !log ori Synchronized README: testing deploy 2 (duration: 00m 22s) [20:03:54] Logged the message, Master [20:04:09] as for unit file examples: [20:04:10] bblack-mba:puppet bblack$ find . -name *.systemd.erb [20:04:10] ./modules/base/spec/fixtures/templates/initscripts/nginx.systemd.erb [20:04:13] ./modules/memcached/templates/initscripts/memcached.systemd.erb [20:04:16] ./modules/varnish/templates/initscripts/varnish.systemd.erb [20:04:18] ./modules/varnishkafka/templates/initscripts/varnishkafka.systemd.erb [20:04:22] oh in puppet. [20:04:24] hm, ok. [20:04:35] (03CR) 10Dzahn: [C: 032] dumps: improve nginx disk utilisation via directio [puppet] - 10https://gerrit.wikimedia.org/r/190940 (owner: 10Ori.livneh) [20:04:44] why are those in puppet, and not the packages? [20:04:50] i should put this into the jessie kafkatee package, no? [20:05:40] ideally, yes [20:05:55] and have the package only install the correct one, obviously :) [20:06:17] (03PS1) 10Dzahn: Revert "dumps: improve nginx disk utilisation via directio" [puppet] - 10https://gerrit.wikimedia.org/r/207549 [20:06:32] (03CR) 10Dzahn: [C: 032] Revert "dumps: improve nginx disk utilisation via directio" [puppet] - 10https://gerrit.wikimedia.org/r/207549 (owner: 10Dzahn) [20:06:52] (03CR) 10Dzahn: [V: 032] Revert "dumps: improve nginx disk utilisation via directio" [puppet] - 10https://gerrit.wikimedia.org/r/207549 (owner: 10Dzahn) [20:09:28] (03CR) 10Dzahn: "unfortunately breaks. unknown directive "aio"" [puppet] - 10https://gerrit.wikimedia.org/r/190940 (owner: 10Ori.livneh) [20:09:46] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [20:10:03] uh oh [20:10:27] ^ did something known cause that? 
[20:10:37] PROBLEM - Parsoid on wtp1002 is CRITICAL - Socket timeout after 10 seconds [20:10:57] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [20:10:57] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [20:11:04] arlolra, requires a revert. [20:11:07] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [20:11:10] all I see is the start-of-SWAT accounce and ori's README test-deploys [20:11:18] PROBLEM - Parsoid on wtp1003 is CRITICAL - Socket timeout after 10 seconds [20:11:33] but yes, please revert if that's a possible thing in respect to the above [20:11:41] 10 minutes ago i see jouncebot, but where is a merge? [20:12:01] something crash parsoid? [20:12:06] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused [20:12:07] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [20:12:17] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [20:12:17] PROBLEM - Parsoid on wtp1009 is CRITICAL - Socket timeout after 10 seconds [20:12:17] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [20:12:20] I think we're just missing the log-line, which normally gets !log'd Elsewhere to appear here [20:12:27] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [20:12:27] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused [20:12:27] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [20:12:28] 13:01 < jouncebot> gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150429T2000). [20:12:32] chasemp: ^ that [20:12:46] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [20:12:47] PROBLEM - Parsoid on wtp1010 is CRITICAL: Connection refused [20:12:50] yes but we normally also see a logmsgbot for each individual thing pushed out [20:12:52] <_joe_> hey, deploy problems [20:12:55] <_joe_> ? 
[20:12:57] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused [20:12:57] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused [20:12:58] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: Connection refused [20:12:58] https://www.mediawiki.org/wiki/Parsoid/Deployments#Wednesday.2C_April_29.2C_2015_around_1pm_PST:_45b54f63_to_be_deployed [20:13:00] yes, reverting. [20:13:03] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused [20:13:05] <_joe_> ok [20:13:06] i wonder where the actual change is linked? [20:13:08] <_joe_> sorry [20:13:18] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [20:13:27] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [20:13:28] bblack: only for scap/sync-*. Trebuchet doesn't automatically announce [20:13:47] PROBLEM - Parsoid on wtp1024 is CRITICAL: Connection refused [20:13:56] hey [20:13:56] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [20:13:57] what's up? [20:14:06] bd808: yeah but isn't it a normal part of the process for humans to do so? 
[20:14:06] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [20:14:14] bblack: it should be, yes [20:14:14] mark: bad parsoid-related code deploy, they're reverting [20:14:37] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.017 second response time [20:14:40] !log twentyafterfour Started scap: testwiki to php-1.26wmf4 and rebuild l10n cache [20:14:47] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.008 second response time [20:14:47] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.023 second response time [20:14:48] Logged the message, Master [20:14:53] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.012 second response time [20:14:53] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.020 second response time [20:15:06] parsoid has high CPU but low network [20:15:07] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.016 second response time [20:15:16] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.015 second response time [20:15:19] it seems to be back-ish [20:15:36] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.016 second response time [20:15:36] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.022 second response time [20:15:37] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.035 second response time [20:15:37] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.017 second response time [20:15:37] reverted [20:15:47] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.025 second response time [20:15:47] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.016 second response time [20:15:48] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: 
HTTP/1.1 200 OK - 1086 bytes in 0.019 second response time [20:15:57] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.036 second response time [20:15:57] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.008 second response time [20:15:57] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.045 second response time [20:16:07] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.013 second response time [20:16:07] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.017 second response time [20:16:12] time to check the logs to see what went wrong. [20:16:17] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.018 second response time [20:16:17] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.020 second response time [20:16:26] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.010 second response time [20:16:26] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.022 second response time [20:16:30] can we get a link to whatever was pushed then reverted? [20:16:39] please^ [20:16:40] yes. will do. [20:17:22] !log attempted deploy of 45b54f63 (failed) [20:17:28] Logged the message, Master [20:17:34] !log reverted deploy to ebdac59b [20:17:39] https://gerrit.wikimedia.org/r/#/c/207012/ [20:17:40] Logged the message, Master [20:17:47] (from the sha) [20:17:53] arlolra, can you check the logs on any of the nodes to see what happened? 
[20:18:07] yeah, has to do with the freezing [20:18:18] {"name":"parsoid","hostname":"wtp1004","pid":32326,"level":60,"logType":"fatal","process":{"name":"worker","pid":32326},"msg":"uncaught exception Cannot assign to read only property 'host' of [object Object]","stack":"TypeError: Cannot assign to read only property 'host' of [object Object]\n at /srv/deployment/parsoid/deploy/node_modules/node-txstatsd/index.js:79:27\n at asyncCallback (dns.js:68:16)\n at Object.onanswer [as oncomple [20:18:18] (dns.js:121:9)","longMsg":"uncaught exception\nCannot assign to read only property 'host' of [object Object]","time":"2015-04-29T20:14:34.115Z","v":0} [20:18:37] so, some code is mutating then. [20:18:42] ok --> #mediawiki-parsoid [20:19:17] k [20:19:24] subbu: can you guys do an outage writeup? [20:19:44] yes. [20:21:33] (03PS2) 10Yuvipanda: tools: Make webservice read default server service manifest [puppet] - 10https://gerrit.wikimedia.org/r/207380 (https://phabricator.wikimedia.org/T94788) [20:21:41] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Make webservice read default server service manifest [puppet] - 10https://gerrit.wikimedia.org/r/207380 (https://phabricator.wikimedia.org/T94788) (owner: 10Yuvipanda) [20:24:29] ori: i tried your "dumps: improve nginx disk utilisation via directio" but unfortunately didn't work yet [20:24:58] ori: unknown directive "aio". like missing a module? [20:25:07] the config itself looked good from docs [20:26:01] greg-g: once the train and SecurePoll fixes are deployed, can I deploy a fix for global rename to stop making pages disappear? https://gerrit.wikimedia.org/r/207557 it already made it into wmf4, just needs wmf3 [20:26:21] so just now, sync proxies only listed one proxy to sync? isn't that supposed to be like 12 [20:27:41] mutante: what version of nginx (or OS) is the dumps server? [20:28:39] legoktm: I can't speak for greg but I don't see why it wouldn't be ok? 
[20:29:02] (03PS1) 10coren: Labs: Make sure that nfs-no-idmap gets applied first [puppet] - 10https://gerrit.wikimedia.org/r/207559 [20:29:07] mutante: yes, that's why I hadn't merged it. I should have -1'd it, sorry [20:29:11] YuviPanda: legoktm: ^^ [20:29:20] bblack: 1.1.9 [20:29:32] faidon and i tested it [20:29:34] i see other reports where people have nginx compiled with --with-file-aio [20:29:34] mutante: usually nginx modules are static, and probably precise's version or build just doesn't have it [20:29:35] but i didn't update the patch [20:29:40] (03CR) 10Yuvipanda: [C: 031] Labs: Make sure that nfs-no-idmap gets applied first [puppet] - 10https://gerrit.wikimedia.org/r/207559 (owner: 10coren) [20:29:42] but it still says it cant find aio [20:30:09] (03CR) 10coren: [C: 032] "Simple ordering fix." [puppet] - 10https://gerrit.wikimedia.org/r/207559 (owner: 10coren) [20:30:24] ori: bblack: it's not compiled with TLS SNI support enabled [20:30:25] configure arguments: --prefix=/etc/nginx --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-log-path=/var/log/nginx/access.log --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --lock-path=/var/lock/nginx.lock --pid-path=/var/run/ngin [20:30:34] argg, i didn't want to do that [20:30:38] legoktm: sure thing [20:30:56] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.020 second response time [20:30:59] it's not compiled with --with-file-aio [20:31:02] ori: [20:31:12] twentyafterfour: you know, maybe we should fix that (you speaking for me, not a bad idea ;) ) [20:31:42] lol [20:32:01] I think accidentally did a time or two. [20:32:10] heh. 
I do too [20:32:12] (03Abandoned) 10Catrope: Disable direct RESTbase access in labs for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207518 (owner: 10Catrope) [20:32:23] (03Abandoned) 10Catrope: Don't use /api endpoint for RESTbase in labs, it doesn't work (404s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207516 (https://phabricator.wikimedia.org/T97558) (owner: 10Catrope) [20:32:25] the role is more traffic cop than voice of god [20:32:36] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.031 second response time [20:32:56] ok so sync-common seems to be doing fine even though there was only one proxy? [20:33:13] what happened to the proxy list? Puppet fubar? [20:33:40] bd808: I don't know. I just saw this: sync-proxies: 100% (ok: 1; fail: 0; left: 0) [20:34:04] puppet fubar! [20:34:12] mw1010.eqiad.wmnet\nmw1033.eqiad.wmnet\nmw1070.eqiad.wmnet\nmw1097.eqiad.wmnet\nmw1216.eqiad.wmnet\nmw1161.eqiad.wmnet\nmw1201.eqiad.wmnet\nmw2001.codfw.wmnet\nmw2041.codfw.wmnet\nmw2080.codfw.wmnet\nmw2119.codfw.wmnet\nmw2187.codfw.wmnet [20:34:26] I thought it selected the proxies dynamically [20:34:38] !log /etc/dsh/group/scap-proxies is borken on tin [20:34:45] Logged the message, Master [20:34:49] it reads that file [20:34:56] (03CR) 10Yurik: [C: 031] Prevent new wikis from using Graph: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 (owner: 10Dereckson) [20:34:57] is this because of the "move dsh to hiera" patches? [20:35:06] which I think ^d converted to some new puppet magic? [20:35:30] <^d> Grrr. [20:35:33] bd808: did… it do the \n literally? [20:35:34] there were 2 things: ^d moving it to hiera and _joe_ suggesting they be generated [20:35:38] not sure if one was merged [20:35:44] so uhm ... how is it working if that file is screwed ? 
[20:35:44] YuviPanda: yes [20:35:48] <^d> It hasn't moved to hiera yet (that wasn't merged) [20:35:54] <^d> This was just the join() change [20:36:12] <^d> Which is obvs broken. [20:36:21] twentyafterfour: everything will fall back to syncing from tin [20:36:37] heh ok [20:36:42] https://gerrit.wikimedia.org/r/#/c/206132/ [20:36:42] <^d> Uno momento [20:36:44] not sure what happens with the ssh command to run the sync itself [20:36:44] <^d> Patch coming [20:36:45] seems faster than I would expect in that case [20:36:45] (03CR) 10BBlack: "I don't think this is actually necessary. In the combined context of what vcl_error looks like in practice, "return (deliver)" is always " [puppet] - 10https://gerrit.wikimedia.org/r/206351 (owner: 10Ori.livneh) [20:37:15] sync-common: 16% (ok: 80; fail: 0; left: 396) [20:37:23] I guess it's a bit slower [20:37:47] it will go really slow when you get tin swamped out in io and syncing to codfw [20:37:53] should I abort and try again with proper proxies? [20:37:58] (03PS1) 10Chad: Followup I5303c527: Use actual \n and not fake \n when joining [puppet] - 10https://gerrit.wikimedia.org/r/207615 [20:38:08] is ctrl+c safe during sync-common? [20:38:11] <^d> YuviPanda: ^ [20:38:32] twentyafterfour: safe-ish. you haven't bumped the verison yet right? [20:38:33] booo pupppet [20:38:42] *version [20:38:47] (03CR) 10Yuvipanda: [C: 032] Followup I5303c527: Use actual \n and not fake \n when joining [puppet] - 10https://gerrit.wikimedia.org/r/207615 (owner: 10Chad) [20:39:11] running puppet on tin now [20:39:35] should we maybe have a test rule check for '\n' in puppet code and output a warning? I guess there are valid cases for the literal '\n' [20:39:58] bd808: I was just bumping the version of testwiki [20:40:22] oh, just a sync-file? 
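Note on the scap-proxies breakage above (20:34 to 20:39): the file on tin held every hostname on a single line, separated by the two characters backslash and n, which fits sync-proxies reporting only one proxy. The sketch below is a Python stand-in for that join() mistake, not the actual Puppet change; the function names and the shortened host list are assumptions made for illustration. It also includes a simple check in the spirit of the "warn on a literal '\n'" idea raised at 20:39:35.

```python
# Illustrative only: a Python analogue of the join() bug, not the Puppet code.
# All function names here are hypothetical.
HOSTS = ["mw1010.eqiad.wmnet", "mw1033.eqiad.wmnet", "mw2001.codfw.wmnet"]


def render_broken(hosts):
    # Joins with two characters (a backslash followed by 'n'), not a newline.
    return "\\n".join(hosts)


def render_fixed(hosts):
    # Joins with a real newline character, one host per line.
    return "\n".join(hosts)


def count_entries(text):
    # A reader that treats every line of the file as one host.
    return len(text.splitlines())


def has_literal_backslash_n(text):
    # Flags generated output that still contains a backslash followed by 'n'.
    return "\\n" in text


if __name__ == "__main__":
    print(count_entries(render_broken(HOSTS)))
    print(count_entries(render_fixed(HOSTS)))
```

As the log itself points out, a check like has_literal_backslash_n would also flag legitimate uses of the literal sequence, so it suits a warning better than a hard failure.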
[20:40:28] let that go it will be fine [20:40:33] bd808: no it's syncing the new branch [20:41:09] scap "testwiki to php-$VERSION and rebuild l10n cache" [20:41:17] I'd kill the scap and start over [20:41:21] ok [20:41:30] test wiki might be sad for a bit but ... testwiki [20:41:33] !log twentyafterfour scap aborted: testwiki to php-1.26wmf4 and rebuild l10n cache (duration: 26m 52s) [20:41:39] Logged the message, Master [20:41:51] !log twentyafterfour Started scap: testwiki to php-1.26wmf4 and rebuild l10n cache - attempt #2 [20:41:59] Logged the message, Master [20:42:08] the scap-proxies file looks good now twentyafterfour, YuviPanda, ^d [20:42:17] sorry about that, everyone [20:42:23] <^d> yeah that ^ [20:42:47] it's ok :) [20:43:59] twentyafterfour: gald you said something before you waited 90 minutes for it to finish :) [20:44:55] I actually said something when it started but there was a parsoid deploy problem and I didn't want to spam the channel further [20:45:27] ok so there are 6 proxies this time [20:45:37] oh no it's 12 [20:46:21] cool. all good, should be only 20-30 minutes now I guess [20:50:17] so ori, ^d: do you want to sync your patch right after this scap finishes? I can finish up the version bump after you do that if you want [20:52:13] (03PS1) 10Yurik: Enable fallback graphoid service for non-js client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207631 [20:53:06] oohh it's almost done already :) [20:54:53] greg-g, hi, graphoid is finally in production, can i go after swat to enable it? [20:55:56] we've already had multiple service-related deployment snafus today, best be sure that's safe... 
[20:56:18] yurik: I have no idea what that means, honestly [20:56:29] yurik: and I don't see it on the #roadmap project nor on the Deployments calendar [20:56:36] I'm not a huge fan of last minute surprises [20:56:50] (03CR) 1020after4: [C: 032] Wikipedias to 1.26wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207546 (owner: 1020after4) [20:56:55] (03Merged) 10jenkins-bot: Wikipedias to 1.26wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207546 (owner: 1020after4) [20:57:02] greg-g, it has been in production roadmap since forever it seems, let me give you the links [20:57:12] https://phabricator.wikimedia.org/project/sprint/board/1109/query/open/ [20:57:30] 10Ops-Access-Requests, 6operations, 10CirrusSearch, 6Search-Team: James Douglas sudo on Elasticsearch machines - https://phabricator.wikimedia.org/T97559#1247057 (10Dzahn) p:5Triage>3Normal [20:58:45] greg-g, sorry, didn't know about that board - the graph extension has been enabled on many sites, and briefly it was even enabled on one of the language wikis - but we pulled it because Eloquence requested that we deploy backup graphoid service first (renders graphs on the server for non-javascript clients). Now the graphoid is done, so i would like to re-enable the graph extension. [20:58:47] https://www.mediawiki.org/wiki/Extension:Graph#Graphoid_Service [20:59:22] ^d: noticed we have admin groups "search-roots" and also "elasticsearch-roots" and same members [20:59:25] sounds like a great thing to schedule for next week [20:59:27] greg-g, https://phabricator.wikimedia.org/T90487 [21:00:01] <^d> mutante: search-roots was lsearchd [21:00:05] <^d> can be killed afaict [21:00:07] yurik: is it running in beta cluster? [21:00:12] ^d: *nod* ok! 
[21:00:51] greg-g, graphoid is running in production -- http://graphoid.wikimedia.org/mediawiki.org/v1/png/Extension:Graph/0/be66c7016b9de3188ef6a585950f10dc83239837.png [21:00:52] <^d> Same with search-users [21:01:03] sync-common: 99% (ok: 464; fail: 0; left: 1) [21:01:14] yurik: not answering my question [21:01:31] yurik: is it running in beta cluster/have you tested it there? does it get tested there? [21:01:37] (03PS1) 10Dzahn: admin: delete search-roots and search-users [puppet] - 10https://gerrit.wikimedia.org/r/207633 [21:01:38] ^d: yep, ^ [21:01:41] greg-g, this is the live graph coming from https://www.mediawiki.org/wiki/Extension:Graph . I'm not sure if it was enabled in beta, no [21:01:58] (03CR) 10Chad: [C: 031] admin: delete search-roots and search-users [puppet] - 10https://gerrit.wikimedia.org/r/207633 (owner: 10Dzahn) [21:02:07] gwicke, ^^^ [21:02:11] ^d: but it's also called "elastic search testing" [21:02:15] yurik: you're responsibility [21:02:20] twentyafterfour: snapshot1004.eqiad.wmnet is the last one syncing [21:02:38] * YuviPanda takes away greg-g’s ‘s [21:02:47] greg-g, might be hard to check - many people were involved in deployment, let me check. [21:02:47] yurik: so, no. not today. Set it up in Beta Cluster, test it there (as with everything we do) then schedule a non-SWAT time to enable it next week [21:02:48] mutante: was the ssh change you merged yesterday about ciphers? [21:03:17] gwicke, do you know anything about graphoid on betalabs? [21:03:20] YuviPanda: yes [21:03:43] yurik: not much, no [21:03:44] yurik: https://wikitech.wikimedia.org/wiki/Labs_labs_labs#Beta_Cluster :P [21:03:52] 10Ops-Access-Requests, 6operations, 10CirrusSearch, 6Search-Team: James Douglas sudo on Elasticsearch machines - https://phabricator.wikimedia.org/T97559#1247092 (10Dzahn) noticed we have multiple groups: search-roots search-users elasticssearch-roots but the first 2 were for old search: https://gerrit.... 
[21:04:01] !log load avg on snapshot04 11.11; scap slow waiting on it [21:04:04] yurik: ping me after the RFC meeting [21:04:07] Logged the message, Master [21:04:20] YuviPanda: thanks, btw, I hate when I do you're/your mistakes [21:04:23] greg-g, yes, i know - but i was not involved in the prod deployment - only was told today that it is live [21:04:30] i will try to find out [21:04:33] yurik: you wrote the extension, yes? [21:04:35] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1247096 (10BBlack) [21:04:40] (03CR) 1020after4: "alexandros: is your -1 because of the group/mode defaults? I really feel that sane defaults make a lot more sense than specifying redunda" [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) (owner: 1020after4) [21:04:42] greg-g, extension and service, yes [21:04:51] yurik: and I trust you aren't just testing in production? [21:04:52] (03CR) 1020after4: [C: 031] Move maniphest status settings into custom/wmf-defaults.php [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) (owner: 1020after4) [21:04:53] extension has been in production since before forever :) [21:04:53] or are you? [21:04:53] mutante: apper in -labs is having issues with it, can you look at it [21:04:55] YuviPanda: any issues? we tested on all the distros versions we use [21:05:26] yurik: yeah but not with it hitting graphoid service behind itself, right? [21:05:26] greg-g, extension is live as of last year - zero and a few other non-lang sites are using it for production [21:05:46] bblack, correct - graphoid service is purelly as a backup for non-js clients though, but yes, not in prod [21:05:55] YuviPanda: yes [21:06:16] bblack, greg-g, would it make sense to enable graphoid at first only for the sites that already have graph ext enabled? 
[21:06:25] sorry I was being unclear, but I meant "no previous deploy of anything graph-related has ever trialed live traffic -> graph extension -> graphoid" [21:06:27] yurik: so, it sounds like before you can enable this the service needs to be setup in beta cluster then tested there then enabled in prod. [21:06:28] mutante: http://deployment.wikimedia.beta.wmflabs.org/wiki/Main_Page doesn't have an https version ? [21:06:44] matanya: nope [21:06:56] twentyafterfour: if you are tired of waiting you can kill the ssh connection for sync-common to snapshot1004 on tine [21:06:57] greg-g: by design ? [21:06:58] ssl certs are expensive and ohmygodthatsaganeverdies [21:07:04] matanya: because beta doesn't have ssl certs [21:07:05] :) [21:07:07] greg-g, but the graphoid service is already live in production, and since it is very very simple (url->image), there is not much more to test [21:07:20] matanya: could use self-signed, not sure why not [21:07:28] huge ticket ... [21:07:32] yurik: if I let you do this now what incentive do you have to set it up correctly in beta cluster? [21:07:34] greg-g: 5$ for cheap ssl, 0 for self signed [21:07:34] nothing is simple when a slight configuration error somewhere could result in millions of request in a very short period of time [21:07:48] greg-g, my promise? :D [21:08:08] matanya: https://phabricator.wikimedia.org/T70387 [21:08:11] matanya: https://phabricator.wikimedia.org/T50501 [21:08:12] matanya: there are some other related woes as well [21:08:13] bblack, true, but beta labs would not find that error [21:08:21] yurik: the short answer is: I'm getting tired of (not just your) all these exceptions people keep making for themselves to push things out quicker without caring about testing [21:08:21] it would in some cases! [21:08:49] yurik: summary: set it up in beta cluster correctly, then do it next week. [21:08:54] ok [21:08:55] tbh, somewhat freaks me out to type in my password when no ssl in place [21:08:56] thanks! 
[21:09:09] matanya: don't use your prod password [21:09:22] matanya: accounts on beta cluster should be junk accounts with junk passwords [21:09:23] lack of ssl is kindof a big deal though [21:09:23] greg-g: I must, testing stewards tools [21:09:36] stewards tools in beta? [21:09:40] matanya: so it's hitting prod? [21:09:48] global renames [21:09:49] there is no SUL from prod to beta [21:09:56] there is [21:09:58] * greg-g is kinda scared all of a sudden :) [21:10:04] (more than previously) [21:10:05] no there is not [21:10:08] or atleast, it worked [21:10:16] I just logged in [21:10:39] you may have an account there that you created with the same password [21:10:54] Please don't tell me you're using the same password in beta as you are in prod... [21:11:01] but beta cluster is definitely not connected to prod SUL [21:11:59] wow, the tickets re beta+SSL are ridiculously long and complicated history at this point [21:12:02] Krenair: I don't think so, I just used a password stored in my browser, let me check [21:12:13] does someone have a shorter statement of what the current blocker actually is? [21:12:27] 10:50, 17 July 2012 User account Matanya (Talk | contribs | block) was created [21:12:56] bblack: cost + how to secure the private cert in beta cluster [21:13:15] bblack: 1) cost of *.*.*.wmflabs.org real certs 2) how to secure them when everyone's got sudo [21:13:26] because self-signed doesn't really help [21:13:27] matanya, note that shell access to the beta cluster is not hard to get, you don't even need an NDA. you should never trust the password to your steward account to it [21:13:37] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [21:13:51] ok 1) seems like bs to me. Can we not flatten to x-y-z-w-a-b.beta.wmflabs.org + a single *.beta.wmflabs.org cert? 
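Note on the flattening proposal above (21:13:51): a wildcard certificate name covers exactly one leftmost DNS label, so a multi-level name under beta.wmflabs.org is not covered by *.beta.wmflabs.org unless its extra labels are collapsed. The sketch below is a simplified model of that matching rule plus a flattening helper. The example hostname en.wikipedia.beta.wmflabs.org and both function names are assumptions for illustration, not names taken from the Beta Cluster configuration.

```python
# Simplified model: the wildcard must be the entire leftmost label.
# Function names and the example hostname are hypothetical.


def wildcard_matches(pattern, hostname):
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    # A wildcard replaces exactly one label, so label counts must be equal.
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        return host_labels[0] != "" and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


def flatten(hostname, base):
    # Collapse every label to the left of `base` into one hyphenated label.
    suffix = "." + base
    if not hostname.endswith(suffix):
        raise ValueError("hostname is not under " + base)
    prefix = hostname[: -len(suffix)]
    return prefix.replace(".", "-") + suffix


if __name__ == "__main__":
    name = "en.wikipedia.beta.wmflabs.org"
    print(wildcard_matches("*.beta.wmflabs.org", name))
    print(flatten(name, "beta.wmflabs.org"))
```

The cost is the one already raised in the discussion: flattened names add another difference between Beta Cluster and production hostnames.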
[21:13:55] Krenair: not the same one [21:13:59] ok, good [21:14:09] self signed cert would be better than nothing wouldn't it? [21:14:14] bblack: i'm afraid the answer is nobody has that kind of summary [21:14:17] yes, it would [21:14:23] I mean, not for security but for testing ssl [21:14:37] if the flattening thing isn't that hard, it would be even better just to go buy a single cert and be done. [21:14:42] because without ssl available beta is a flawed testing ground [21:14:43] bblack: yet another prod/beta diff? :P [21:14:54] we already have diffs there, the hostnames will never match [21:15:00] (and don't today) [21:15:04] hey, fun: (Cannot access the database: Can't connect to MySQL server on '10.68.16.193' (4) (10.68.16.193)) [21:15:05] !log twentyafterfour Finished scap: testwiki to php-1.26wmf4 and rebuild l10n cache - attempt #2 (duration: 33m 13s) [21:15:12] Logged the message, Master [21:15:24] so no beta for me [21:15:52] matanya: just got that as well [21:15:59] I mean used in labs you can't actually trust them for authentication but then again, at least it's one extra good layer [21:16:05] matanya: see -releng [21:16:09] bblack: greg-g: we have CA, why cant we use it to make certs for beta? [21:16:19] that's neither self-signed nor paying [21:16:22] because it won't work by default everywhere [21:16:25] mutante: how to secure them? [21:16:37] that is self-signing, it's just signing-with-self-CA. [21:16:41] greg-g: just like we do with domainproxy for *.wmflabs.org ? [21:16:42] working by default everywhere isn't mandatory [21:16:59] I'd rather we just deploy this securely [21:17:00] espcially for testing hosts [21:17:13] mutante: bblack twentyafterfour feel free to create a new task with better ideas :) (seriously) [21:17:13] bblack: yea, but is it not reasonable to ask beta testers to import a WMF CA once? [21:17:15] isn't it impossible to do anything in beta securely? [21:17:18] how broad is beta roots anyways? 
we can't trust them with not screwing with the beta cert? [21:17:21] and still better than regular self-signed [21:17:52] one CA import, that should be trivial for testers [21:17:54] bblack: I think the general idea is that no one watches roots or labs access closely [21:17:59] bblack: https://wikitech.wikimedia.org/wiki/Special:NovaSudoer [21:18:15] it is not like you are asking a random editor to do that [21:18:16] bah, stupid openstackmanager [21:18:42] either way, worst case they "compromise" beta access, it's not the end of the world. congrats, now if you also have a MITM spot, you can hijack our global beta traffic. [21:18:55] * matanya forgets about testing global merge today [21:18:59] or cause us to spend some money to reissue it and re-evaluate sec [21:19:18] maybe the issue is conflated, a cert for likeness to prod and not for authentication is still valuable? [21:19:20] bblack: https://phabricator.wikimedia.org/P577 [21:19:23] maybe it ticks all necessary boxes [21:19:56] it's even valuable if there were invalid certs, at least then services would start for testing vs. just breaking [21:19:57] none of those sound like crazy hax0rz to me [21:20:17] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.26wmf3 [21:20:24] Logged the message, Master [21:20:26] what is this list greg-g ? NDA's ? [21:20:27] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [21:20:31] ok I'm gonna make a new ticket, because sometimes tickets are like standards ( https://xkcd.com/927/ ) [21:20:32] matanya: people with sudo [21:20:35] I do understand the invalid cert hate tho, it's a terrible pain to overcome cert errors for validation testing in many cases [21:20:42] matanya: on beta cluster [21:21:09] chasemp: it's stell less bad then nginx/apache not starting at all? 
[21:21:12] than [21:21:30] I don't know, up to beta ppl [21:21:45] (03CR) 1020after4: [C: 032] Group0 to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207547 (owner: 1020after4) [21:21:50] (03Merged) 10jenkins-bot: Group0 to 1.26wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207547 (owner: 1020after4) [21:21:54] hah [21:21:56] I think we can push through the issues and rename some beta hostnames and get a real cert and not 20 of them [21:21:59] we'll see [21:22:46] idk honestly but I have done the self CA for beta/staging and advantage was once you trusted the ca you could easily and cheaply mirror prod [21:22:47] in any case, I now have more than a passing interest because I want to refactor the related puppet code for nginx/ssl stuff soon, and the diff for beta there is a real thorn [21:22:59] bblack: i serious question: why not call it beta.en.wikipedia.org [21:22:59] whereas doing hoops to keep cert costs / trouble down is more pain [21:23:05] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.26wmf4 [21:23:12] what I don't like about self-sign/self-CA is people will run into it when testing with arbitrary devices/browsers unless they can get the CA installed there [21:23:12] Logged the message, Master [21:23:28] no doubt it's a trade off [21:23:37] we already spent the cert cost by paying people to update that ticket? [21:23:40] matanya, ... you want to put labs stuff behind prod hostnames? 
[21:23:44] matanya: (a) wildcard certs only cover one level of depth and (b) we don't want to share certs between prod/beta for sure [21:23:52] no Krenair , a fake 301 [21:24:21] real world ppl don't understand "beta" and will easily be fooled if beta is actually owned [21:24:30] in that case [21:24:53] real world ppl, liked that one [21:24:57] PROBLEM - puppet last run on snapshot1004 is CRITICAL puppet fail [21:25:17] PROBLEM - RAID on snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:25:26] anyways, new new ticket incoming at some point in the next hour or so [21:25:55] bblack: yay! [21:27:18] (03PS2) 10Aude: Enable use of subscriptions table on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207535 [21:27:20] !log twentyafterfour Purged l10n cache for 1.26wmf2 [21:27:25] Logged the message, Master [21:28:10] (03CR) 10Manybubbles: [C: 031] admin: delete search-roots and search-users [puppet] - 10https://gerrit.wikimedia.org/r/207633 (owner: 10Dzahn) [21:28:28] ok all done deploying the train [21:28:36] RECOVERY - RAID on snapshot1004 is OK no RAID installed [21:28:46] 10Ops-Access-Requests, 6operations, 10CirrusSearch, 6Search-Team: James Douglas sudo on Elasticsearch machines - https://phabricator.wikimedia.org/T97559#1246473 (10Manybubbles) Added @tfinc as this is an access request and he's the manager who'd give approval. If that is still the process. [21:30:53] * hoo sees a lot of UnexpectedValueException from line 957 of /srv/mediawiki/php-1.26wmf3/includes/api/ApiResult.php: Unknown type 'lang' [21:31:04] Are people aware? [21:31:07] wmf3 [21:31:55] what is wmf3? [21:32:14] jgage: 1.26wmf3 [21:32:19] thanks [21:32:19] jgage: the version on wikipedias [21:32:27] wasn't familiar with the shorthand [21:32:38] hoo, that was already known [21:32:52] jgage: https://www.mediawiki.org/wiki/MediaWiki_1.26/Roadmap#Schedule_for_the_deployments [21:32:52] hoo, I don't think it should still be occurring though... 
anomie ^ [21:34:12] greg-g: cool, so organized :) [21:37:37] (03CR) 10Dzahn: [C: 032] admin: delete search-roots and search-users [puppet] - 10https://gerrit.wikimedia.org/r/207633 (owner: 10Dzahn) [21:39:27] (03PS1) 10Dzahn: add jdouglas to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/207643 (https://phabricator.wikimedia.org/T97559) [21:40:26] PROBLEM - RAID on snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:40:38] (03CR) 10Manybubbles: [C: 031] add jdouglas to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/207643 (https://phabricator.wikimedia.org/T97559) (owner: 10Dzahn) [21:45:16] RECOVERY - RAID on snapshot1004 is OK no RAID installed [21:46:23] hoo: is it still happening? [21:46:49] legoktm: Yes [21:46:59] :( [21:47:24] very concerning [21:47:33] do we have a stacktrace? [21:47:42] yes it's exception.log [21:47:50] * aude cannot look [21:48:20] http://fpaste.org/216969/43034409/raw/ [21:48:21] https://phabricator.wikimedia.org/T97469 ? [21:49:29] hmm [21:49:47] RECOVERY - puppet last run on snapshot1004 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:54:36] somehow $type is becoming 'lang' [21:55:09] so $type = $metadata[self::META_TYPE]; or $type = $defaultType; [21:55:31] legoktm: Was changing the output format of opensearch wanted? [21:55:34] That broke wikibase [21:55:37] argh [21:55:38] no. [21:55:41] it's easy to adopt, just wonder whether we should [21:55:49] *nothing* should have changed [21:56:09] can you file a bug for that? [21:56:15] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1245178 (10Andrew) All instances are now migrated off of labvirt1005 -- Chris, you can do whatever you need to fix this box; I'm going to re-image it before putting it back to work. 
[21:57:09] legoktm: I guess so, yes [21:57:17] Will have to dig deeper though [21:57:23] I know it broke, but not yet how [21:57:54] it's probably missing one of the ApiResult::BC_ flags [21:59:31] quite possible [21:59:36] No [21:59:39] seems https://gerrit.wikimedia.org/r/#/c/191103/11/includes/api/ApiOpenSearch.php is the problem [21:59:46] because the index keys shifted [21:59:48] (03Abandoned) 10Kaldari: Revert "Turning on WikiGrok on English Wikipedia (for 2 week test)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206030 (owner: 10Kaldari) [22:00:08] We used to use response[1], response[0], now we need response[2], response[1] [22:01:53] Diffusion! [22:14:02] legoktm: Shall I investigate more and open a bug or shall we just revert? [22:15:06] hoo: it's like 10-15 patches + extensions so a revert is going to be tough... [22:15:28] hoo: please file a bug yeah [22:17:51] (03CR) 10Kaldari: "As soon as change Ibb9a398aa79 goes live next week, we can deploy this as well." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202926 (https://phabricator.wikimedia.org/T95007) (owner: 10Kaldari) [22:28:47] PROBLEM - puppet last run on mw2163 is CRITICAL puppet fail [22:31:08] doh. dbstore1002's root partition is full. [22:31:48] and for once /var/log/ is not to blame [22:34:19] (03PS2) 10Yuvipanda: tools: Enable hba automatically for exec hosts [puppet] - 10https://gerrit.wikimedia.org/r/207537 [22:34:31] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Enable hba automatically for exec hosts [puppet] - 10https://gerrit.wikimedia.org/r/207537 (owner: 10Yuvipanda) [22:35:28] (03CR) 10Brian Wolff: "I don't think this hook is working. https://en.wikipedia.org/w/api.php?titles=File%3AJere_Beasley.JPG&iiprop=extmetadata&prop=imageinfo&ii" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207279 (https://phabricator.wikimedia.org/T97469) (owner: 10Anomie) [22:36:19] dbstore1002: it's /srv/tmp [22:37:02] springle: you around? 
[22:37:53] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1247411 (10ssastry) Goes without saying that the individual services should also be able to work with the fact that multiple versions of t... [22:38:12] jgage: aye, just got the icinga ping [22:39:00] springle great, ok. i didn't get an alert, just happened to notice it on the dashboard [22:42:06] RECOVERY - Disk space on dbstore1002 is OK: DISK OK [22:43:36] (03PS1) 10Yuvipanda: tools: Set hba as a global variable [puppet] - 10https://gerrit.wikimedia.org/r/207656 [22:44:15] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Set hba as a global variable [puppet] - 10https://gerrit.wikimedia.org/r/207656 (owner: 10Yuvipanda) [22:44:17] (03CR) 10jenkins-bot: [V: 04-1] tools: Set hba as a global variable [puppet] - 10https://gerrit.wikimedia.org/r/207656 (owner: 10Yuvipanda) [22:47:27] RECOVERY - puppet last run on mw2163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:48:07] PROBLEM - puppet last run on mw2073 is CRITICAL puppet fail [22:48:15] !log dbstore1002 /srv/tmp filled up. 
killed queries, fixed mount point, restarted mysqld [22:48:17] PROBLEM - puppet last run on mw2113 is CRITICAL puppet fail [22:48:22] Logged the message, Master [22:48:26] PROBLEM - puppet last run on lvs2004 is CRITICAL puppet fail [22:48:27] PROBLEM - puppet last run on mw1065 is CRITICAL puppet fail [22:48:28] !log legoktm Synchronized php-1.26wmf3/includes/MovePage.php: MovePage: Move target existence check into isValidMove() - https://gerrit.wikimedia.org/r/#/c/207557/ (duration: 00m 26s) [22:48:34] Logged the message, Master [22:48:37] PROBLEM - puppet last run on mw1118 is CRITICAL puppet fail [22:48:46] PROBLEM - puppet last run on mw2079 is CRITICAL puppet fail [22:48:46] PROBLEM - puppet last run on mw2097 is CRITICAL puppet fail [22:48:47] PROBLEM - puppet last run on mw1166 is CRITICAL puppet fail [22:48:47] PROBLEM - puppet last run on mw1123 is CRITICAL puppet fail [22:48:57] PROBLEM - puppet last run on mw2003 is CRITICAL puppet fail [22:48:57] PROBLEM - puppet last run on cp3014 is CRITICAL puppet fail [22:48:57] PROBLEM - puppet last run on mw2050 is CRITICAL puppet fail [22:48:57] PROBLEM - puppet last run on cp1056 is CRITICAL puppet fail [22:48:58] PROBLEM - puppet last run on db1028 is CRITICAL puppet fail [22:49:06] PROBLEM - puppet last run on mw1025 is CRITICAL puppet fail [22:49:07] PROBLEM - puppet last run on ruthenium is CRITICAL puppet fail [22:49:07] PROBLEM - puppet last run on virt1006 is CRITICAL puppet fail [22:49:07] PROBLEM - puppet last run on cp4008 is CRITICAL puppet fail [22:49:07] PROBLEM - puppet last run on db1015 is CRITICAL puppet fail [22:49:08] PROBLEM - puppet last run on mw1061 is CRITICAL puppet fail [22:49:16] PROBLEM - puppet last run on mw2093 is CRITICAL puppet fail [22:49:16] PROBLEM - puppet last run on db2018 is CRITICAL puppet fail [22:49:16] PROBLEM - puppet last run on mw2017 is CRITICAL puppet fail [22:49:16] PROBLEM - puppet last run on mc2011 is CRITICAL puppet fail [22:49:16] PROBLEM - puppet last run 
on mw2011 is CRITICAL puppet fail [22:49:17] PROBLEM - puppet last run on mw1170 is CRITICAL puppet fail [22:49:17] PROBLEM - puppet last run on db1046 is CRITICAL puppet fail [22:49:26] PROBLEM - puppet last run on mw2136 is CRITICAL puppet fail [22:49:26] PROBLEM - puppet last run on labvirt1003 is CRITICAL puppet fail [22:49:26] PROBLEM - puppet last run on mw1114 is CRITICAL puppet fail [22:49:26] PROBLEM - puppet last run on mw1039 is CRITICAL puppet fail [22:49:27] PROBLEM - puppet last run on db1059 is CRITICAL puppet fail [22:49:27] PROBLEM - puppet last run on mw2212 is CRITICAL puppet fail [22:49:27] PROBLEM - puppet last run on mw1054 is CRITICAL puppet fail [22:49:28] PROBLEM - puppet last run on mw2206 is CRITICAL puppet fail [22:49:28] PROBLEM - puppet last run on mw2134 is CRITICAL puppet fail [22:49:36] PROBLEM - puppet last run on db1040 is CRITICAL puppet fail [22:49:36] PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail [22:49:36] PROBLEM - puppet last run on mw1092 is CRITICAL puppet fail [22:49:37] PROBLEM - puppet last run on mw1211 is CRITICAL puppet fail [22:49:46] PROBLEM - puppet last run on mw1042 is CRITICAL puppet fail [22:49:47] PROBLEM - puppet last run on analytics1030 is CRITICAL puppet fail [22:49:47] PROBLEM - puppet last run on mw2090 is CRITICAL puppet fail [22:49:47] PROBLEM - puppet last run on mw2030 is CRITICAL puppet fail [22:49:47] PROBLEM - puppet last run on mw1175 is CRITICAL puppet fail [22:49:56] PROBLEM - puppet last run on analytics1010 is CRITICAL puppet fail [22:49:56] PROBLEM - puppet last run on mw1144 is CRITICAL puppet fail [22:49:56] PROBLEM - puppet last run on labnet1001 is CRITICAL puppet fail [22:49:57] PROBLEM - puppet last run on lvs2001 is CRITICAL puppet fail [22:49:57] PROBLEM - puppet last run on mw2059 is CRITICAL puppet fail [22:49:58] PROBLEM - puppet last run on holmium is CRITICAL puppet fail [22:50:06] PROBLEM - puppet last run on mw1177 is CRITICAL puppet fail [22:50:06] PROBLEM - 
puppet last run on mw1172 is CRITICAL puppet fail [22:50:06] PROBLEM - puppet last run on cp4019 is CRITICAL puppet fail [22:50:06] PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail [22:50:06] PROBLEM - puppet last run on mw1119 is CRITICAL puppet fail [22:50:07] PROBLEM - puppet last run on mw1129 is CRITICAL puppet fail [22:50:07] PROBLEM - puppet last run on wtp2012 is CRITICAL puppet fail [22:50:08] PROBLEM - puppet last run on cp3042 is CRITICAL puppet fail [22:50:08] PROBLEM - puppet last run on mw1011 is CRITICAL puppet fail [22:50:09] PROBLEM - puppet last run on db1034 is CRITICAL puppet fail [22:50:09] PROBLEM - puppet last run on tin is CRITICAL puppet fail [22:50:10] PROBLEM - puppet last run on elastic1027 is CRITICAL puppet fail [22:50:16] PROBLEM - puppet last run on wtp2015 is CRITICAL puppet fail [22:50:16] PROBLEM - puppet last run on mw2192 is CRITICAL puppet fail [22:50:16] PROBLEM - puppet last run on mw2123 is CRITICAL puppet fail [22:50:16] PROBLEM - puppet last run on mw2096 is CRITICAL puppet fail [22:50:16] PROBLEM - puppet last run on mw2092 is CRITICAL puppet fail [22:50:17] PROBLEM - puppet last run on mw2095 is CRITICAL puppet fail [22:50:17] PROBLEM - puppet last run on cp4003 is CRITICAL puppet fail [22:50:18] PROBLEM - puppet last run on db1021 is CRITICAL puppet fail [22:50:18] PROBLEM - puppet last run on dbproxy1001 is CRITICAL puppet fail [22:50:19] PROBLEM - puppet last run on wtp1005 is CRITICAL puppet fail [22:50:36] PROBLEM - puppet last run on logstash1006 is CRITICAL puppet fail [22:50:36] PROBLEM - puppet last run on cp1058 is CRITICAL puppet fail [22:50:36] PROBLEM - puppet last run on mw1237 is CRITICAL puppet fail [22:50:37] PROBLEM - puppet last run on db1067 is CRITICAL puppet fail [22:50:37] PROBLEM - puppet last run on labstore1001 is CRITICAL puppet fail [22:50:37] PROBLEM - puppet last run on wtp1012 is CRITICAL puppet fail [22:50:37] PROBLEM - puppet last run on mw1251 is CRITICAL puppet fail 
[22:50:38] PROBLEM - puppet last run on db2036 is CRITICAL puppet fail [22:50:38] PROBLEM - puppet last run on mw2056 is CRITICAL puppet fail [22:50:39] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail [22:50:46] PROBLEM - puppet last run on mw1249 is CRITICAL puppet fail [22:50:47] PROBLEM - puppet last run on polonium is CRITICAL puppet fail [22:50:47] PROBLEM - puppet last run on mw1002 is CRITICAL puppet fail [22:50:47] PROBLEM - puppet last run on mw2085 is CRITICAL puppet fail [22:50:47] PROBLEM - puppet last run on mw2002 is CRITICAL puppet fail [22:50:47] PROBLEM - puppet last run on virt1001 is CRITICAL puppet fail [22:50:56] PROBLEM - puppet last run on cp3004 is CRITICAL puppet fail [22:50:56] PROBLEM - puppet last run on db2040 is CRITICAL puppet fail [22:50:57] PROBLEM - puppet last run on mw2166 is CRITICAL puppet fail [22:50:57] PROBLEM - puppet last run on mw2086 is CRITICAL puppet fail [22:50:57] PROBLEM - puppet last run on db2007 is CRITICAL puppet fail [22:50:57] PROBLEM - puppet last run on lvs3004 is CRITICAL puppet fail [22:50:57] PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail [22:50:58] PROBLEM - puppet last run on cp3041 is CRITICAL puppet fail [22:50:58] PROBLEM - puppet last run on elastic1030 is CRITICAL puppet fail [22:50:59] PROBLEM - puppet last run on db1023 is CRITICAL puppet fail [22:50:59] PROBLEM - puppet last run on gallium is CRITICAL puppet fail [22:51:06] PROBLEM - puppet last run on mw1044 is CRITICAL puppet fail [22:51:06] PROBLEM - puppet last run on db2029 is CRITICAL puppet fail [22:51:06] PROBLEM - puppet last run on mw1208 is CRITICAL puppet fail [22:51:06] PROBLEM - puppet last run on mw1213 is CRITICAL puppet fail [22:51:06] PROBLEM - puppet last run on db1016 is CRITICAL puppet fail [22:51:07] PROBLEM - puppet last run on virt1004 is CRITICAL puppet fail [22:51:07] PROBLEM - puppet last run on mw1126 is CRITICAL puppet fail [22:51:08] PROBLEM - puppet last run on elastic1022 is CRITICAL 
puppet fail [22:51:08] PROBLEM - puppet last run on mc1012 is CRITICAL puppet fail [22:51:09] PROBLEM - puppet last run on mw1055 is CRITICAL puppet fail [22:51:09] PROBLEM - puppet last run on lvs4003 is CRITICAL puppet fail [22:51:10] PROBLEM - puppet last run on db2057 is CRITICAL puppet fail [22:51:10] PROBLEM - puppet last run on mw2130 is CRITICAL puppet fail [22:51:26] PROBLEM - puppet last run on mw1190 is CRITICAL puppet fail [22:51:26] PROBLEM - puppet last run on cp3003 is CRITICAL puppet fail [22:51:26] PROBLEM - puppet last run on mc1014 is CRITICAL puppet fail [22:51:27] PROBLEM - puppet last run on db2043 is CRITICAL puppet fail [22:51:27] PROBLEM - puppet last run on es1007 is CRITICAL puppet fail [22:51:27] PROBLEM - puppet last run on mw2190 is CRITICAL puppet fail [22:51:27] PROBLEM - puppet last run on db2068 is CRITICAL puppet fail [22:51:28] PROBLEM - puppet last run on mw2132 is CRITICAL puppet fail [22:51:28] PROBLEM - puppet last run on ms-be2011 is CRITICAL puppet fail [22:51:29] PROBLEM - puppet last run on stat1003 is CRITICAL puppet fail [22:51:36] PROBLEM - puppet last run on mw1206 is CRITICAL puppet fail [22:51:36] PROBLEM - puppet last run on db1039 is CRITICAL puppet fail [22:51:36] PROBLEM - puppet last run on mw1149 is CRITICAL puppet fail [22:51:36] PROBLEM - puppet last run on antimony is CRITICAL puppet fail [22:51:37] PROBLEM - puppet last run on mw1079 is CRITICAL puppet fail [22:51:37] PROBLEM - puppet last run on mw1133 is CRITICAL puppet fail [22:51:37] PROBLEM - puppet last run on elastic1019 is CRITICAL puppet fail [22:51:38] PROBLEM - puppet last run on db2038 is CRITICAL puppet fail [22:51:38] PROBLEM - puppet last run on db2062 is CRITICAL puppet fail [22:51:39] PROBLEM - puppet last run on mw1162 is CRITICAL puppet fail [22:51:39] PROBLEM - puppet last run on mw2200 is CRITICAL puppet fail [22:51:40] PROBLEM - puppet last run on mw2062 is CRITICAL puppet fail [22:51:40] PROBLEM - puppet last run on snapshot1001 is 
CRITICAL puppet fail [22:51:43] (03PS1) 10Yuvipanda: tools: Rever attempts to set $ssh_hba in puppet [puppet] - 10https://gerrit.wikimedia.org/r/207659 [22:51:46] PROBLEM - puppet last run on cp4018 is CRITICAL puppet fail [22:51:47] PROBLEM - puppet last run on cp4001 is CRITICAL puppet fail [22:51:47] PROBLEM - puppet last run on db1048 is CRITICAL puppet fail [22:51:47] PROBLEM - puppet last run on cp3006 is CRITICAL puppet fail [22:51:47] PROBLEM - puppet last run on lvs2006 is CRITICAL puppet fail [22:51:47] PROBLEM - puppet last run on mw1051 is CRITICAL puppet fail [22:51:47] PROBLEM - puppet last run on cp3005 is CRITICAL puppet fail [22:51:48] PROBLEM - puppet last run on mw1111 is CRITICAL puppet fail [22:51:48] PROBLEM - puppet last run on cp1071 is CRITICAL puppet fail [22:52:00] PROBLEM - puppet last run on mw1049 is CRITICAL puppet fail [22:52:00] PROBLEM - puppet last run on analytics1022 is CRITICAL puppet fail [22:52:01] PROBLEM - puppet last run on gadolinium is CRITICAL puppet fail [22:52:04] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Rever attempts to set $ssh_hba in puppet [puppet] - 10https://gerrit.wikimedia.org/r/207659 (owner: 10Yuvipanda) [22:52:06] PROBLEM - puppet last run on mw2152 is CRITICAL puppet fail [22:52:06] PROBLEM - puppet last run on nembus is CRITICAL puppet fail [22:52:07] PROBLEM - puppet last run on mw2147 is CRITICAL puppet fail [22:52:07] PROBLEM - puppet last run on es2004 is CRITICAL puppet fail [22:52:07] PROBLEM - puppet last run on db2016 is CRITICAL puppet fail [22:52:07] PROBLEM - puppet last run on mc2014 is CRITICAL puppet fail [22:52:07] PROBLEM - puppet last run on ms-be2001 is CRITICAL puppet fail [22:52:08] PROBLEM - puppet last run on mw1084 is CRITICAL puppet fail [22:52:16] PROBLEM - puppet last run on mw1168 is CRITICAL puppet fail [22:52:16] PROBLEM - puppet last run on mw1151 is CRITICAL puppet fail [22:52:16] PROBLEM - puppet last run on mw1238 is CRITICAL puppet fail [22:52:16] PROBLEM - puppet last 
run on restbase1002 is CRITICAL puppet fail [22:52:16] PROBLEM - puppet last run on labstore1002 is CRITICAL puppet fail [22:52:17] PROBLEM - puppet last run on cp1046 is CRITICAL puppet fail [22:52:17] PROBLEM - puppet last run on mw1227 is CRITICAL puppet fail [22:52:18] PROBLEM - puppet last run on mc1005 is CRITICAL puppet fail [22:52:18] PROBLEM - puppet last run on mw2168 is CRITICAL puppet fail [22:52:19] PROBLEM - puppet last run on mw2125 is CRITICAL puppet fail [22:52:19] PROBLEM - puppet last run on mw2054 is CRITICAL puppet fail [22:52:20] um [22:52:20] PROBLEM - puppet last run on db1060 is CRITICAL puppet fail [22:52:20] PROBLEM - puppet last run on mw1180 is CRITICAL puppet fail [22:52:21] PROBLEM - puppet last run on db2004 is CRITICAL puppet fail [22:52:21] is that me? [22:52:33] * YuviPanda checks [22:52:36] PROBLEM - puppet last run on elastic1006 is CRITICAL puppet fail [22:52:36] PROBLEM - puppet last run on analytics1002 is CRITICAL puppet fail [22:52:37] PROBLEM - puppet last run on mw2149 is CRITICAL puppet fail [22:52:37] PROBLEM - puppet last run on mw2107 is CRITICAL puppet fail [22:52:46] PROBLEM - puppet last run on mw1050 is CRITICAL puppet fail [22:52:46] PROBLEM - puppet last run on mw1057 is CRITICAL puppet fail [22:52:47] PROBLEM - puppet last run on analytics1026 is CRITICAL puppet fail [22:52:47] PROBLEM - puppet last run on bast4001 is CRITICAL puppet fail [22:52:47] PROBLEM - puppet last run on ms-be2008 is CRITICAL puppet fail [22:52:47] PROBLEM - puppet last run on mw2048 is CRITICAL puppet fail [22:52:47] PROBLEM - puppet last run on cp1063 is CRITICAL puppet fail [22:52:48] yup, that was me [22:52:56] PROBLEM - puppet last run on db1020 is CRITICAL puppet fail [22:52:56] PROBLEM - puppet last run on virt1010 is CRITICAL puppet fail [22:52:56] PROBLEM - puppet last run on dbproxy1004 is CRITICAL puppet fail [22:52:57] PROBLEM - puppet last run on ms-fe3002 is CRITICAL puppet fail [22:52:57] PROBLEM - puppet last run on 
mw1183 is CRITICAL puppet fail [22:52:57] PROBLEM - puppet last run on db1004 is CRITICAL puppet fail [22:52:58] PROBLEM - puppet last run on mw1146 is CRITICAL puppet fail [22:52:58] PROBLEM - puppet last run on analytics1032 is CRITICAL puppet fail [22:52:58] PROBLEM - puppet last run on mw1202 is CRITICAL puppet fail [22:53:01] (sorry everyone) [22:53:07] PROBLEM - puppet last run on labmon1001 is CRITICAL puppet fail [22:53:07] PROBLEM - puppet last run on cp1048 is CRITICAL puppet fail [22:53:07] PROBLEM - puppet last run on db2037 is CRITICAL puppet fail [22:53:16] PROBLEM - puppet last run on mw1258 is CRITICAL puppet fail [22:53:16] PROBLEM - puppet last run on wtp1018 is CRITICAL puppet fail [22:53:16] PROBLEM - puppet last run on labsdb1006 is CRITICAL puppet fail [22:53:16] PROBLEM - puppet last run on plutonium is CRITICAL puppet fail [22:53:17] PROBLEM - puppet last run on rubidium is CRITICAL puppet fail [22:53:17] PROBLEM - puppet last run on virt1007 is CRITICAL puppet fail [22:53:17] PROBLEM - puppet last run on ms-be1008 is CRITICAL puppet fail [22:53:18] PROBLEM - puppet last run on mw2151 is CRITICAL puppet fail [22:53:18] PROBLEM - puppet last run on mw2191 is CRITICAL puppet fail [22:53:19] PROBLEM - puppet last run on es2007 is CRITICAL puppet fail [22:53:19] PROBLEM - puppet last run on mw2065 is CRITICAL puppet fail [22:53:26] PROBLEM - puppet last run on ms-be2005 is CRITICAL puppet fail [22:53:26] PROBLEM - puppet last run on mw2051 is CRITICAL puppet fail [22:53:27] PROBLEM - puppet last run on mc2006 is CRITICAL puppet fail [22:53:27] PROBLEM - puppet last run on mw1181 is CRITICAL puppet fail [22:53:27] PROBLEM - puppet last run on osmium is CRITICAL puppet fail [22:53:27] PROBLEM - puppet last run on db1071 is CRITICAL puppet fail [22:53:27] PROBLEM - puppet last run on mw1165 is CRITICAL puppet fail [22:53:28] PROBLEM - puppet last run on argon is CRITICAL puppet fail [22:53:28] PROBLEM - puppet last run on cp3009 is CRITICAL puppet 
fail [22:53:29] PROBLEM - puppet last run on mw1198 is CRITICAL puppet fail [22:53:29] PROBLEM - puppet last run on mw1074 is CRITICAL puppet fail [22:53:30] PROBLEM - puppet last run on mw1034 is CRITICAL puppet fail [22:53:30] PROBLEM - puppet last run on elastic1015 is CRITICAL puppet fail [22:53:34] puppet shaming [22:53:43] yes. [22:53:47] PROBLEM - puppet last run on db2066 is CRITICAL puppet fail [22:53:47] PROBLEM - puppet last run on mw2094 is CRITICAL puppet fail [22:53:47] PROBLEM - puppet last run on mw2042 is CRITICAL puppet fail [22:53:47] PROBLEM - puppet last run on mw1248 is CRITICAL puppet fail [22:53:47] PROBLEM - puppet last run on mw1163 is CRITICAL puppet fail [22:53:54] "doctor, it hurts when i break puppet." [22:53:56] PROBLEM - puppet last run on analytics1023 is CRITICAL puppet fail [22:53:56] PROBLEM - puppet last run on mw1156 is CRITICAL puppet fail [22:53:57] PROBLEM - puppet last run on db1062 is CRITICAL puppet fail [22:53:57] PROBLEM - puppet last run on lvs2003 is CRITICAL puppet fail [22:53:57] PROBLEM - puppet last run on wtp2009 is CRITICAL puppet fail [22:53:57] PROBLEM - puppet last run on hafnium is CRITICAL puppet fail [22:53:57] PROBLEM - puppet last run on mw2189 is CRITICAL puppet fail [22:53:58] PROBLEM - puppet last run on db2023 is CRITICAL puppet fail [22:53:58] PROBLEM - puppet last run on mw1116 is CRITICAL puppet fail [22:53:59] PROBLEM - puppet last run on mw1056 is CRITICAL puppet fail [22:54:45] am tailing the log to see if something else comes up [22:56:45] (03PS1) 10Dereckson: Enable WikiLove on hy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207662 (https://phabricator.wikimedia.org/T97563) [23:00:04] RoanKattouw, ^d, Krinkle, James_F, bmansurov, jdlrobson, aude, Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150429T2300). Please do the needful. 
[23:00:12] (03PS3) 10Ori.livneh: wikimedia.vcl: issue an HTTP 204 for all beacon/* requests [puppet] - 10https://gerrit.wikimedia.org/r/206351 [23:00:22] bblack: ^ [23:00:38] Alright, I'll start SWAT in a few minutes [23:00:41] pong [23:00:45] * aude here [23:01:53] (the CRITs have died down but I’ll wait for the recovery flood too [23:02:17] 'evening [23:02:40] ori: no statsv? dying anyways? [23:02:59] Can I get someone to merge this? https://gerrit.wikimedia.org/r/#/c/207660/ I need to get it on the swat [23:03:12] bblack: see commit message ;) [23:03:37] also, nice economy of gerrit changeids :) [23:04:09] ty ori [23:05:56] here [23:06:34] (03PS4) 10BBlack: wikimedia.vcl: issue an HTTP 204 for all beacon/* requests [puppet] - 10https://gerrit.wikimedia.org/r/206351 (owner: 10Ori.livneh) [23:07:37] RoanKattouw: can I add https://gerrit.wikimedia.org/r/#/c/207666/ ? [23:08:46] Sure [23:08:48] (03CR) 10BBlack: [C: 032] wikimedia.vcl: issue an HTTP 204 for all beacon/* requests [puppet] - 10https://gerrit.wikimedia.org/r/206351 (owner: 10Ori.livneh) [23:09:23] (thanks bblack!) 
[23:09:44] (Recoveries in the ircecho log atm) [23:09:47] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1247668 (10BBlack) [23:09:50] RoanKattouw: ty [23:09:59] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1190727 (10BBlack) [23:10:02] added [23:11:07] Alright, config patches first [23:11:32] (03CR) 10Catrope: [C: 032] Enable use of subscriptions table on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207535 (owner: 10Aude) [23:12:02] (03PS1) 10GWicke: VE: Load HTML directly from RESTBase for all Wikipedias, take three [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207671 [23:12:08] yay, thanks [23:13:33] (03CR) 10Catrope: [C: 032] Prevent new wikis from using Graph: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 (owner: 10Dereckson) [23:13:48] RoanKattouw: don't merge 206777 [23:13:54] I won't [23:13:54] (03CR) 10GWicke: "Should not be deployed before https://gerrit.wikimedia.o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207671 (owner: 10GWicke) [23:13:59] I would like to test 206776 first [23:14:18] Dereckson: Hmm why does wmgUseGraphWithNamespace not use the same list of wikis as wmgUseGraph? [23:14:31] Specifically labswiki and outreachwiki [23:14:53] Krenair noticed labs. and outreach. don't use the namespaces currently. 
[23:15:10] Yeah I just checked and they're both empty [23:15:47] OK I'll push these out now [23:16:23] Ugh they're not merged yet [23:16:29] * RoanKattouw glares at Jenkins [23:17:29] (03PS4) 10Dereckson: Enable Graph extension on se.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206777 (https://phabricator.wikimedia.org/T97027) [23:17:40] (03Merged) 10jenkins-bot: Enable use of subscriptions table on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207535 (owner: 10Aude) [23:17:42] (03Merged) 10jenkins-bot: Prevent new wikis from using Graph: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 (owner: 10Dereckson) [23:17:57] (03CR) 10Dereckson: "PS4: s/svwikimedia/sewikimedia (chapter is se., language sv.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206777 (https://phabricator.wikimedia.org/T97027) (owner: 10Dereckson) [23:18:47] !log catrope Synchronized wmf-config/Wikibase.php: Enable use of subscriptions table on testwikidata (duration: 00m 31s) [23:18:55] Logged the message, Master [23:20:02] !log catrope Synchronized wmf-config/InitialiseSettings.php: Add wmgUseGraphWithNamespace (duration: 00m 28s) [23:20:08] Logged the message, Master [23:20:24] Testing. [23:20:42] Dereckson: Hold on [23:20:51] CommonSettings.php is still in progress [23:20:58] !log catrope Synchronized wmf-config/CommonSettings.php: Disable Graph namespace on all wikis except the ones that already have it (duration: 00m 22s) [23:21:03] Logged the message, Master [23:21:31] Dereckson: There --^^ [23:21:37] Ok. [23:21:45] Yup, still works on mediawiki. and meta. [23:22:22] Cool [23:22:27] Ready for the other one? [23:22:30] Yes. 
[23:22:45] (03CR) 10Catrope: [C: 032] Enable Graph extension on se.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206777 (https://phabricator.wikimedia.org/T97027) (owner: 10Dereckson) [23:22:51] (03Merged) 10jenkins-bot: Enable Graph extension on se.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206777 (https://phabricator.wikimedia.org/T97027) (owner: 10Dereckson) [23:22:58] RoanKattouw: got the Wikibase-labs.php one also? [23:23:01] no hurry [23:23:27] aude: That doesn't need to be synced on the cluster, does it? [23:23:36] ah, right [23:23:56] nothing looks broken, so think it's all good [23:24:06] !log catrope Synchronized wmf-config/InitialiseSettings.php: Enable Graph extension on sewikimedia (duration: 00m 21s) [23:24:08] alright, the spam seems to have passed [23:24:11] Logged the message, Master [23:24:16] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [23:25:13] bblack: merged your changes... [23:25:16] hope that’s ok [23:25:17] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1247720 (10BBlack) >>! In T97204#1238421, @GWicke wrote: > @bblack, I haven't seen anything explicitly stating that `Retry-After` is global to a service. The... [23:25:28] bblack: just realized that’s a VCL change and you might have wanted to do something funky [23:25:37] hmm, or not. [23:25:44] anyway, is merged now :) [23:25:56] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [23:26:49] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1247725 (10BBlack) Perhaps confusion on that last point above is about timeouts vs 503s. I think reactions to hard timeouts and 503s need to have the same b... 
[23:27:13] Dereckson: sewikimedia done; I suppose you may not be able to test that though [23:27:22] Yes, I were, they don't block wiki edit. [23:27:29] https://se.wikimedia.org/w/index.php?title=Anv%C3%A4ndare:Dereckson/Sandbox&action=history [23:27:38] Excellent [23:27:41] YuviPanda: yeah that's fine, I just completely forgot while juggling other things [23:27:41] Extension is on Special:Version, but nothing happens. [23:27:49] I'll inquire with yurik about that. [23:28:06] brb [23:28:58] bmansurov: jdlrobson: One of you around for your SWAT? [23:28:59] RoanKattouw: oh graph has appeared finally! [23:29:02] So tested fine. [23:29:06] yup [23:29:08] bmansurov: jdlrobson: Because https://gerrit.wikimedia.org/r/#/c/207634/ does not cherry-pick cleanly [23:29:26] RoanKattouw: which branch? [23:29:31] wmf3 [23:29:31] yurik: any idea why it needs 3/4 minutes to get the Vega library to load? Resource loader caching resources? [23:29:37] is what it was listed for in SWAT [23:29:41] Dereckson: Yeah that seems plausible [23:29:48] Dereckson: RL stuff can take 5-10 minutes [23:30:06] Ok. [23:30:17] RoanKattouw: should be wmf4 i think [23:30:19] let me double check [23:30:38] RoanKattouw: yeh it's wmf4 [23:30:41] Oh OK [23:30:44] The deployment page said 3 [23:30:48] I'll try it with 4 [23:30:48] guess bmansurov made a mistake [23:30:52] it's only impacting mediawiki.org [23:31:05] OK that worked [23:31:10] (03CR) 10Dereckson: [C: 04-1] "Waiting translation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207662 (https://phabricator.wikimedia.org/T97563) (owner: 10Dereckson) [23:31:17] RoanKattouw: W00t [23:32:41] bd808: when your AffCom contact page patch will be deploy, the Wikimedia-Site-Request to deploy queue will be empty :) [23:33:06] (against a peak at 16 last week) [23:33:23] Dereckson: cool. 
Brad made some great changes but now I have more work to do [23:33:35] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1247738 (10GWicke) >>! In T97204#1247720, @BBlack wrote: > All the complications of defining a virtual "service" as an entity orthogonal to the hierarchy of... [23:33:38] * bd808 crosses fingers he can get it done mostly on time [23:35:03] (03PS1) 10Springle: script used for non-replicated dbstore backups [puppet] - 10https://gerrit.wikimedia.org/r/207680 (https://phabricator.wikimedia.org/T95835) [23:35:34] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1247744 (10GWicke) > I think reactions to hard timeouts and 503s need to have the same behaviors from a consuming service's perspective. This would mean tha... [23:48:24] rmoen: Your Gather deploy is underway [23:48:33] RoanKattouw: thanks [23:48:43] !log catrope Synchronized php-1.26wmf3/extensions/Gather: SWAT (duration: 00m 32s) [23:48:50] Logged the message, Master [23:48:53] James_F: WikiEditor sampling deploy underway [23:49:02] RoanKattouw: Ta.
[23:49:07] !log catrope Synchronized php-1.26wmf3/extensions/WikiEditor: SWAT (duration: 00m 23s) [23:49:12] Logged the message, Master [23:49:13] RoanKattouw: Confirmed fixed [23:49:16] gwicke: "Disable _prefetch" deploy underway [23:49:20] thanks [23:49:46] !log catrope Synchronized php-1.26wmf3/extensions/VisualEditor: SWAT (duration: 00m 39s) [23:49:52] Logged the message, Master [23:50:24] RoanKattouw: yay [23:50:24] Krinkle: jQuery update underway [23:50:50] !log catrope Synchronized php-1.26wmf3/resources/lib/jquery/jquery.js: Update jQuery to 1.11.3 (duration: 00m 31s) [23:50:55] Logged the message, Master [23:51:05] (03CR) 10Catrope: [C: 032] VE: Load HTML directly from RESTBase for all Wikipedias, take three [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207671 (owner: 10GWicke) [23:52:14] Thanks for the deploy RoanKattouw. [23:52:40] Dereckson: A pleasure as always [23:52:48] RoanKattouw: thx [23:54:10] (03Merged) 10jenkins-bot: VE: Load HTML directly from RESTBase for all Wikipedias, take three [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207671 (owner: 10GWicke) [23:57:03] !log catrope Synchronized php-1.26wmf4/extensions/MobileFrontend: SWAT (duration: 00m 33s) [23:57:08] Logged the message, Master [23:57:13] jdlrobson: bmansurov: ---^^ [23:57:19] testomg [23:58:00] jdlrobson: That's a hilarious typo [23:58:03] RoanKattouw: works with debug=true [23:58:06] but not without [23:58:07] !log catrope Synchronized wmf-config/InitialiseSettings.php: Enable direct RESTbase load on all Wikipedias (duration: 00m 21s) [23:58:12] RoanKattouw: :) [23:58:12] Logged the message, Master [23:58:23] jdlrobson: Could take 5-10 mins before debug=false starts working [23:58:33] yup [23:58:35] i'll wait it out [23:58:38] RoanKattouw: thank you! [23:58:42] but looks good so thanks