[00:00:05] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171215T0000).
[00:00:05] Jhs and MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:15] sup
[00:01:08] (PS2) Ayounsi: [WIP] Bird-lg [puppet] - https://gerrit.wikimedia.org/r/390330
[00:03:36] I'm here
[00:03:58] (PS2) Dzahn: dbstore: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/398390 (https://phabricator.wikimedia.org/T177225)
[00:04:48] I can do the SWAT
[00:06:14] (PS2) Catrope: Set category collation for sewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396381 (https://phabricator.wikimedia.org/T181503) (owner: Jon Harald Søby)
[00:06:21] (CR) Catrope: [C: 2] Set category collation for sewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396381 (https://phabricator.wikimedia.org/T181503) (owner: Jon Harald Søby)
[00:06:51] RoanKattouw, this one depends on the one MatmaRex added (no. 3 in his list)
[00:06:58] Yup I just noticed
[00:06:59] (CR) Dzahn: [C: 2] dbstore: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/398390 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[00:07:01] (Y)
[00:07:03] Doing that one now
[00:07:44] (Merged) jenkins-bot: Set category collation for sewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396381 (https://phabricator.wikimedia.org/T181503) (owner: Jon Harald Søby)
[00:08:04] RoanKattouw: i'm not sure if we need the wmf.11 ones. it's not live anywhere
[00:08:06] All wikis are on wmf.12 right now though so it's safe to deploy the config change now
[00:08:10] (CR) jenkins-bot: Set category collation for sewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/396381 (https://phabricator.wikimedia.org/T181503) (owner: Jon Harald Søby)
[00:08:20] RoanKattouw: but no_justification won't tell me straight whether he's about to revert to that or not
[00:08:22] MatmaRex: I'll do them anyway for paranoia, this is the last SWAT for a week or two
[00:08:35] So just in case there's an emergency revert over Christmas, I want to be safe and have those patches in
[00:08:39] yeah
[00:08:47] sounds smart
[00:10:12] The reason he won't tell you is because he doesn't know
[00:10:16] Jhs: Is there a category on sewiki where we can see the difference between the new and old collations?
[00:10:22] If the shit hits the fan in a few hours... it'll get reverted
[00:11:17] RoanKattouw, https://se.wikipedia.org/wiki/Kategoriija:S%C3%A1megielaid_alfabehta
[00:11:29] Ha thanks
[00:11:42] What is the correct result for that one?
[00:11:49] Á between A and B?
[00:11:53] Special letters immediately after the letters they look like
[00:11:58] OK cool
[00:13:04] MatmaRex: You asked if I was planning to revert. "No, I don't. But I can't promise that" (as in, if things keep breaking, I could, but I'm not planning to). But I guess that wasn't clear?
[00:13:13] You then asked if it would be a good idea to deploy to wmf.11 as well
[00:13:15] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Set category collation for sewiki to uppercase-se (duration: 00m 57s)
[00:13:16] To which I said yes
[00:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:31] !log Running updateCollation.php on sewiki
[00:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:43] Er, "It certainly can't hurt"
[00:13:47] Which is basically a yes
[00:13:51] basically
[00:14:13] !log updateCollation.php finished on sewiki (T181503)
[00:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:22] T181503: Add proper category collation for the Northern Sami Wikipedia - https://phabricator.wikimedia.org/T181503
[00:14:31] Alright, https://se.wikipedia.org/wiki/Kategoriija:S%C3%A1megielaid_alfabehta looks good to me now
[00:14:42] RoanKattouw, yup, to me too
[00:15:18] thank you very much, and esp. to you too, MatmaRex! :)
[00:15:31] anytime. collations are fun
[00:16:21] indeed! and so much more complicated than what i thought they would be :P
[00:17:42] RoanKattouw: if we have time left over, how do you feel about merging and backporting https://gerrit.wikimedia.org/r/#/c/397974/ too?
[00:17:55] I was just going to ask you exactly that question
[00:18:12] So let's do it :)
[00:18:37] thanks :D
[00:25:06] !log catrope@tin Synchronized php-1.31.0-wmf.12/resources/lib/oojs-ui/oojs-ui-core.js: Backport OOjs UI fix for T182359, T182395 (duration: 00m 57s)
[00:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:16] T182359: [wmf.11] "Uncaught TypeError: Cannot read property 'css' of undefined" when topics are sorted on SD boards - https://phabricator.wikimedia.org/T182359
[00:25:16] T182395: mwext-qunit-jessie Jenkins job fails for MultimediaViewer (Share/Embed button not working) - https://phabricator.wikimedia.org/T182395
[00:29:15] live everywhere?
[00:29:37] bar caching
[00:29:41] (seems to be)
[00:29:47] Yes it is
[00:30:06] those annoying 5 minutes where you're not sure if it's just caching or if everything's broken
[00:30:06] Still waiting for the others to go through Jenkins
[00:30:21] but looks fixed now :)
[00:31:02] I did test it on mwdebug1002 beforehand and it worked :)
[00:33:08] !log catrope@tin Synchronized php-1.31.0-wmf.12/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.ArticleTargetEvents.js: Track editor mode on save events (T182610) (duration: 00m 56s)
[00:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:20] T182610: Implement data collection for performance metrics - https://phabricator.wikimedia.org/T182610
[00:34:22] RoanKattouw: any problems with ORES filters? I do not see a single record on RC marked with ORES scores in any wiki
[00:34:37] Lemm echeck
[00:34:40] (PS1) EBernhardson: Dont discount file searches on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/398394
[00:35:05] I can confirm what you see
[00:35:13] Operations, Mail, Toolforge, Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812#3839078 (bd808) ``` $ exim4 -bt security@tools.wmflabs.org ... bdavis@wikimedia.org <-- admin.maintainers@tools.wmflabs.org <-- tools.admin@tools...
[00:35:44] RoanKattouw: sigh
[00:36:09] Hmm
[00:36:22] The most recent entry in the ores_classification table is 2017-12-15 00:35:33
[00:36:28] So that's still updating
[00:37:10] The data is in the DB but somehow the ORES extension isn't picking it up
[00:37:20] I suspect that Amir's recent change might be the culprit
[00:37:56] RoanKattouw: ok
[00:39:04] (PS13) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[00:39:08] Whoa, https://gerrit.wikimedia.org/r/#/c/398383/ looks super suspicious
[00:39:35] (CR) jerkins-bot: [V: -1] [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz)
[00:40:00] etonkovidova: Is there a task?
[00:40:11] RoanKattouw: not yet
[00:40:14] I personally suspect https://gerrit.wikimedia.org/r/#/c/395811/ but there were other changes too
[00:42:26] Now running a test request with a "very likely good" filter on mwdebug1002 to gather logs in logstash and see what the SQL query is
[00:43:07] RoanKattouw: https://phabricator.wikimedia.org/T182936
[00:43:42] Thanks
[00:44:13] !log catrope@tin Synchronized php-1.31.0-wmf.12/resources/src/mediawiki.rcfilters/ui/mw.rcfilters.ui.MenuSelectWidget.js: T182711 (duration: 00m 56s)
[00:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:23] T182711: Recent changes: Filter menu opens upwards - https://phabricator.wikimedia.org/T182711
[00:47:28] OK I found the culprit
[00:53:31] (PS1) EddieGP: Move wikipedia.org to www.wikipedia.org vhost [puppet] - https://gerrit.wikimedia.org/r/398396
[00:54:43] (PS1) Dzahn: external storage codfw: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/398398 (https://phabricator.wikimedia.org/T177225)
[00:56:46] (PS2) Catrope: Don't discount file searches on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/398394 (owner: EBernhardson)
[00:56:48] (PS3) Catrope: Don't discount file searches on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/398394 (owner: EBernhardson)
[00:56:52] (CR) Catrope: [C: 2] Don't discount file searches on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/398394 (owner: EBernhardson)
[00:58:06] (Merged) jenkins-bot: Don't discount file searches on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/398394 (owner: EBernhardson)
[00:58:15] (CR) jenkins-bot: Don't discount file searches on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/398394 (owner: EBernhardson)
[01:03:58] Operations, Beta-Cluster-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1526857 (Dzahn) Can this ticket be made public nowadays? There is now T182927 and it seemed like a duplicate.
[01:04:03] (PS1) EddieGP: beta: Combine meta & commons vhosts [puppet] - https://gerrit.wikimedia.org/r/398399
[01:04:36] mutante: Ahm, if it notifies here the task is probably public, right?
[01:06:36] RoanKattouw: i edited the wrong ticket.. amended
[01:06:43] there is another one that isnt
[01:06:52] OK, so who wants to review https://gerrit.wikimedia.org/r/#/c/398397/3 for immediate deployment?
[01:07:32] RoanKattouw: i can look
[01:07:34] Thanks
[01:07:50] I'd ask Amir but he's probably asleep
[01:08:44] (PS1) Chad: Get rid of clearly unloved refresh-dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/398400
[01:08:50] Reedy: Hehe ^^
[01:10:13] no_justification: BTW here's why that ORES thing happened earlier: https://gerrit.wikimedia.org/r/#/c/398397/3
[01:10:50] Turns out that notice you and Adam squashed was a sign of something much worse, which caused basically all UI display of ORES scores to break
[01:11:12] MatmaRex: Thanks for your +2
[01:11:39] !log catrope@tin Synchronized php-1.31.0-wmf.11: Pulling in today's cherry-picks into wmf.11 too (duration: 10m 49s)
[01:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:27] RoanKattouw: well our fix will silence the notice if anyone breaks the join again hehe
[01:14:35] lol yup
[01:14:40] Thankfully etonkovidova noticed
[01:14:51] You can silence your logs but you can't silence our QA people ;)
[01:15:34] RoanKattouw: that's right!
[01:18:49] RoanKattouw: Thanks for keeping our QA people happy!
[01:19:02] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Don't discount file searches on commonswiki (duration: 00m 57s)
[01:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:34] (PS2) Dzahn: external storage codfw: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/398398 (https://phabricator.wikimedia.org/T177225)
[01:21:51] (CR) Dzahn: [C: 2] external storage codfw: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/398398 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[01:22:12] etonkovidova, awight: Fix rolling out now
[01:22:29] etonkovidova: Thanks so much for repeatedly bugging me about this, this was the last opportunity to fix it before the holidays
[01:22:37] So I'm glad we managed to get it in just in time
[01:22:46] !log Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds
[01:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:55] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
[01:23:05] !log catrope@tin Synchronized php-1.31.0-wmf.12/extensions/ORES: Fix broken join conditions (T182936) (duration: 00m 57s)
[01:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:12] T182936: [wmf.12 -regression] ORES filters do not work - https://phabricator.wikimedia.org/T182936
[01:26:45] OK, that was a fun end-of-year deployment
[01:26:57] I gotta run now, had wanted to leave the office over an hour ago
[01:27:00] RoanKattouw: the fix works - thx
[01:27:12] (CR) Dzahn: "on es2004 and es2001 i had to manually kill the gmond process and run puppet again. on the other ones i didn't and it worked by itself lik" [puppet] - https://gerrit.wikimedia.org/r/398398 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[01:38:40] (PS14) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[01:39:13] (CR) jerkins-bot: [V: -1] [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz)
[01:40:41] Operations, CirrusSearch, Discovery, MediaWiki-JobQueue, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3839293 (Krinkle)
[01:59:39] (PS1) Krinkle: build: Remove phplint --ignore-fails hack for symlink support [mediawiki-config] - https://gerrit.wikimedia.org/r/398408
[02:00:42] (Abandoned) Krinkle: build: Remove phplint --ignore-fails hack for symlink support [mediawiki-config] - https://gerrit.wikimedia.org/r/398408 (owner: Krinkle)
[02:08:40] (CR) Krinkle: "Hm.. given it's broken I suppose removal should be trivial. On the other hand, if not verifiable by re-generation being dirty or no-op, wh" [mediawiki-config] - https://gerrit.wikimedia.org/r/398400 (owner: Chad)
[02:34:30] (PS1) Dzahn: mysql codfw: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/398413 (https://phabricator.wikimedia.org/T177225)
[02:49:31] (PS2) Dzahn: mysql codfw: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/398413 (https://phabricator.wikimedia.org/T177225)
[02:50:23] (CR) Dzahn: [C: 2] mysql codfw: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/398413 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[03:01:15] !log db2016 thru db2019 - had to manually kill gmond process to decom ganglia, other db codfw hosts: didnt need it | running puppet on db205* and others in codfw to remove all ganglia (T177225)
[03:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:29] T177225: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225
[04:53:54] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table dewiki.archive: Cant find record in archive, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001641, end_log_pos 349362942
[05:18:23] PROBLEM - HHVM rendering on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:19:14] RECOVERY - HHVM rendering on mw2133 is OK: HTTP OK: HTTP/1.1 200 OK - 80385 bytes in 0.311 second response time
[05:45:29] (Draft2) Jayprakash12345: Enable Sandbox Extension at Atikamekw Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/398417
[05:46:41] (PS3) Jayprakash12345: Enable Sandbox Extension at Atikamekw Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/398417 (https://phabricator.wikimedia.org/T182798)
[06:20:20] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1109 and db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/398419
[06:20:25] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1109 and db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/398419
[06:22:48] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1109 and db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/398419 (owner: Marostegui)
[06:24:15] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1109 and db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/398419 (owner: Marostegui)
[06:26:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 and db1109 - T161294 (duration: 01m 16s)
[06:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:30] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294
[06:26:46] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1109 and db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/398419 (owner: Marostegui)
[06:29:40] !log Fix dbstore1001 s5 replication
[06:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:13] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:36:33] PROBLEM - HHVM rendering on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:37:23] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 80307 bytes in 0.315 second response time
[06:43:55] !log Deploy schema change on dbstore1001 (s1) - T174569
[06:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:05] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:45:02] (PS1) Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/398421 (https://phabricator.wikimedia.org/T174569)
[06:46:46] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/398421 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:48:16] (Merged) jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/398421 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:48:27] (CR) jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/398421 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:49:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1083 - T174569 (duration: 00m 56s)
[06:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:41] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:50:30] !log Deploy schema change on db1083 - T174569
[06:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:14] PROBLEM - nova-compute process on labvirt1012 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[07:32:14] RECOVERY - nova-compute process on labvirt1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[08:15:11] (PS1) Muehlenhoff: Fix account date for pnorman [puppet] - https://gerrit.wikimedia.org/r/398424
[08:15:47] (PS31) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[08:17:04] (PS1) Marostegui: mariadb: Remove db1034 [puppet] - https://gerrit.wikimedia.org/r/398425 (https://phabricator.wikimedia.org/T182556)
[08:17:28] (CR) Muehlenhoff: [C: 2] Fix account date for pnorman [puppet] - https://gerrit.wikimedia.org/r/398424 (owner: Muehlenhoff)
[08:17:40] (CR) jerkins-bot: [V: -1] mariadb: Remove db1034 [puppet] - https://gerrit.wikimedia.org/r/398425 (https://phabricator.wikimedia.org/T182556) (owner: Marostegui)
[08:19:53] (CR) Muehlenhoff: [V: 2 C: 2] Add Debianisation for prometheus-rabbitmq-exporter [debs/prometheus-rabbitmq-exporter] - https://gerrit.wikimedia.org/r/398033 (owner: Muehlenhoff)
[08:19:54] (PS2) Muehlenhoff: Add Debianisation for prometheus-rabbitmq-exporter [debs/prometheus-rabbitmq-exporter] - https://gerrit.wikimedia.org/r/398033
[08:19:58] (PS2) Marostegui: mariadb: Remove db1034 [puppet] - https://gerrit.wikimedia.org/r/398425 (https://phabricator.wikimedia.org/T182556)
[08:20:00] (CR) Muehlenhoff: [V: 2 C: 2] Add Debianisation for prometheus-rabbitmq-exporter [debs/prometheus-rabbitmq-exporter] - https://gerrit.wikimedia.org/r/398033 (owner: Muehlenhoff)
[08:21:05] (PS1) Marostegui: s7.hosts: Remove db1034 [software] - https://gerrit.wikimedia.org/r/398426 (https://phabricator.wikimedia.org/T182556)
[08:21:14] (PS3) Marostegui: mariadb: Remove db1034 [puppet] - https://gerrit.wikimedia.org/r/398425 (https://phabricator.wikimedia.org/T182556)
[08:22:27] (CR) Marostegui: [C: 2] mariadb: Remove db1034 [puppet] - https://gerrit.wikimedia.org/r/398425 (https://phabricator.wikimedia.org/T182556) (owner: Marostegui)
[08:24:26] (CR) Marostegui: [C: 2] s7.hosts: Remove db1034 [software] - https://gerrit.wikimedia.org/r/398426 (https://phabricator.wikimedia.org/T182556) (owner: Marostegui)
[08:24:31] (PS1) Marostegui: db1034.yaml: Remove file [puppet] - https://gerrit.wikimedia.org/r/398427 (https://phabricator.wikimedia.org/T182556)
[08:24:48] Operations, ops-eqiad, Analytics: Decommission kafka1018 - https://phabricator.wikimedia.org/T182955#3839641 (elukey)
[08:25:10] (Merged) jenkins-bot: s7.hosts: Remove db1034 [software] - https://gerrit.wikimedia.org/r/398426 (https://phabricator.wikimedia.org/T182556) (owner: Marostegui)
[08:25:21] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review, User-Elukey: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3792843 (elukey) kafka1023 is now fully productionized and catching up with the missing partitions. Opened https://phabricator.wikimedia.org/T182955 to d...
[08:25:23] (CR) Marostegui: [C: 2] db1034.yaml: Remove file [puppet] - https://gerrit.wikimedia.org/r/398427 (https://phabricator.wikimedia.org/T182556) (owner: Marostegui)
[08:26:13] !log Remove db1034 from tendril - T182556
[08:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:26] T182556: Decommission db1034 - https://phabricator.wikimedia.org/T182556
[08:28:34] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:35] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:43] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:43] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:43] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:54] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:54] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100%
[08:29:03] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[08:29:04] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms
[08:29:04] RECOVERY - Host bohrium is UP: PING WARNING - Packet loss = 64%, RTA = 1.34 ms
[08:29:04] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms
[08:29:13] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms
[08:29:13] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms
[08:30:53] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 2.34 ms
[08:31:03] PROBLEM - spamassassin on fermium is CRITICAL: PROCS CRITICAL: 0 processes with args spamd
[08:31:53] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 305770 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:32:03] RECOVERY - spamassassin on fermium is OK: PROCS OK: 3 processes with args spamd
[08:32:04] PROBLEM - etc request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 273245 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:32:53] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 5789 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:33:05] RECOVERY - etc request latencies on argon is OK: OK - etcd_request_latencies is 4398 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:37:39] (PS2) Elukey: base: fix dependency relationship [puppet] - https://gerrit.wikimedia.org/r/398303 (https://phabricator.wikimedia.org/T182702) (owner: Volans)
[08:38:24] Operations, DBA, Patch-For-Review: Decommission db1034 - https://phabricator.wikimedia.org/T182556#3839663 (Marostegui)
[08:38:44] !log Stop MySQL on db1034 to decommission it - T182556
[08:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:57] T182556: Decommission db1034 - https://phabricator.wikimedia.org/T182556
[08:42:44] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:42:44] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:42:53] PROBLEM - Host ganeti1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:42:56] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1034 - https://phabricator.wikimedia.org/T182556#3839665 (Marostegui) a:Marostegui>Cmjohnson @Cmjohnson this host is fully ready to be decommissioned
[08:43:04] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:43:04] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:43:04] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:43:04] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100%
[08:43:04] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100%
[08:43:04] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:43:04] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:43:13] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:04] <_joe_> wat?
[08:44:08] <_joe_> oh ganeti
[08:44:17] <_joe_> shit this is happening more and more
[08:45:02] dysprosium doesn't have the ganeti comment in the DNS and made me worry for a second
[08:47:11] I'm powercycling ganeti1005 via mgmt
[08:47:21] ah was about to do it
[08:47:39] a ton of printks in the console as happened the last time
[08:48:12] (CR) Elukey: "recheck" [puppet] - https://gerrit.wikimedia.org/r/397765 (owner: Elukey)
[08:48:24] (PS9) Elukey: role::cache::canary: add a test Varnishkafka instance [puppet] - https://gerrit.wikimedia.org/r/397765
[08:50:02] !log powercycling ganeti1005
[08:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:13] RECOVERY - Host ganeti1005 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[08:50:33] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 4.66 ms
[08:50:43] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 6.74 ms
[08:50:43] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 5.80 ms
[08:50:53] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 6.50 ms
[08:50:53] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 6.56 ms
[08:51:03] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 7.96 ms
[08:51:03] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 7.93 ms
[08:51:03] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 8.32 ms
[08:51:03] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 8.22 ms
[08:51:23] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 6.55 ms
[08:51:42] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3839678 (MoritzMuehlenhoff) Happened again on ganeti1005.
[08:52:22] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3839679 (MoritzMuehlenhoff) >>! In T181121#3831708, @Cmjohnson wrote: > I updated the Bios Version to 2.6 Which host had the BIOS updated? All of 1005-1008 or just one of them?
[08:53:23] PROBLEM - Check systemd state on bohrium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:54:39] mysql service failed, lovely
[08:55:37] InnoDB: Your database may be corrupt or you may have copied the InnoDB
[08:55:42] * elukey cries in a corner
[08:59:21] last backup is two days ago
[09:00:12] (PS1) Muehlenhoff: Add rabbitmq-exporter to Prometheus scraper config for labmon1001 [puppet] - https://gerrit.wikimedia.org/r/398428
[09:05:10] (CR) Muehlenhoff: [V: 2 C: 2] Add Prometheus exporter for WDQS Updater [debs/prometheus-wdqs-updater-exporter] - https://gerrit.wikimedia.org/r/398072 (owner: Muehlenhoff)
[09:05:21] elukey: ugh, I'm not an expert but any chance the existing copy could be recovered?
[09:05:24] (PS5) Muehlenhoff: Add Prometheus RabbitMQ exporter [puppet] - https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802)
[09:05:31] elukey: can I help?
[09:05:38] (PS2) Muehlenhoff: Add Prometheus exporter to WDQS servers [puppet] - https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773)
[09:05:45] marostegui: if you have mercy I'd be really happy :D
[09:06:03] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/398429
[09:06:11] sure, give me a sec to finish up some stuff first
[09:06:20] (CR) Muehlenhoff: [V: 2 C: 2] Add Debianisation for prometheus-blazegraph-exporter [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/398277 (owner: Muehlenhoff)
[09:06:26] godog: it happened also a while ago after a ganeti crash, we didn't manage to to much :(
[09:06:28] (PS2) Muehlenhoff: Add Debianisation for prometheus-blazegraph-exporter [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/398277
[09:06:33] (CR) Muehlenhoff: [V: 2 C: 2] Add Debianisation for prometheus-blazegraph-exporter [debs/prometheus-blazegraph-exporter] - https://gerrit.wikimedia.org/r/398277 (owner: Muehlenhoff)
[09:07:23] elukey: yes, it looks quite broken, you might need to restore the backup indeed
[09:07:48] :(
[09:08:15] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/398429 (owner: Marostegui)
[09:09:32] marostegui: what did we do last time? innodb files wipe, start, import?
[09:09:38] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/398429 (owner: Marostegui)
[09:09:49] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/398429 (owner: Marostegui)
[09:10:14] elukey: yeah, let me do that
[09:10:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1083 - T174569 (duration: 00m 56s)
[09:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:57] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[09:11:25] (CR) Urbanecm: [C: 1] "From the code side: Okay. From the other side: It makes no sense to me, it cannot block somebody from their work but also not help anyone." [mediawiki-config] - https://gerrit.wikimedia.org/r/397768 (https://phabricator.wikimedia.org/T182541) (owner: EddieGP)
[09:12:21] marostegui: nono I didn't mean "you do it" :D
[09:12:29] I can try to, so I'll learn something
[09:13:07] let's do it together as there are a few things you need to keep in mind ;)
[09:14:25] (CR) TerraCodes: [C: 1] Restrict sending mails to new users [mediawiki-config] - https://gerrit.wikimedia.org/r/397768 (https://phabricator.wikimedia.org/T182541) (owner: EddieGP)
[09:17:24] RECOVERY - Check systemd state on bohrium is OK: OK - running: The system is fully operational
[09:20:24] PROBLEM - Check systemd state on bohrium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:21:55] (PS1) Marostegui: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/398430 (https://phabricator.wikimedia.org/T174569)
[09:22:46] Operations, hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#3839715 (jcrespo) This is provisionally a goal, and one that should be doable early next quarter.
[09:24:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398430 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:25:35] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398430 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:26:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1080 - T174569 (duration: 00m 56s) [09:26:51] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398430 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [09:26:54] !log Deploy schema change on db1080 - T174569 [09:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:54] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [09:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:40] 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3839728 (10jcrespo) Do you have a set of instructions you run, so I can reproduce and I can check TLS is effectively enabled w... 
[09:36:14] (03PS1) 10Muehlenhoff: * Rename metrics prefix to use pdns_rec instead of pdns_ [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/398431 [09:37:35] (03CR) 10Filippo Giunchedi: Add Prometheus exporter for Blazegraph (035 comments) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398272 (https://phabricator.wikimedia.org/T182857) (owner: 10Muehlenhoff) [09:38:44] (03CR) 10Filippo Giunchedi: [C: 031] * Rename metrics prefix to use pdns_rec instead of pdns_ [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/398431 (owner: 10Muehlenhoff) [09:39:40] (03CR) 10Filippo Giunchedi: [C: 031] profile::hadoop::prometheus_jmx_exporter: blacklist unwanted Mbeans [puppet] - 10https://gerrit.wikimedia.org/r/398282 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [09:41:11] (03CR) 10Muehlenhoff: [V: 032 C: 032] * Rename metrics prefix to use pdns_rec instead of pdns_ [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/398431 (owner: 10Muehlenhoff) [09:42:24] RECOVERY - Check systemd state on bohrium is OK: OK - running: The system is fully operational [09:44:08] (03CR) 10Filippo Giunchedi: Add rabbitmq-exporter to Prometheus scraper config for labmon1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398428 (owner: 10Muehlenhoff) [09:45:49] (03CR) 10Filippo Giunchedi: Add Prometheus RabbitMQ exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [09:50:44] !log restore piwik database on bohrium after mysql corruption - piwik disabled [09:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:09] (03CR) 10Elukey: [C: 032] role::cache::canary: add a test Varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/397765 (owner: 10Elukey) [09:55:23] (03PS1) 10Marostegui: mariadb: Add db1111 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/398432 
(https://phabricator.wikimedia.org/T180788) [09:55:50] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3839764 (10Marostegui) [09:57:35] (03CR) 10Muehlenhoff: Add Prometheus RabbitMQ exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [09:57:47] (03PS6) 10Muehlenhoff: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) [09:58:56] (03PS2) 10Muehlenhoff: Add rabbitmq-exporter to Prometheus scraper config for labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/398428 [09:59:00] (03CR) 10Muehlenhoff: Add rabbitmq-exporter to Prometheus scraper config for labmon1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398428 (owner: 10Muehlenhoff) [09:59:37] (03PS1) 10Elukey: profile::cache::kafka::webrequest::jumbo: fix private key path in private repo [puppet] - 10https://gerrit.wikimedia.org/r/398433 [09:59:51] (03CR) 10Elukey: [C: 032] profile::cache::kafka::webrequest::jumbo: fix private key path in private repo [puppet] - 10https://gerrit.wikimedia.org/r/398433 (owner: 10Elukey) [10:00:17] ah I might get a -1 for the commit msg too long [10:02:40] (03CR) 10Jcrespo: "This is ok for now, but I am not sure it will be in the long term- notifications disables is supposed to be a transitioning state, and it " [puppet] - 10https://gerrit.wikimedia.org/r/398432 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [10:03:13] (03PS1) 10Elukey: profile::cache::kafka::webrequest::jumbo: fix require parameter [puppet] - 10https://gerrit.wikimedia.org/r/398434 [10:04:08] (03CR) 10Jcrespo: [C: 032] Revert "site.pp: Failover labsdb1011 to labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/398246 (owner: 10Marostegui) [10:04:22] (03PS3) 10Jcrespo: Revert "site.pp: Failover labsdb1011 to labsdb1010" [puppet] - 
10https://gerrit.wikimedia.org/r/398246 (owner: 10Marostegui) [10:04:37] (03CR) 10Elukey: [C: 032] profile::cache::kafka::webrequest::jumbo: fix require parameter [puppet] - 10https://gerrit.wikimedia.org/r/398434 (owner: 10Elukey) [10:04:51] mobrovac: I was looking at the trendingedits patch, any reason why we'd do it today as opposed to monday? [10:05:41] (03CR) 10Marostegui: "> This is ok for now, but I am not sure it will be in the long term-" [puppet] - 10https://gerrit.wikimedia.org/r/398432 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [10:10:04] (03CR) 10Jcrespo: "It is not only that- technically, this is not part of core- so it should not be shown as part of the traffic there. There are several issu" [puppet] - 10https://gerrit.wikimedia.org/r/398432 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [10:12:18] godog: yeah, the service is not reachable from the public and i think it's better to stop/disable it today so that it doesn't fail during the weekend [10:13:15] mobrovac: ugh, trendingedit fails by itself even without traffic? [10:14:39] yeah, it never failed because of traffic, but because its offsets in kafka would disappear heh [10:14:44] (03PS4) 10Jcrespo: Revert "site.pp: Failover labsdb1011 to labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/398246 (owner: 10Marostegui) [10:14:47] !log uploaded prometheus-dns-rec-exporter 0.3 to apt.wikimedia.org [10:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:05] godog: perhaps just manually masking it would be enough? [10:15:14] oh no, we cant do that [10:15:25] puppet would start failing then [10:19:06] godog: i have to go run an errand now, so let's reconvene in the afternoon? but i'd really like to go with that patch as it's low-risk and easily reverted/fixable in case of problems, especially given that it affects a service that will no longer exist in a couple of days [10:20:01] mobrovac: ok ttyl! 
yeah I was thinking we'll need to silence pybal etc too when the service is stopped etc [10:21:07] gr8! thnx godog [10:28:52] (03PS2) 10Marostegui: mariadb: Add db1111 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/398432 (https://phabricator.wikimedia.org/T180788) [10:31:17] !log rolling restart of yarn nodemanagers on an103* to apply new config - T182276 [10:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:27] T182276: Enable more accurate smaps based RSS tracking by yarn nodemanager - https://phabricator.wikimedia.org/T182276 [10:35:23] (03CR) 10Marostegui: [C: 032] mariadb: Add db1111 to s4 [puppet] - 10https://gerrit.wikimedia.org/r/398432 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [10:41:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398436 (https://phabricator.wikimedia.org/T180788) [10:41:59] (03PS2) 10Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398436 (https://phabricator.wikimedia.org/T180788) [10:45:27] (03PS3) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398436 (https://phabricator.wikimedia.org/T180788) [10:47:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398436 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [10:48:29] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398436 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [10:48:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398436 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [10:49:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1084 to clone db1111 - T180788 (duration: 00m 56s) [10:50:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:04] T180788: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788 [10:50:13] (03CR) 10Gehel: Add Prometheus exporter to WDQS servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398073 (https://phabricator.wikimedia.org/T182773) (owner: 10Muehlenhoff) [10:50:53] !log Stop MySQL on db1084 to clone db1111 - T180788 [10:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:04] (03PS1) 10Elukey: profile::cache::kafka::webrequest::jumbo: fix TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/398437 [10:51:40] (03CR) 10Elukey: [C: 032] profile::cache::kafka::webrequest::jumbo: fix TLS configuration [puppet] - 10https://gerrit.wikimedia.org/r/398437 (owner: 10Elukey) [10:54:33] (03PS1) 10Elukey: profile::cache::kafka::webrequest::jumbo: fix TLS config [puppet] - 10https://gerrit.wikimedia.org/r/398439 [10:54:56] yes I know, difficult Friday [10:55:08] (03CR) 10Elukey: [C: 032] profile::cache::kafka::webrequest::jumbo: fix TLS config [puppet] - 10https://gerrit.wikimedia.org/r/398439 (owner: 10Elukey) [10:55:14] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398440 [10:55:17] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398440 [10:57:01] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398440 (owner: 10Marostegui) [10:58:19] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398440 (owner: 10Marostegui) [10:58:29] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398440 (owner: 10Marostegui) [10:58:48] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for 
PowerDNS - https://phabricator.wikimedia.org/T182970#3840003 (10MoritzMuehlenhoff) p:05Triage>03High [10:59:04] !log reloading dbproxy1011 [10:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1080 - T174569 (duration: 00m 56s) [10:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:37] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [11:05:21] (03PS1) 10Ema: prometheus: add xcps aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/398441 (https://phabricator.wikimedia.org/T177199) [11:10:17] (03CR) 10Filippo Giunchedi: "The global instance has a set of regexp of metrics to be pulled from site-local Prometheus, likely it'll need some tuning too" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398441 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [11:16:34] (03PS2) 10Ema: prometheus: add xcps aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/398441 (https://phabricator.wikimedia.org/T177199) [11:18:07] (03CR) 10Ema: prometheus: add xcps aggregation rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398441 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [11:24:46] (03PS1) 10Muehlenhoff: Add .gitreview file [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/398442 [11:27:10] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add .gitreview file [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/398442 (owner: 10Muehlenhoff) [11:27:25] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3840062 (10ema) [11:40:26] (03PS3) 10Ema: prometheus: add xcps aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/398441 (https://phabricator.wikimedia.org/T177199) 
[11:42:12] (03PS1) 10Gehel: elasticsearch: move prometheus classes to conform with best practice [puppet] - 10https://gerrit.wikimedia.org/r/398444 [11:42:30] moritzm: ^ cleanup according to discussion [11:44:07] ok, will have a look in a bit [11:48:34] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: add xcps aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/398441 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [11:54:55] (03CR) 10Ema: [C: 032] prometheus: add xcps aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/398441 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [12:00:37] (03PS1) 10Elukey: profile::cache::kafka::webrequest::jumbo: force TLS kafka broker port [puppet] - 10https://gerrit.wikimedia.org/r/398446 [12:04:29] (03PS1) 10Muehlenhoff: Add a Prometheus exporter for PowerDNS [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/398447 (https://phabricator.wikimedia.org/T182970) [12:05:45] (03PS1) 10Mforns: Add EL Edit's schema action.loaded.timing field to purging whitelist [puppet] - 10https://gerrit.wikimedia.org/r/398448 (https://phabricator.wikimedia.org/T182978) [12:07:27] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3230583 (10MoritzMuehlenhoff) These are still showing up in https://servermon.wikimedia.org/hosts/, probably "puppet node deactivate" is missing [12:09:45] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3787781 (10MoritzMuehlenhoff) Still showing in servermon, also seems like a missing "puppet node deactivate" [12:10:33] 10Operations, 10ops-eqiad, 10DBA, 10Phabricator, 10hardware-requests: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3599964 (10MoritzMuehlenhoff) Still showing in servermon, also seems like a missing "puppet
node deactivate" [12:13:21] (03PS2) 10Elukey: Add EL Edit's schema action.loaded.timing field to purging whitelist [puppet] - 10https://gerrit.wikimedia.org/r/398448 (https://phabricator.wikimedia.org/T182978) (owner: 10Mforns) [12:13:55] (03CR) 10Elukey: [C: 032] Add EL Edit's schema action.loaded.timing field to purging whitelist [puppet] - 10https://gerrit.wikimedia.org/r/398448 (https://phabricator.wikimedia.org/T182978) (owner: 10Mforns) [12:15:42] (03CR) 10Elukey: [C: 032] profile::cache::kafka::webrequest::jumbo: force TLS kafka broker port [puppet] - 10https://gerrit.wikimedia.org/r/398446 (owner: 10Elukey) [12:15:49] (03PS2) 10Elukey: profile::cache::kafka::webrequest::jumbo: force TLS kafka broker port [puppet] - 10https://gerrit.wikimedia.org/r/398446 [12:17:54] (03CR) 10Elukey: "Andrew I take full responsibility for this horrible hack, but I wanted to get us unblocked for testing the new vk instance while either:" [puppet] - 10https://gerrit.wikimedia.org/r/398446 (owner: 10Elukey) [12:21:23] (03PS1) 10Jcrespo: [WIP] Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [12:21:52] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [12:22:44] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4028 is CRITICAL: connect to address 10.128.0.128 and port 3128: Connection refused [12:23:40] (03PS2) 10Jcrespo: [WIP] Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [12:23:44] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4028 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.157 second response time [12:24:06] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Update mariadb::proxy to the latest style 
and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [12:28:31] (03PS3) 10Jcrespo: [WIP] Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [12:32:44] (03PS1) 10Marostegui: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398451 (https://phabricator.wikimedia.org/T180788) [12:33:06] (03CR) 10Marostegui: [C: 04-2] "Server still catching up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398451 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [12:38:54] (03PS2) 10Marostegui: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398451 (https://phabricator.wikimedia.org/T180788) [12:40:59] (03PS3) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [12:41:31] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [12:44:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398451 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [12:45:42] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398451 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [12:46:57] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1084 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398451 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [12:47:18] !log marostegui@tin Synchronized 
wmf-config/db-eqiad.php: Repool db1084 with low weight - T180788 (duration: 00m 57s) [12:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:33] T180788: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788 [12:47:44] (03PS4) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [12:48:13] (03CR) 10jerkins-bot: [V: 04-1] apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [12:57:35] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [12:58:36] seems esams-related, single peak then recovery [12:58:54] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [12:59:25] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:03:15] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398454 [13:06:55] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:07:19] (03PS1) 10Elukey: profile::cache::kafka::webrequest::jumbo: add graphite metrics [puppet] - 10https://gerrit.wikimedia.org/r/398455 [13:07:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:07:35] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [13:07:56] (03CR) 10Elukey: [C: 032] profile::cache::kafka::webrequest::jumbo: add graphite metrics [puppet] - 10https://gerrit.wikimedia.org/r/398455 (owner: 10Elukey) [13:08:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398454 (owner: 10Marostegui) [13:09:27] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3840331 (10Marostegui) a:05Cmjohnson>03Marostegui db1111 has now commonswiki and eowiki there as requested by @daniel. It is replicating s4 (but as spoke, we will remove replication once...
[13:09:32] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398454 (owner: 10Marostegui) [13:09:44] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398454 (owner: 10Marostegui) [13:10:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1084 (duration: 00m 56s) [13:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:38] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398456 [13:16:50] (03PS1) 10Arturo Borrero Gonzalez: apt: report-pending-upgrades.sh: add verbosity flag [puppet] - 10https://gerrit.wikimedia.org/r/398458 (https://phabricator.wikimedia.org/T181647) [13:19:29] (03CR) 10Rush: [C: 031] apt: report-pending-upgrades.sh: add verbosity flag [puppet] - 10https://gerrit.wikimedia.org/r/398458 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:19:54] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: report-pending-upgrades.sh: add verbosity flag [puppet] - 10https://gerrit.wikimedia.org/r/398458 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [13:20:08] (03PS2) 10Arturo Borrero Gonzalez: apt: report-pending-upgrades.sh: add verbosity flag [puppet] - 10https://gerrit.wikimedia.org/r/398458 (https://phabricator.wikimedia.org/T181647) [13:20:38] !log disable puppet across eqiad lab* things to land a bit of code gracefully [13:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:00] (03CR) 10Rush: [C: 032] openstack: nova/compute/server.pp manage nova shell [puppet] - 10https://gerrit.wikimedia.org/r/398312 (owner: 10Rush) [13:23:12] (03PS6) 10Rush: openstack: nova/compute/server.pp manage nova shell [puppet] - 10https://gerrit.wikimedia.org/r/398312 [13:25:32] (03CR) 10Muehlenhoff: Add 
Prometheus exporter for Blazegraph (035 comments) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398272 (https://phabricator.wikimedia.org/T182857) (owner: 10Muehlenhoff) [13:25:40] (03PS3) 10Muehlenhoff: Add Prometheus exporter for Blazegraph [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/398272 (https://phabricator.wikimedia.org/T182857) [13:29:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398456 (owner: 10Marostegui) [13:29:43] (03PS1) 10Hashar: Sphinx documentation [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398462 [13:30:10] (03CR) 10jerkins-bot: [V: 04-1] Sphinx documentation [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398462 (owner: 10Hashar) [13:30:52] (03PS1) 10Rush: openstack: remove duplicate definition in compute/service [puppet] - 10https://gerrit.wikimedia.org/r/398463 [13:31:23] (03CR) 10Rush: [C: 032] openstack: remove duplicate definition in compute/service [puppet] - 10https://gerrit.wikimedia.org/r/398463 (owner: 10Rush) [13:31:52] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398456 (owner: 10Marostegui) [13:32:07] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398456 (owner: 10Marostegui) [13:33:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1084 and db1081 (duration: 00m 56s) [13:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:39] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398465 [13:45:01] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398465 (owner: 10Marostegui) [13:45:26] (03PS7) 10Muehlenhoff: Add 
Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) [13:46:26] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398465 (owner: 10Marostegui) [13:46:50] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398465 (owner: 10Marostegui) [13:47:25] (03PS1) 10Hashar: From the future [puppet] - 10https://gerrit.wikimedia.org/r/398466 [13:47:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1084 and restore original weight for db1081 (duration: 00m 56s) [13:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:15] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [13:48:37] \o/ [13:48:46] kafka1023 might have finished bootstrapping [13:49:14] (03Abandoned) 10Hashar: From the future [puppet] - 10https://gerrit.wikimedia.org/r/398466 (owner: 10Hashar) [13:49:20] not yet but looking good [13:51:31] (03PS1) 10Marostegui: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398467 [13:52:05] (03PS1) 10Elukey: role::cache::misc: add new vk instance to test TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/398468 [13:52:52] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398467 (owner: 10Marostegui) [13:53:13] (03PS2) 10Elukey: role::cache::misc: add new vk instance to test TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/398468 (https://phabricator.wikimedia.org/T175461) [13:53:24] (03PS2) 10Marostegui: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398467 [13:53:36]
(03CR) 10jerkins-bot: [V: 04-1] role::cache::misc: add new vk instance to test TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/398468 (https://phabricator.wikimedia.org/T175461) (owner: 10Elukey) [13:54:03] ah bug before change id [13:54:31] yeah, jenkins doesn't even care if it is friday! [13:54:48] (03PS3) 10Elukey: role::cache::misc: add new vk instance to test TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/398468 (https://phabricator.wikimedia.org/T175461) [14:00:54] (03PS1) 10Rush: rush: add a helper script for localrun [puppet] - 10https://gerrit.wikimedia.org/r/398470 [14:01:40] (03CR) 10Rush: [C: 032] rush: add a helper script for localrun [puppet] - 10https://gerrit.wikimedia.org/r/398470 (owner: 10Rush) [14:09:04] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/rush/bin/plocal.sh] [14:20:27] 10Operations, 10Goal, 10User-fgiunchedi: Port nutcracker statistics to Prometheus - https://phabricator.wikimedia.org/T181995#3840485 (10fgiunchedi) I tried backporting ruby-mmap2 to jessie, even after disabling the failing tests I'm getting a segmentation fault around mmap values ``` $ prometheus_multiproc... 
[14:24:07] (03PS8) 10Rush: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:24:37] (03CR) 10jerkins-bot: [V: 04-1] Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:25:48] (03PS2) 10Hashar: Sphinx documentation [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398462 [14:27:01] (03PS9) 10Rush: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:27:28] (03CR) 10jerkins-bot: [V: 04-1] Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:28:08] (03PS11) 10Rush: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:28:41] (03CR) 10jerkins-bot: [V: 04-1] Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:31:03] (03PS12) 10Rush: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:31:31] (03CR) 10jerkins-bot: [V: 04-1] Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:32:32] (03PS13) 10Rush: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:33:08] (03CR) 10jerkins-bot: [V: 04-1] Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:34:00] 10th 
times the charm? [14:34:01] geez [14:34:04] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:34:25] (03PS14) 10Rush: Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:35:43] (03CR) 10Rush: "I amended this to take into account our params are per openstack deployment (rabbit_pass) etc." [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:45:27] (03CR) 10Muehlenhoff: Add Prometheus RabbitMQ exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:48:18] (03CR) 10Rush: Add Prometheus RabbitMQ exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:49:28] (03CR) 10Muehlenhoff: [C: 032] Add Prometheus RabbitMQ exporter [puppet] - 10https://gerrit.wikimedia.org/r/398044 (https://phabricator.wikimedia.org/T181802) (owner: 10Muehlenhoff) [14:49:40] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#3840568 (10Gehel) [14:55:26] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:55:34] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:55:57] ^known [14:55:59] (03PS1) 10Rush: openstack: profile/manifests/openstack/base/rabbitmq typo [puppet] - 10https://gerrit.wikimedia.org/r/398478 [14:56:47] (03CR) 10Rush: [C: 032] openstack: profile/manifests/openstack/base/rabbitmq typo [puppet] - 10https://gerrit.wikimedia.org/r/398478 (owner: 10Rush) [14:58:07] looking at dbstore1001 [14:58:18] 10Operations, 10Puppet: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894#3837838 (10Andrew) (update: @herron is in the process of backporting) [14:58:24] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:59:04] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.cu_changes: Can't find record in cu_changes, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db1070-bin.001643, end_log_pos 1047543370 [14:59:15] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [15:01:43] (03PS1) 10Rush: openstack: type for rabbitmq base [puppet] - 10https://gerrit.wikimedia.org/r/398479 [15:02:35] (03CR) 10Rush: [C: 032] openstack: type for rabbitmq base [puppet] - 10https://gerrit.wikimedia.org/r/398479 (owner: 10Rush) [15:05:34] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:07:07] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3840595 (10Jonas) Here is the SSH key. {F11846697} @Smalyshev or @Gehel could you please vouch for me? 
[15:07:19] moritzm: ^ :) thanks for bearing with me [15:09:43] (03PS1) 10Muehlenhoff: Fix srange for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398480 [15:11:11] (03CR) 10Rush: [C: 031] "our use of ipv6 is inconsistent atm" [puppet] - 10https://gerrit.wikimedia.org/r/398480 (owner: 10Muehlenhoff) [15:11:38] (03PS2) 10Muehlenhoff: Fix srange for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398480 [15:11:56] !log reimage labtestvirt2003.codfw.wmnet [15:12:06] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [15:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:17] (03CR) 10Muehlenhoff: [C: 032] Fix srange for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/398480 (owner: 10Muehlenhoff) [15:17:54] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [15:17:57] (03Abandoned) 10Elukey: role::cache::misc: add new vk instance to test TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/398468 (https://phabricator.wikimedia.org/T175461) (owner: 10Elukey) [15:18:09] more ganeti fun? 
[15:18:24] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:18:25] PROBLEM - HHVM rendering on mw2104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:19:15] RECOVERY - HHVM rendering on mw2104 is OK: HTTP OK: HTTP/1.1 200 OK - 80330 bytes in 0.307 second response time [15:24:24] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [15:24:30] 10Operations, 10ops-codfw: Disconnect furud's disk shelves - https://phabricator.wikimedia.org/T181725#3840637 (10Papaul) 05Open>03Resolved complete [15:25:24] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:25:42] (03PS1) 10Rush: prometheus: rabbitmq use @ for vars in templete fulfillment [puppet] - 10https://gerrit.wikimedia.org/r/398482 [15:26:54] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [15:27:02] (03CR) 10Rush: [C: 032] prometheus: rabbitmq use @ for vars in templete fulfillment [puppet] - 10https://gerrit.wikimedia.org/r/398482 (owner: 10Rush) [15:31:45] (03PS1) 10Rush: openstack: set monitor_password as var for rabbit base [puppet] - 10https://gerrit.wikimedia.org/r/398483 [15:32:42] (03CR) 10Rush: [C: 032] openstack: set monitor_password as var for rabbit base [puppet] - 10https://gerrit.wikimedia.org/r/398483 (owner: 10Rush) [15:33:19] (03PS4) 10Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [15:33:49] (03CR) 10jerkins-bot: [V: 04-1] Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:34:51] (03PS5) 10Jcrespo: Update mariadb::proxy to 
the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [15:34:55] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3840663 (10elukey) p:05Triage>03Normal [15:35:52] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3840679 (10akosiaris) >>! In T181121#3839679, @MoritzMuehlenhoff wrote: >>>! In T181121#3831708, @Cmjohnson wrote: >> I updated the Bios Version to 2.6 > > Which host had the BIO... [15:36:24] (03PS1) 10Andrew Bogott: puppet.conf: replace configtimeout [puppet] - 10https://gerrit.wikimedia.org/r/398484 (https://phabricator.wikimedia.org/T182585) [15:36:28] (03PS1) 10Muehlenhoff: Add config file to systemd unit / upstart job [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398485 [15:39:56] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add config file to systemd unit / upstart job [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/398485 (owner: 10Muehlenhoff) [15:44:07] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3840719 (10jcrespo) @akosiaris Manuel and Luca where able to reimage, not sure if in a hacky way or with a good or bad workaround, but they were able to. I think however, there is... [15:44:37] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3840721 (10Gehel) Looking at the `prometheus-jmx-exporter` .deb, it seems to depend on `default-jre` which is `openjdk-7-jre` on Je... 
[15:49:53] (03PS3) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [15:49:55] (03PS1) 10Andrew Bogott: labtest dns: allow hiera to set aliaser_extra_records for labtest [puppet] - 10https://gerrit.wikimedia.org/r/398488 [15:50:17] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3840732 (10MoritzMuehlenhoff) [15:50:20] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Package RabbitMQ exporter for Prometheus and adapt metrics - https://phabricator.wikimedia.org/T181802#3840730 (10MoritzMuehlenhoff) 05Open>03Resolved The exporter has been written, packaged and deployed. [15:50:46] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3650139 (10MoritzMuehlenhoff) [15:50:47] (03PS6) 10Jcrespo: Update mariadb::proxy to the latest style and path locations [puppet] - 10https://gerrit.wikimedia.org/r/398450 (https://phabricator.wikimedia.org/T148507) [15:50:49] (03PS2) 10Andrew Bogott: labtest dns: allow hiera to set aliaser_extra_records for labtest [puppet] - 10https://gerrit.wikimedia.org/r/398488 [15:50:51] (03PS4) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [15:51:39] (03CR) 10Andrew Bogott: [C: 032] labtest dns: allow hiera to set aliaser_extra_records for labtest [puppet] - 10https://gerrit.wikimedia.org/r/398488 (owner: 10Andrew Bogott) [15:52:30] (03CR) 10Jcrespo: "This is ready to be reviewed, sadly there seems to be a breakage on the puppet compiler, so I cannot check it is effectively a noop: https" [puppet] - 10https://gerrit.wikimedia.org/r/398450 
(https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:56:55] (03PS1) 10Jcrespo: labsdb: Switchover labsdb1009 to labsdb1010 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/398489 (https://phabricator.wikimedia.org/T181777) [15:57:35] (03PS2) 10Jcrespo: labsdb: Switchover labsdb1009 to labsdb1010 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/398489 (https://phabricator.wikimedia.org/T181777) [15:58:58] 10Operations, 10Patch-For-Review: Debian Jessie reimage/install ends up in kernel panic with 8.10 netboot image - https://phabricator.wikimedia.org/T182702#3840739 (10elukey) So the reimages now work but the above code changes need to be merged/tested to get the first puppet run and wmf-auto-reimage work prope... [15:59:16] (03PS5) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [15:59:18] (03PS1) 10Andrew Bogott: labtest hiera: move some settings to a different hiera file [puppet] - 10https://gerrit.wikimedia.org/r/398490 [15:59:27] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:01:42] (03CR) 10Andrew Bogott: [C: 032] labtest hiera: move some settings to a different hiera file [puppet] - 10https://gerrit.wikimedia.org/r/398490 (owner: 10Andrew Bogott) [16:01:49] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Remove detail from wbcheckconstraints API response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396311 (https://phabricator.wikimedia.org/T180614) (owner: 10Lucas Werkmeister (WMDE)) [16:02:40] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3840744 (10elukey) [16:03:02] (03PS1) 10Andrew Bogott: Revert "labtest hiera: move some settings to a different hiera file" [puppet] - 10https://gerrit.wikimedia.org/r/398494 [16:03:35] (03CR) 10Andrew Bogott: [C: 032] Revert "labtest hiera: move some settings to a different hiera file" [puppet] - 10https://gerrit.wikimedia.org/r/398494 (owner: 10Andrew Bogott) [16:04:35] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3840757 (10akosiaris) >>! In T181121#3840719, @jcrespo wrote: > I think however, there is no proper upstream fix yet, at least deployed by us. Yup, that's what I am waiting for.... 
[16:05:10] 10Operations: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656#3840758 (10MoritzMuehlenhoff) These are fully rolled out: dput libdbi python-tablib syslinux [16:08:27] (03PS6) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [16:08:29] (03PS1) 10Andrew Bogott: labtest dns: move some hiera settings to the right place, I hope [puppet] - 10https://gerrit.wikimedia.org/r/398498 [16:09:23] (03CR) 10Andrew Bogott: [C: 032] labtest dns: move some hiera settings to the right place, I hope [puppet] - 10https://gerrit.wikimedia.org/r/398498 (owner: 10Andrew Bogott) [16:09:25] (03CR) 10Marostegui: [C: 031] labsdb: Switchover labsdb1009 to labsdb1010 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/398489 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [16:10:32] !log re-enable piwik on bohrium after mysql backup restore [16:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:48] (03CR) 10Nuria: [C: 031] Add clickstream rsync for external visibility [puppet] - 10https://gerrit.wikimedia.org/r/396324 (https://phabricator.wikimedia.org/T175844) (owner: 10Joal) [16:14:58] (03PS5) 10Nuria: Add clickstream rsync for external visibility [puppet] - 10https://gerrit.wikimedia.org/r/396324 (https://phabricator.wikimedia.org/T175844) (owner: 10Joal) [16:16:41] (03CR) 10Elukey: [C: 032] Add clickstream rsync for external visibility [puppet] - 10https://gerrit.wikimedia.org/r/396324 (https://phabricator.wikimedia.org/T175844) (owner: 10Joal) [16:19:34] apergos: --^ :) [16:19:46] (03Abandoned) 10Mobrovac: Trending Edits: Stop and mask the service [puppet] - 10https://gerrit.wikimedia.org/r/398286 (https://phabricator.wikimedia.org/T180384) (owner: 10Mobrovac) [16:21:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/398467 (owner: 10Marostegui) [16:21:25] thanks for the heads up [16:21:40] running the cron manually now just to avoid weird surprises [16:22:41] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398467 (owner: 10Marostegui) [16:22:56] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1084 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398467 (owner: 10Marostegui) [16:23:36] k [16:23:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1084 original weight (duration: 00m 57s) [16:23:54] (03PS7) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [16:23:56] (03PS1) 10Andrew Bogott: labservices: rearrange hiera for recursor::aliaser_extra_records [puppet] - 10https://gerrit.wikimedia.org/r/398499 [16:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:02] PROBLEM - MD RAID on labtestvirt2003 is CRITICAL: Return code of 255 is out of bounds [16:37:11] (03PS2) 10Andrew Bogott: labservices: rearrange hiera for recursor::aliaser_extra_records [puppet] - 10https://gerrit.wikimedia.org/r/398499 [16:37:13] (03PS8) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [16:37:27] (03PS1) 10Joal: Correct typo in Analytics dump HTML page [puppet] - 10https://gerrit.wikimedia.org/r/398502 [16:38:22] PROBLEM - HHVM rendering on mw2208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:38:23] (03PS3) 10Andrew Bogott: labservices: rearrange hiera for recursor::aliaser_extra_records [puppet] - 10https://gerrit.wikimedia.org/r/398499 [16:38:25] (03PS9) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [16:39:12] RECOVERY - HHVM 
rendering on mw2208 is OK: HTTP OK: HTTP/1.1 200 OK - 80444 bytes in 0.307 second response time [16:40:25] (03CR) 10Rush: [C: 031] labservices: rearrange hiera for recursor::aliaser_extra_records [puppet] - 10https://gerrit.wikimedia.org/r/398499 (owner: 10Andrew Bogott) [16:40:50] elukey: --^ if you have a minute [16:41:22] (03CR) 10Andrew Bogott: [C: 032] labservices: rearrange hiera for recursor::aliaser_extra_records [puppet] - 10https://gerrit.wikimedia.org/r/398499 (owner: 10Andrew Bogott) [16:44:06] (03CR) 10Brian Wolff: [C: 031] "This looks correct to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [16:44:32] !log labtestvirt2003:~# /sbin/reboot to pickup new kernel [16:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:27] (03PS10) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [16:46:29] (03PS1) 10Andrew Bogott: labtest dns: fix a one-letter typo [puppet] - 10https://gerrit.wikimedia.org/r/398504 [16:47:00] (03CR) 10jerkins-bot: [V: 04-1] labtest dns: fix a one-letter typo [puppet] - 10https://gerrit.wikimedia.org/r/398504 (owner: 10Andrew Bogott) [16:48:22] PROBLEM - Apache HTTP on mw2105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:33] (03PS2) 10Andrew Bogott: labtest dns: fix a one-letter typo [puppet] - 10https://gerrit.wikimedia.org/r/398504 [16:48:35] (03PS11) 10Andrew Bogott: wmcs puppet: Support instance agents using the 'puppet' master hostname [puppet] - 10https://gerrit.wikimedia.org/r/398323 [16:49:13] RECOVERY - Apache HTTP on mw2105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time [16:49:43] (03CR) 10Andrew Bogott: [C: 032] labtest dns: fix a one-letter typo [puppet] - 10https://gerrit.wikimedia.org/r/398504 (owner: 10Andrew Bogott) [16:52:26] (03PS2) 10Alexandros Kosiaris: 
Correct typo in Analytics dump HTML page [puppet] - 10https://gerrit.wikimedia.org/r/398502 (owner: 10Joal) [16:52:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Correct typo in Analytics dump HTML page [puppet] - 10https://gerrit.wikimedia.org/r/398502 (owner: 10Joal) [16:54:23] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:56:36] joal: Alex was faster! [16:57:37] thanks a lot Alex :) [16:59:40] (03PS3) 10Jcrespo: labsdb: Switchover labsdb1009 to labsdb1010 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/398489 (https://phabricator.wikimedia.org/T181777) [17:00:58] (03PS1) 10Filippo Giunchedi: First version [debs/prometheus-nutcracker-exporter] - 10https://gerrit.wikimedia.org/r/398505 (https://phabricator.wikimedia.org/T181995) [17:14:13] 10Operations, 10ops-eqiad: Disconnect flerovium's disk shelves - https://phabricator.wikimedia.org/T181724#3840990 (10faidon) 05Open>03Resolved Confirmed this was shipped now. Documentation was sent out of band, this can be resolved :) [17:14:25] (03CR) 10Jcrespo: [C: 032] labsdb: Switchover labsdb1009 to labsdb1010 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/398489 (https://phabricator.wikimedia.org/T181777) (owner: 10Jcrespo) [17:18:44] !log reloading dbproxy1010 [17:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:23] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [17:25:44] 10Operations, 10Packaging, 10Scap: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3841015 (10akosiaris) After some investigation it turns out I built `master` and not `release` branch which caused all these nice things. 
I am building `3.7.4-2` to fix this [17:28:48] (03PS1) 10Thcipriani: Scap: bump version to 3.7.4-2 [puppet] - 10https://gerrit.wikimedia.org/r/398507 (https://phabricator.wikimedia.org/T182347) [17:30:18] (03PS2) 10Thcipriani: Scap: bump version to 3.7.4-2 [puppet] - 10https://gerrit.wikimedia.org/r/398507 (https://phabricator.wikimedia.org/T182347) [17:32:45] !log stop, upgrade and reboot labsdb1009 [17:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:26] the haproxy alarm will be me [17:35:31] see above [17:36:43] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [17:37:50] (03PS1) 10Jcrespo: Revert "labsdb: Switchover labsdb1009 to labsdb1010 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/398508 [17:38:37] (03CR) 10Jcrespo: "@Marostegui: you can do the alters when you want, I will let you merge the reversion." [puppet] - 10https://gerrit.wikimedia.org/r/398508 (owner: 10Jcrespo) [17:40:43] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [17:40:59] we are back [17:52:54] (03Abandoned) 10Elukey: hadoop: raise jvm heap sizes for HDFS datanode and Yarn daemons [puppet] - 10https://gerrit.wikimedia.org/r/390237 (https://phabricator.wikimedia.org/T178876) (owner: 10Elukey) [17:55:32] (03PS1) 10Andrew Bogott: labnet dnsmasq: use upstream dns servers [puppet] - 10https://gerrit.wikimedia.org/r/398510 (https://phabricator.wikimedia.org/T181375) [17:55:57] (03CR) 10Andrew Bogott: [C: 04-2] "I'm pretty sure that https://gerrit.wikimedia.org/r/#/c/398510/ is a better solution for this." 
[puppet] - 10https://gerrit.wikimedia.org/r/393841 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [18:01:33] RECOVERY - Kafka Broker Replica Max Lag on kafka1023 is OK: OK: Less than 50.00% above the threshold [1000000.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1 [18:05:07] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3841208 (10Smalyshev) Vouched. [18:09:33] (03CR) 10Chad: "I think a unit test is a fine addition to keep us sane. Really, the only times these are changed are when a new wiki is created or a new D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/398400 (owner: 10Chad) [18:13:19] (03CR) 10Chad: [C: 032] "There's no reason for this to stay open, much as I hate the status quo :(" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397747 (owner: 10Hashar) [18:13:49] (03Merged) 10jenkins-bot: Add .gitreview [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/397747 (owner: 10Hashar) [18:30:15] 10Operations, 10Ops-Access-Requests, 10User-Addshore: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3841276 (10Jonas) a:05Jonas>03None [18:32:40] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3841291 (10Jgreen) [18:32:43] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Spike: Spike: Enumerate remaining unported stats - https://phabricator.wikimedia.org/T175850#3841288 (10Jgreen) 05Open>03Resolved a:03Jgreen We have fundraising grafana dashboards now, that cover the stuff we care about. [18:42:43] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [18:44:04] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [18:44:04] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [18:44:12] (03CR) 10Alexandros Kosiaris: [C: 032] Scap: bump version to 3.7.4-2 [puppet] - 10https://gerrit.wikimedia.org/r/398507 (https://phabricator.wikimedia.org/T182347) (owner: 10Thcipriani) [18:46:34] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [18:48:36] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3841335 (10akosiaris) I 've went ahead and built and uploaded the package in order to avoid any immediate issues from the botched release. In D916 jenkins doesn't l... [18:54:32] * apergos raises an eyebrow [18:54:43] I have not been over there, anyone wanna say what's going on? [18:55:42] I guess that's the scap package, akosiaris? ^^ [18:58:30] 10Operations, 10Toolforge: Update ssh key for kaldari on tool forge - https://phabricator.wikimedia.org/T183022#3841341 (10kaldari) [19:00:05] kaldari: re: SSH key for tools, i think you just update it yourself in Horizon web ui [19:00:15] oh yeah? [19:00:28] eh, wait, wikitech wiki [19:00:54] I'll try that... [19:01:34] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:02:03] mutante: That worked! [19:02:17] kaldari: i was confused myself because there is also the same in Gerrit. 
cool :) [19:02:30] running puppet pulls in the new package, I wonder why only "my" hosts whines [19:02:31] on the docs for Horizon it says: [19:02:32] *whined [19:02:35] 10Operations, 10Toolforge: Update ssh key for kaldari on tool forge - https://phabricator.wikimedia.org/T183022#3841352 (10kaldari) 05Open>03Resolved a:03kaldari Nevermind, I can do it myself! https://toolsadmin.wikimedia.org/profile/settings/ssh-keys [19:02:40] "These actions may remain on Wikitech, or may be moved to new custom web tools: " [19:02:44] Individual user management: Account creation, password & 2fa management, management of ssh keys for instance access [19:03:25] mutante: I'll update the documentation [19:03:30] awesome :) [19:04:03] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:04:22] ah, toolsadmin.wm.org *nod* [19:07:43] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:08:30] (03PS3) 10Dzahn: mysql: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/394518 (https://phabricator.wikimedia.org/T177225) [19:09:03] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:10:51] (03PS4) 10Dzahn: mysql eqiad: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/394518 (https://phabricator.wikimedia.org/T177225) [19:12:24] PROBLEM - HHVM rendering on mw2236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:42] apergos: snapshot but also one "sca" in there. 
almost like it was alphabetically [19:13:13] RECOVERY - HHVM rendering on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 79864 bytes in 0.303 second response time [19:13:55] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scap, and 2 others: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3841366 (10thcipriani) [19:13:57] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: SCAP: Upload debian package version 3.7.4-1 - https://phabricator.wikimedia.org/T182347#3841364 (10thcipriani) 05Open>03Resolved >>! In T182347#3841335, @akosiaris wrote: > I 've went ahead and built and uploaded the package in order to avoid any i... [19:14:20] re: mw2236 self-healing via https://en.wikipedia.org/wiki/Observer_effect_(physics) j/k :) [19:15:29] uh huh [19:16:28] (03PS1) 10RobH: Jonas Kress move from ldap to shell, add to groups [puppet] - 10https://gerrit.wikimedia.org/r/398524 (https://phabricator.wikimedia.org/T182908) [19:16:35] runs puppet on sca2004 [19:16:51] yup, updates scap package [19:22:19] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3841399 (10RobH) p:05Triage>03Normal [19:22:47] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3838373 (10RobH) [19:23:06] (03PS1) 10Dzahn: external storage eqiad: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398526 (https://phabricator.wikimedia.org/T177225) [19:32:25] (03CR) 10Dzahn: [C: 032] external storage eqiad: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398526 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:37:23] (03CR) 10Dzahn: "no issues in this group, no manual kills needed" [puppet] - 10https://gerrit.wikimedia.org/r/398526 
(https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:39:10] !log reboot labtestvirt2003 [19:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:27] (03PS1) 10Dzahn: dbproxy eqiad: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398528 (https://phabricator.wikimedia.org/T177225) [19:44:31] (03CR) 10Dzahn: [C: 032] "these gotta go by regex because the role keyword isn't used yet" [puppet] - 10https://gerrit.wikimedia.org/r/398528 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [19:55:28] (03PS1) 10Dzahn: db eqiad: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398531 (https://phabricator.wikimedia.org/T177225) [20:05:55] (03CR) 10Dzahn: [C: 032] db eqiad: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/398531 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:11:02] (03PS1) 10Madhuvishy: maintain-dbusers: Stop managing account creation for labsdb1001 and 1003 [puppet] - 10https://gerrit.wikimedia.org/r/398533 (https://phabricator.wikimedia.org/T183029) [20:12:00] is closely watching for any special cases on db* hosts but nothing yet [20:13:59] (03CR) 10Madhuvishy: [C: 032] maintain-dbusers: Stop managing account creation for labsdb1001 and 1003 [puppet] - 10https://gerrit.wikimedia.org/r/398533 (https://phabricator.wikimedia.org/T183029) (owner: 10Madhuvishy) [20:27:41] (03PS15) 10Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/392221 [20:30:47] (03PS16) 10Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/392221 [20:36:46] (03CR) 10Dzahn: "instead done separately in multiple steps. and now 3 labsdb* hosts need to stay for a moment" [puppet] - 10https://gerrit.wikimedia.org/r/394518 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:38:47] (03PS1) 10Dbrant: Update whitelist for a few mobile app schemas. 
[puppet] - 10https://gerrit.wikimedia.org/r/398542 [20:51:59] 10Operations, 10monitoring, 10Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3841695 (10Dzahn) Alright, Ganglia is purged from everything across the board, except 17 hosts now! :) They are: 4 x maps codfw (osm/postgres) 4 x maps eqiad (osm/postgres) 3 x ma... [21:12:23] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:13] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 79848 bytes in 0.156 second response time [21:23:12] (03PS1) 10BryanDavis: wiki replicas: point *.labsdb to *.analytics.db.svc.eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/398551 (https://phabricator.wikimedia.org/T142807) [21:28:37] (03CR) 10Andrew Bogott: [C: 032] wiki replicas: point *.labsdb to *.analytics.db.svc.eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/398551 (https://phabricator.wikimedia.org/T142807) (owner: 10BryanDavis) [21:32:31] PROBLEM - MD RAID on labtestvirt2003 is CRITICAL: Return code of 255 is out of bounds [22:25:39] (03Abandoned) 10Andrew Bogott: TESTING: this is a patch that introduces an intentional broken diff [puppet] - 10https://gerrit.wikimedia.org/r/383860 (owner: 10Andrew Bogott) [22:25:51] (03Abandoned) 10Andrew Bogott: DO NOT MERGE: no-op patch for testing [puppet] - 10https://gerrit.wikimedia.org/r/383942 (owner: 10Andrew Bogott) [22:25:58] (03Abandoned) 10Andrew Bogott: DO NOT MERGE: testing patch that should break with the future parser [puppet] - 10https://gerrit.wikimedia.org/r/384595 (owner: 10Andrew Bogott) [22:30:41] no_justification: around? I think we need to deploy https://gerrit.wikimedia.org/r/#/c/398599/ to fix https://phabricator.wikimedia.org/T182867 [22:30:55] is it ok if I sync it out? 
[22:32:30] lgtm [22:37:13] ummm [22:37:14] no_justification: [22:37:15] 22:36:59 ['/srv/deployment/scap/scap/bin/scap', 'pull', '--no-update-l10n', '--include', 'php-1.31.0-wmf.12', '--include', 'php-1.31.0-wmf.12/extensions', '--include', 'php-1.31.0-wmf.12/extensions/LoginNotify', '--include', 'php-1.31.0-wmf.12/extensions/LoginNotify/includes', '--include', 'php-1.31.0-wmf.12/extensions/LoginNotify/includes/LoginNotify.php', 'tin.eqiad.wmnet', 'naos.codfw.wmnet', 'tin.eqiad.wmnet'] on mw1264.eqiad.wmnet [22:37:15] returned [127]: bash: /srv/deployment/scap/scap/bin/scap: No such file or directory [22:37:36] Ummmmm. [22:37:37] 22:36:59 ['/srv/deployment/scap/scap/bin/scap', 'pull-master', 'tin.eqiad.wmnet'] on naos.codfw.wmnet returned [127]: Could not chdir to home directory /var/lib/mwdeploy: No such file or directory [22:37:37] bash: /srv/deployment/scap/scap/bin/scap: No such file or directory [22:37:39] thcipriani: ^ [22:37:52] 10Operations, 10Puppet: Trusty puppet 4 approach - https://phabricator.wikimedia.org/T182894#3841972 (10herron) A first stab at Trusty packages for puppet 4.8.2 and dependencies (hiera, ruby-deep-merge) have been built on boron (in /var/cache/pbuilder/result/trusty-amd64/) and appear to be working after cursor... [22:37:56] also ummm [22:38:05] all of sync masters and canaries failed [22:38:17] /srv/deployment/scap/* shouldn't exist anymore in prod.... [22:38:31] I wonder if the patch file didn't get applied on deb package build? [22:38:33] the command I ran was [22:38:34] legoktm@tin:/srv/mediawiki-staging$ scap sync-file php-1.31.0-wmf.12/extensions/LoginNotify/includes/LoginNotify.php "Use extension registry to check for CheckUser to be installed - T182867" [22:38:34] T182867: "Login to Wikidata as QuickStatementsBot from a computer you have not recently used" - https://phabricator.wikimedia.org/T182867 [22:39:17] yeah, if the quilt patches didn't get applied in the debian package... [22:40:28] was a new version deployed? 
[22:40:46] this morning, yeah [22:41:03] was it tested afterwards? :p [22:41:32] no :( [22:41:52] so /usr/lib/python2.7/dist-packages/scap/config.py shows the wrong path to scap. That's applied via quilt. [22:42:08] it should have been a redeploy of the same version that was running. [22:42:27] except with a config flag that was missing. [22:42:49] is a rollback possible? [22:42:54] details: https://phabricator.wikimedia.org/T182347 [22:43:09] not possible with my powers [22:43:14] if there's an opsen around maybe. [22:43:25] I can rebuild the package, but can't upload [22:43:44] mutante: around? [22:45:18] heis at lunch I think [22:46:32] I think we need someone to rebuild and upload the scap package? [22:47:00] ugh [22:47:18] can we not uh [22:47:25] the old package is not around? [22:47:27] legoktm: what's up [22:47:32] just force downgrade or something? [22:47:32] eh [22:47:45] 10Operations, 10Scap: scap is broken bash: /srv/deployment/scap/scap/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T183046#3841983 (10Legoktm) p:05Triage>03Unbreak! [22:47:55] but .. wasnt it upgraded earlier by those puppet runs [22:48:00] apergos [22:48:04] yes it was [22:48:05] mutante: it was [22:48:13] so the question is whether theold package is still around someplace [22:48:25] if it's a debian package and it was just updated recently [22:48:27] what's the build host these days? [22:48:29] it's probably still there in /var/ [22:48:33] ah indeed [22:48:35] (on the target upgraded hosts) [22:48:48] the build host now is boron [22:48:56] which is confusing because boron was something different. but yea [22:49:06] look for it on the target hosts that get the package install in /var/cache/apt/archives/ [22:49:11] so the situation is: the version that was release on monday was cut from the wrong branch, that was a configuration variable that was out of place. 
I asked for a new package cut this morning from the release branch, but it looks like for whatever reason quilt patches didn't get applied. [22:49:19] and then you could do a cumin "dpkg -i /var/cache/apt/archives/foo" [22:49:53] I have a copy [22:50:05] let me get it over to cumin though [22:50:17] i see 4.5.0-6 [22:50:17] bblack@naos:~$ ls /var/cache/apt/archives/scap_3* [22:50:17] /var/cache/apt/archives/scap_3.5.5-1_all.deb /var/cache/apt/archives/scap_3.6.0-1_all.deb /var/cache/apt/archives/scap_3.7.2-1_all.deb [22:50:19] and several older ones [22:50:21] /var/cache/apt/archives/scap_3.5.6-1_all.deb /var/cache/apt/archives/scap_3.6.0-2_all.deb /var/cache/apt/archives/scap_3.7.3-1_all.deb [22:50:23] /var/cache/apt/archives/scap_3.5.7-1_all.deb /var/cache/apt/archives/scap_3.7.0-1_all.deb /var/cache/apt/archives/scap_3.7.4-1_all.deb [22:50:26] /var/cache/apt/archives/scap_3.5.8-1_all.deb /var/cache/apt/archives/scap_3.7.1-1_all.deb /var/cache/apt/archives/scap_3.7.4-2_all.deb [22:50:38] if all the targest have it there, you can just install from the local path on them all without copying it around [22:50:39] sorry, wrong number. yes, 3.7.4-2 [22:50:54] 3.7.4-1 is safe-ish [22:51:12] still need to update it to the correct branch so that it's not wasting disk space on tin, but it's better than broken [22:51:44] that's what we've been using all this week [22:52:24] the one I got is uh scap_3.7.2-1_all.deb and what's installed as of today is um scap_3.7.3-1_all.deb [22:52:26] so, on which hosts ? should i install 3.7.4.1 on tin? 
[22:52:34] and remove 3.7.4-2 [22:52:37] well on the host I'm on [22:53:12] I...am not sure what would happen if we downgraded to 3.7.3 [22:53:19] i am on tin, 3.7.4-2 is installed and scap_3.7.4-1_all.deb is in /var/cache/apt/archives [22:53:23] fine [22:53:26] ah cool [22:54:07] I wonder how that's not the version on snapshot1007 accoring to dpkg [22:54:13] lemme doublecheck [22:54:34] oh it's hilarious [22:54:41] so we need to install 3.7.4-1 all over and revert https://gerrit.wikimedia.org/r/#/c/398507/ [22:54:44] it's being installed and then either silently failing or being removed [22:54:45] nice [22:55:16] oh good. [22:55:33] I wonder what the version is actually installed on most hosts [22:55:38] (cumin check anyone? I'm not there) [22:55:47] !log tin - apt-get remove scap ; dpkg -i /var/cache/apt/archives/scap_3.7.4-1_all.deb [22:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:59] ii scap 3.7.4-1 [22:56:08] (03PS1) 10Thcipriani: Revert "Scap: bump version to 3.7.4-2" [puppet] - 10https://gerrit.wikimedia.org/r/398603 [22:56:38] ^ mutante I think that will need to merge as well otherwise puppet'll undo what you just did [22:57:23] i wonder about the right order [22:57:33] what will happen if we merge before doing that on everything [22:57:38] and what is the list of hosts we need, heh [22:58:06] I think we'll probably see puppet failures if you merge before installing [22:58:38] you could run the cumin command to downgrade, merge puppet, run the cumin command again and there'd probably be not a lot of fallout? [22:59:00] RECOVERY - MD RAID on labtestvirt2003 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [22:59:12] the list...I don't know enough about cumin, is that info available in puppetdb somehow someway? [22:59:52] yea, so just what is the right selector, by role? [23:00:19] mutante: need help? 
[23:00:31] the mw hosts seem to have 3.7.4-2 all of them [23:00:34] volans|off: how to target "all hosts using scap" [23:00:38] everything that has scap::init [23:01:02] R:Package = scap should do if the package is managed by puppet [23:01:17] I'm on mobile so cannot verify right now [23:01:17] :) cool! [23:01:49] and I could make syntax errors, but give it a try and see what list returns (without adding a command to execute) [23:01:52] 478 hosts [23:02:11] that's probably in the ballpark [23:02:17] seems about right, yes [23:02:26] can you checek to see what versions they have installed? [23:02:54] runs dpkg -l scap on that list [23:03:11] 'R:Package = scap' 'dpkg -l | grep scap' [23:03:22] no need for the extra grep [23:03:36] I found dpkg -l foo misleading in some occasions [23:03:43] oh? interesting [23:03:58] not long ago, but cannot remember the problem right now [23:04:13] if you ever remember I want to hear the story [23:04:19] beers [23:04:21] it's running, just i added sleep and batch that were overly cautious :p [23:04:25] k [23:04:43] the dpkg? it was, I ran without a batch for dpkg -l and it was fine [23:04:46] across all mws :-P [23:04:47] mutante: sleep for checking the versiob? [23:04:56] no need [23:05:00] :) [23:05:14] the two of us who are off say in unison :-P [23:05:24] ofc [23:05:30] we're not running it [23:05:35] heh heh [23:05:38] :-P [23:06:03] https://phabricator.wikimedia.org/P6476 [23:06:35] uh [23:07:01] do that on the snapshots [23:07:15] I guarantee you it is running 3.7.3-1 [23:07:18] how are those missing? [23:07:39] snapshot[1001,1005-1007] ? [23:07:44] oh I see, they are near the end of that long list [23:07:46] yep [23:07:49] 3.7.3-1 [23:08:14] huh, well that's not great. [23:08:53] no [23:08:58] and I don't know why it's like that [23:09:16] it's weird puppet's not complaining about all those? 
[23:10:10] it claims to to the install [23:10:18] (I'm on snapshot1007, can't speak for the rest) [23:10:21] *to do [23:10:23] no errors [23:10:32] when I check, the old one is still the one [23:10:37] is there anything else I can do for you? [23:10:42] all these are the trusty hosts, right [23:11:10] volans|off: maybe one last question, can we also target the exact package version [23:11:50] yes trusty [23:12:01] but these are _all_, it should pick [23:12:03] snapshot1001, californium.. both trusty [23:12:09] maybe it's not in the repo somehow or [23:12:16] no that doesn't make sense, why would it try the install them [23:12:18] *then [23:12:24] I don't think puppetdb has the version tracked, not sure but you can use the list from the paste in phab [23:12:28] and do [23:12:58] cumin 'D{long list of hosts}' 'command' [23:13:05] reprepro ls scap [23:13:06] scap | 3.7.4-2 | trusty-wikimedia | amd64, i386, source [23:13:17] does apt-cache policy scap on those hosts show 3.7.4-2 ? [23:13:24] volans|off: ok, thanks! [23:14:18] I'm going back off but if you need me feel free to ping me, I should get the notification ;) [23:14:39] (unless I'm sleeping :-P) [23:14:42] ok,cool [23:15:09] see ya [23:15:42] apergos: so on snapshot1001. apt-get install scap doesnt say "already installed" [23:15:43] let me see about the policy, good call [23:15:48] no it's not [23:15:49] it says "will be installed" [23:15:58] and 0 upgraded [23:16:04] scap: [23:16:05] Installed: 3.7.3-1 [23:16:05] Candidate: 3.7.4-2 [23:16:11] thcipriani: ^^ [23:16:35] interesting so new version is in the cache anyway [23:17:14] so it seems we have 3.7.3-1 in cache on all of them, yea [23:17:14] so if you do: apt-get install scap it *should* update... [23:17:29] yet, it doesnt [23:17:35] The following extra packages will be installed: [23:17:35] scap [23:17:36] Suggested packages: [23:17:36] python-semver [23:17:36] 0 upgraded, 0 newly installed, 0 to remove and 75 not upgraded. 
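The `apt-cache policy scap` output quoted above (Installed: 3.7.3-1, Candidate: 3.7.4-2) is what showed that the trusty hosts were not picking up the candidate package. A minimal Python sketch of reading that kind of output follows. The function names `parse_policy` and `upgrade_pending` and the sample text are assumptions for illustration only; they are not part of scap, cumin, or apt.

```python
import re


def parse_policy(text):
    """Return (installed, candidate) from `apt-cache policy <pkg>` style text.

    Either value is None if the corresponding line is absent.
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(Installed|Candidate):\s*(\S+)", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields.get("Installed"), fields.get("Candidate")


def upgrade_pending(text):
    """True when both versions are reported and they differ."""
    installed, candidate = parse_policy(text)
    return installed is not None and candidate is not None and installed != candidate


# Sample modeled on the output pasted in the channel (hypothetical layout).
SAMPLE = """scap:
  Installed: 3.7.3-1
  Candidate: 3.7.4-2
"""

if __name__ == "__main__":
    print(parse_policy(SAMPLE))
    print(upgrade_pending(SAMPLE))
```

A differing Installed and Candidate only says an upgrade is outstanding; it does not say why, which is why the channel went on to inspect the package dependencies.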
[23:17:44] "extra" [23:18:02] no it does not [23:18:11] shoulda coulda woulda [23:18:12] and yet [23:18:29] let me try removing it on snapshot1001, then repeat that [23:18:51] huh [23:19:21] sure go to town [23:19:51] scap : Depends: python-semver but it is not installable [23:20:37] nice [23:20:41] but...that should be in suggests [23:20:41] Setting up scap (3.7.3-1) ... [23:20:49] so.. apt-get install scap = broken [23:20:55] dpkg -i of the 3-1 version = ok [23:21:17] https://github.com/wikimedia/scap/blob/release/debian/control#L19 [23:21:20] please tell me that scap doesn't need python-semver [23:21:56] ah.. this is going to be someting about automatically installing suggests [23:21:59] ? [23:22:05] because ain't no trusty package for it [23:22:37] yeah that's why we kept it over in suggests iirc (although that discussion was a while ago so it's a bit fuzzy) [23:22:51] Also: ended up not using it, so could probably drop entirely [23:23:17] should i try to fix all the non-trusty ones [23:23:24] and then we merge that change? [23:23:32] mutante: yes please [23:23:51] it's less broken than current (we survived for a week in that state) [23:25:03] (broken infofar as it enables a feature we didn't want to yet and that costs disk space. But it won't kill us) [23:25:14] it shouldn't affect deployments if the servers stay on 3.7.3...trying to thing what we added and really 3.7.4 saves disk space, but it shouldn't hurt anything to deploy from tin (on 3.7.4) to a server that's on 3.7.3 [23:25:22] *trying to think [23:26:32] starts to create a host list..from the regex from earlier [23:26:38] ok! [23:28:22] I see APT::Install-Recommends "false"; in the /etc/apt/apt.conf.d files [23:28:28] on snapshot1007 always [23:30:10] does apt-cache info scap look same as the control file I pasted? [23:30:16] wrt python-semver [23:33:56] so [23:33:59] scap depends on python-semver; however: [23:33:59] Package python-semver is not installed. 
[23:34:07] dpkg -i on the scap 3.7.4-2 deb file [23:34:10] that would be why [23:34:18] (I apt-get download ed it [23:34:19] ) [23:34:37] so go ahead and force around the 3.7.4-1 file to us all [23:35:03] that's a hard depends, must be wrong in the build [23:36:32] PROBLEM - DPKG on snapshot1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:08] !log aqs1004, analytics1003, downgraded scap to 3.7.4-1 [23:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:23] Version: 3.7.4-2 [23:37:23] Architecture: all [23:37:24] Maintainer: Wikimedia Foundation Release Engineering [23:37:24] Installed-Size: 497 [23:37:24] Depends: python, python-configparser, python-jinja2, python-psutil, python-pygments, python-requests, python-semver, python-six, python:any (<< 2.8), python:any (>= 2.7.5-5~), python-yaml, git, [23:37:24] bash-completion, python-conftool [23:37:32] that's from the extracted control file, sorry for the spam [23:37:44] yeah whatever, it can go ahead and report [23:37:47] lemme fix thta up [23:38:32] RECOVERY - DPKG on snapshot1007 is OK: All packages OK [23:38:47] fixed, back on the old 3.7.2 [23:38:59] gotta disable puppet on a lot of hosts [23:39:04] .. icinga.. [23:39:09] ouch [23:39:43] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [23:39:50] jfc [23:41:12] fine, it should shut up now [23:43:42] 10Operations, 10Scap: scap is broken bash: /srv/deployment/scap/scap/bin/scap: No such file or directory - https://phabricator.wikimedia.org/T183046#3842035 (10thcipriani) [23:44:42] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:46:03] so it should be rolled back now? should I try to deploy? 
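The extracted control fields quoted above list python-semver under Depends, which is what makes the package uninstallable where python-semver is not available. A rough Python sketch of checking a control stanza for a hard dependency is below. The helper names `parse_control` and `package_names` are hypothetical, the sample stanza is abbreviated from the pasted one, and the parsing covers only the simple cases (continuation lines, version constraints, `|` alternatives, `:any` qualifiers).

```python
def parse_control(text):
    """Parse a Debian control stanza into a dict of field name -> value.

    Lines starting with whitespace are treated as continuations of the
    previous field.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and current:
            fields[current] += " " + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            current = name.strip()
            fields[current] = value.strip()
    return fields


def package_names(field_value):
    """Return bare package names from a Depends-style field value."""
    names = set()
    for item in field_value.split(","):
        for alternative in item.split("|"):
            token = alternative.strip()
            if token:
                # Drop "(>= x)" constraints and ":any" style qualifiers.
                names.add(token.split()[0].split(":")[0])
    return names


# Abbreviated sample based on the control fields pasted in the channel.
SAMPLE = (
    "Package: scap\n"
    "Version: 3.7.4-2\n"
    "Architecture: all\n"
    "Depends: python, python-semver, python:any (>= 2.7.5-5~), git,\n"
    " bash-completion, python-conftool\n"
)

if __name__ == "__main__":
    deps = package_names(parse_control(SAMPLE)["Depends"])
    print("python-semver" in deps)
```

Running the same check against the Suggests field would show whether a package sits where the source control file intended it to be.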
[23:46:46] (03PS1) 10BryanDavis: horizon: Update logos and file naming [puppet] - 10https://gerrit.wikimedia.org/r/398605 (https://phabricator.wikimedia.org/T168480) [23:46:49] scap_3.7.4-1 has the same depends [23:46:50] 10Operations, 10Scap: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046#3842040 (10thcipriani) [23:46:55] it will simply fail to install on the trusty hosts [23:46:57] shrug [23:47:06] so don't spkg -i there [23:47:09] *dpkg -i [23:47:19] mutante: [23:47:37] legoktm: nope [23:48:05] give a couple minutes yet [23:48:26] hrm, I don't know enough about packaging stuff to know why the depends are different than in the control file for 3.7.4-* :( [23:48:43] not your issue, nor ours for tonight [23:48:55] we'll just skip the trusty hosts and leave as is, they'll be fine [23:49:06] get 3.7.4-1 on the rest [23:49:32] might be nice to note on the ticket I guess [23:51:05] (03CR) 10BryanDavis: "I don't have a local Horizon to test the 125px-Cloud_VPS_dashboard_logo.png display in. I have an svg source file for it that we can use t" [puppet] - 10https://gerrit.wikimedia.org/r/398605 (https://phabricator.wikimedia.org/T168480) (owner: 10BryanDavis) [23:53:08] cumin -b 10 -s 5 'R:Package = scap' 'if dpkg -l scap | grep "3.7.4.2" && file /var/cache/apt/archives/scap_3.7.4-1_all.deb; then puppet agent --disable; apt-get remove --yes -q scap ; dpkg -i /var/cache/apt/archives/scap_3.7.4-1_all.deb ; fi' [23:53:40] only if 3.7.4.2 is installed AND we have 3.7.4-1 available.. then .. disable puppet, remove it, install -1 version [23:53:56] on whatever has scap installed. will skip trusty and no host list [23:54:05] seems ok [23:54:12] we can always check after [23:55:28] does file exit non-zero when it can't find something? [23:55:28] after this will be check to see across all scap targets, what versions they have? 
[23:55:30] test -f [23:55:30] !log downgrading scap from 3.7.4-2 to 3.7.4-1 where it is installed - cumin -b 10 -s 5 'R:Package = scap' 'if dpkg -l scap | grep "3.7.4.2" && file /var/cache/apt/archives/scap_3.7.4-1_all.deb; then puppet agent --disable; apt-get remove --yes -q scap ; dpkg -i /var/cache/apt/archives/scap_3.7.4-1_all.deb ; fi' targeting 478 hosts (T183046) [23:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:41] T183046: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046 [23:55:42] and make sure that there's no 3.7.4-2 anyplace I guess? [23:56:18] 11.7% success [23:56:24] if dpkg -l scap | grep '3.7.4.2' && test -f /var/cache/apt/archives/scap_3.7.4-1_all.deb [23:56:50] what do the failures look like? [23:57:20] it doesnt show you, it aborts it for "too high failure rate". after 56/478 [23:57:25] meh [23:57:26] thcipriani: right! thx [23:57:42] oh heh [23:58:44] hurries up before the icinga alerts kick in for puppet being disabled :p [23:58:56] you get an hour right? :-P [23:59:05] 117/478 [23:59:28] (03PS1) 10Chad: scap: Set bin_dir globally to /usr/bin [puppet] - 10https://gerrit.wikimedia.org/r/398606 (https://phabricator.wikimedia.org/T183046)
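The guard in the cumin command above has two parts: the installed version must be 3.7.4-2, and the 3.7.4-1 .deb must exist in the local apt cache. Note that `grep "3.7.4.2"` treats each `.` as "any character", so it matches 3.7.4-2 loosely, and the channel suggested `test -f` for the file check. A small Python sketch of the same decision with an exact version comparison follows. The names `installed_version` and `should_downgrade` are hypothetical, and the `dpkg -l` line layout is assumed to be the usual "ii name version arch description" form.

```python
import os

BAD_VERSION = "3.7.4-2"
FALLBACK_DEB = "/var/cache/apt/archives/scap_3.7.4-1_all.deb"


def installed_version(dpkg_output, package="scap"):
    """Return the version from a `dpkg -l` style listing, or None."""
    for line in dpkg_output.splitlines():
        parts = line.split()
        # "ii" marks a package that is installed and configured.
        if len(parts) >= 3 and parts[0] == "ii" and parts[1] == package:
            return parts[2]
    return None


def should_downgrade(dpkg_output, deb_path=FALLBACK_DEB, exists=os.path.isfile):
    """True only if the bad version is installed AND the fallback .deb exists.

    `exists` defaults to os.path.isfile, the rough equivalent of `test -f`.
    """
    return installed_version(dpkg_output) == BAD_VERSION and exists(deb_path)


if __name__ == "__main__":
    listing = "ii  scap  3.7.4-2  all  deployment tool"
    print(should_downgrade(listing))
```

Hosts still on 3.7.3-1, such as the trusty ones discussed above, fall through this check untouched, which matches the intent stated in the channel.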