[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190222T0000). [00:00:05] Smalyshev and ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:32] \o [00:01:39] i suppose i can ship things today [00:02:59] (PS4) EBernhardson: [cirrus] Switch production search traffic to codfw (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/492044 (https://phabricator.wikimedia.org/T215931) [00:03:19] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/492044 (https://phabricator.wikimedia.org/T215931) (owner: EBernhardson) [00:04:44] ottomata: there looks to be an undeployed mediawiki-config patch for 'Use eventbus multi endpoint configuration for eventbus configs' [00:04:51] ottomata: can i revert it? should i deploy it? [00:04:54] (Merged) jenkins-bot: [cirrus] Switch production search traffic to codfw (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/492044 (https://phabricator.wikimedia.org/T215931) (owner: EBernhardson) [00:05:24] ottomata: this one: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/490418/ [00:05:58] Pchelolo: you might know as well ^^% [00:06:43] i'll give it 5 minutes, then reverting since it was never deployed [00:07:01] ebernhardson: I've stepped away recently, ottomata was going to test it in beta, not sure what was the outcome [00:07:55] hmm, safest without otto around (it's past the end of his typical work day) i'm going to revert. It can always be redeployed [00:08:42] ebernhardson: k. 
thank you, sorry for the inconvenience [00:09:04] (PS1) EBernhardson: Revert "Use EventBus multi endpoint configuration for eventbus configs" [mediawiki-config] - https://gerrit.wikimedia.org/r/492217 [00:09:18] (CR) EBernhardson: [C: +2] "Was merged but never deployed to prod" [mediawiki-config] - https://gerrit.wikimedia.org/r/492217 (owner: EBernhardson) [00:10:12] (Merged) jenkins-bot: Revert "Use EventBus multi endpoint configuration for eventbus configs" [mediawiki-config] - https://gerrit.wikimedia.org/r/492217 (owner: EBernhardson) [00:10:34] no worries, reverts are pretty easy. Slightly worried it will break something in beta since i don't know much about it, but breaking beta seems to be the cool thing to do lately :) [00:11:07] ebernhardson: no, beta will be fine [00:11:07] excellent [00:11:07] it's totally backward-forward compatible [00:14:27] (CR) jenkins-bot: [cirrus] Switch production search traffic to codfw (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/492044 (https://phabricator.wikimedia.org/T215931) (owner: EBernhardson) [00:14:29] (CR) jenkins-bot: Revert "Use EventBus multi endpoint configuration for eventbus configs" [mediawiki-config] - https://gerrit.wikimedia.org/r/492217 (owner: EBernhardson) [00:16:53] SMalyshev: around for your deployment? 
[00:17:14] !log ebernhardson@deploy1001 sync-file aborted: T215931 (duration: 00m 00s) [00:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:17] T215931: Upgrade elasticsearch to 5.6.14 - https://phabricator.wikimedia.org/T215931 [00:17:29] ebernhardson: yes [00:17:37] SMalyshev: ok i'll merge the first and pull to mwdebug1001 [00:17:47] sorry, got distracted reading some code :) [00:17:55] (PS2) EBernhardson: Deploy WikibaseCirrusSearch: Part I, extensionlist [mediawiki-config] - https://gerrit.wikimedia.org/r/490643 (owner: Jforrester) [00:18:12] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T215931 [cirrus] Switch production search traffic to codfw (1/2) (duration: 00m 46s) [00:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:16] ebernhardson: the first one probably would do nothing at all [00:18:17] (CR) EBernhardson: [C: +2] "SWAT. afaict james -1 has been resolved by waiting the week." [mediawiki-config] - https://gerrit.wikimedia.org/r/490643 (owner: Jforrester) [00:18:20] it's just extension list [00:18:28] ahh, yea that will be easy [00:18:52] sigh, now to look in logstash and see what errors canceled the cirrus cluster sync [00:18:57] s/sync/switchover/ [00:19:15] ebernhardson: 2 and 3 also don't do much by themselves... just need to ensure nothing explodes but otherwise they are pretty noop [00:19:27] (Merged) jenkins-bot: Deploy WikibaseCirrusSearch: Part I, extensionlist [mediawiki-config] - https://gerrit.wikimedia.org/r/490643 (owner: Jforrester) [00:20:05] sigh, the error rate increased from 0.2 to 2.0. 
The errors were very typical errors with diffs timing out [00:20:14] going to simply ship it again, they are completely unrelated [00:20:52] (PS2) Smalyshev: Deploy WikibaseCirrusSearch: Part II, InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/490644 (owner: Jforrester) [00:21:11] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T215931 [cirrus] Switch production search traffic to codfw (1/2) (duration: 00m 45s) [00:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:41] (CR) EBernhardson: [C: +2] Deploy WikibaseCirrusSearch: Part II, InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/490644 (owner: Jforrester) [00:23:16] !log ebernhardson@deploy1001 Synchronized wmf-config/extension-list: Deploy WikibaseCirrusSearch: Part I, extensionlist (duration: 00m 46s) [00:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:42] (Merged) jenkins-bot: Deploy WikibaseCirrusSearch: Part II, InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/490644 (owner: Jforrester) [00:24:32] ebernhardson: does https://wikidata.beta.wmflabs.org run from the same repos as production or it's different sync? [00:24:42] i really wish we had some warmup procedure for elastic clusters ... 
deploying only the small wikis still gave a temporary mean latency of ~900ms [00:25:16] SMalyshev: same repos, if it's the same as before it has a cron that pulls every 10 minutes or something, but i could be way out of date on that information [00:25:42] (CR) jenkins-bot: Deploy WikibaseCirrusSearch: Part I, extensionlist [mediawiki-config] - https://gerrit.wikimedia.org/r/490643 (owner: Jforrester) [00:25:44] (CR) jenkins-bot: Deploy WikibaseCirrusSearch: Part II, InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/490644 (owner: Jforrester) [00:25:46] ok, so I was thinking also enabling extension on beta [00:25:58] SMalyshev: typically done from InitialiseSettings-labs.php [00:26:08] if nothing breaks then we can deploy it on test on Monday [00:26:13] sounds god [00:26:16] add an extra o [00:26:19] ebernhardson: yes https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/490646 [00:26:40] i'll rebase it [00:27:22] (PS2) EBernhardson: Deploy WikibaseCirrusSearch: Part III, Wikibase.php [mediawiki-config] - https://gerrit.wikimedia.org/r/490645 (owner: Jforrester) [00:27:26] (PS3) Smalyshev: Deploy WikibaseCirrusSearch: Part III, Wikibase.php [mediawiki-config] - https://gerrit.wikimedia.org/r/490645 (owner: Jforrester) [00:27:38] hehe you were faster :) [00:27:58] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Deploy WikibaseCirrusSearch: Part II, InitialiseSettings.php (duration: 00m 46s) [00:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:10] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/490645 (owner: Jforrester) [00:28:29] SMalyshev: if i remember right, the third is still a noop. It loads the extension but nothing is turned on? 
[00:28:35] well, effectively a noop [00:29:20] ebernhardson: yes [00:29:51] ok, i'll pull that one to mwdebug anyways just to double check [00:29:55] since it actually loads the code [00:29:58] ebernhardson: yeah please [00:30:12] it shouldn't even load the ext but let's verify it [00:31:58] oh right, the var is set false [00:32:06] yep [00:33:46] not sure what's up with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/490645 - zuul finished but it doesn't seem to be merged [00:33:52] yea seeing the same [00:33:58] hmm [00:34:48] (CR) EBernhardson: [C: +2] "try merging again?" [mediawiki-config] - https://gerrit.wikimedia.org/r/490645 (owner: Jforrester) [00:35:50] (Merged) jenkins-bot: Deploy WikibaseCirrusSearch: Part III, Wikibase.php [mediawiki-config] - https://gerrit.wikimedia.org/r/490645 (owner: Jforrester) [00:36:24] ok now we can test it [00:36:37] (PS3) EBernhardson: [cirrus] Switch production search traffic to codfw (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/492045 (https://phabricator.wikimedia.org/T215931) [00:36:50] SMalyshev: pulled to mwdebug1001 [00:37:10] (PS2) Smalyshev: [BETA] Enable WikibaseCirrusSearch on wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/490646 (owner: Jforrester) [00:37:12] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/492045 (https://phabricator.wikimedia.org/T215931) (owner: EBernhardson) [00:38:06] ebernhardson: everything seems to be normal [00:38:28] (Merged) jenkins-bot: [cirrus] Switch production search traffic to codfw (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/492045 (https://phabricator.wikimedia.org/T215931) (owner: EBernhardson) [00:39:24] !log ebernhardson@deploy1001 Synchronized wmf-config/Wikibase.php: Deploy WikibaseCirrusSearch: Part III, Wikibase.php (duration: 00m 45s) [00:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:52] 
(CR) Smalyshev: [C: -1] "Waiting for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/490646 - if that goes fine, this one goes in" [mediawiki-config] - https://gerrit.wikimedia.org/r/490647 (owner: Jforrester) [00:40:22] ebernhardson: ok I think we can do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/490646 now [00:43:23] (CR) jenkins-bot: Deploy WikibaseCirrusSearch: Part III, Wikibase.php [mediawiki-config] - https://gerrit.wikimedia.org/r/490645 (owner: Jforrester) [00:43:25] (CR) jenkins-bot: [cirrus] Switch production search traffic to codfw (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/492045 (https://phabricator.wikimedia.org/T215931) (owner: EBernhardson) [00:44:07] SMalyshev: ok, one sec. Trying something by issuing a few queries with the most common words to codfw against all indices to try and warm it up. Probably not going to help but who knows... [00:44:20] ebernhardson: sure, no rush [00:45:23] !log ebernhardson@deploy1001 sync-file aborted: T215931 [cirrus] Switch production search traffic to codfw (2/2) (duration: 00m 05s) [00:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:26] T215931: Upgrade elasticsearch to 5.6.14 - https://phabricator.wikimedia.org/T215931 [00:45:42] RECOVERY - MegaRAID on labsdb1005 is OK: OK: optimal, 1 logical, 12 physical, WriteBack policy [00:46:23] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T215931 [cirrus] Switch production search traffic to codfw (2/2) (duration: 00m 46s) [00:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:00] huh, so far traffic is shifting but no major latency spike [00:50:38] (CR) EBernhardson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/490646 (owner: Jforrester) [00:51:19] (Merged) jenkins-bot: [BETA] Enable WikibaseCirrusSearch on wikidata [mediawiki-config] - 
https://gerrit.wikimedia.org/r/490646 (owner: Jforrester) [00:56:41] (CR) jenkins-bot: [BETA] Enable WikibaseCirrusSearch on wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/490646 (owner: Jforrester) [00:56:56] last patch (labs noop [00:57:02] last patch (labs noop) shipping now, with that SWAT should be complete [00:57:06] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Noop sync of labs settings (duration: 00m 44s) [00:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:20] ebernhardson: so should it be on beta already? [00:58:42] SMalyshev: beta syncs on a cron job afaik, every 10 minutes or some such [00:58:47] i can probably log in and check, sec [00:58:50] ah ok I'll wait then [00:58:57] because so far not seeing it [00:59:43] SMalyshev: i see it in /srv/mediawiki-staging of beta cluster. hmm [00:59:52] anyone knows why is deployment-prep in readonly mode? [00:59:57] ebernhardson: the extension code or enabled? [01:00:00] Pchelolo: database fell over last weekend [01:00:06] Pchelolo: someone is trying to make innodb happy again [01:00:17] oh, ok.. :( [01:00:24] no testing for me.. [01:00:32] SMalyshev: the patch "[BETA] Enable WikibaseCirrusSearch on wikidata" [01:00:36] ebernhardson: WikibaseCirrusSearch ext should be loaded on Special:Version but isn't as far as I can see [01:02:01] hmm so why it's not loading? I'll wait for 10 mins [01:03:31] i wonder if beta sync is also broken [01:03:53] I see it in /srv/mediawiki and /srv/mediawiki-staging on deployment-deploy01.deployment-prep.eqiad.wmflabs. This implies scap was told to deploy from -staging to the main instance [01:04:22] but logging into deployment-mediawiki-07.deployment-prep.eqiad.wmflabs and checking /srv/mediawiki/wmf-config/InitialiseSettings-labs.php confirms it didn't make it this far [01:04:59] hmm [01:05:29] so is it "wait for sync" situation or "sync broken" situation? 
[01:05:38] SMalyshev: sync broken. I'm going to try manual [01:06:20] SMalyshev: i suppose scap is currently running, maybe it's a very slow sync [01:07:53] SMalyshev: it was just very slow sync, i see it now (will depend which mw app server you hit) [01:08:42] hallelujah, extension is there! [01:09:11] also nothing seems to be terribly broken [01:09:14] which is good [01:09:16] great! [01:09:47] now on to figuring out why my unit tests suddenly broke... [01:10:55] ebernhardson: thanks for your help! [01:13:59] (CR) Smalyshev: [C: +1] [BETA] Enable WikibaseCirrusSearch on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/490647 (owner: Jforrester) [01:40:37] !log power-down cp5006 - T216717 [01:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:40] T216717: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 [01:49:02] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:49:12] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:49:12] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:49:14] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [01:49:16] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:49:18] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:49:24] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:49:26] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [01:49:28] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:49:32] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 
[01:49:34] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:49:34] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:49:40] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:49:52] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [01:49:52] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 70 connecting: (unnamed) not-conn: cp5006_v4, cp5006_v6 [01:50:02] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [01:50:08] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [01:50:08] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp5006_v4, cp5006_v6 [01:50:14] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:50:14] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5006_v4, cp5006_v6 [01:52:11] Operations, ops-eqsin, Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (ayounsi) Swapped A3 and A4 [01:54:22] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [01:54:24] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 72 ESP OK [01:54:26] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [01:54:30] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 72 ESP OK [01:54:32] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK [01:54:32] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [01:54:38] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [01:54:48] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 72 ESP OK [01:54:50] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 72 ESP OK [01:55:00] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 72 ESP OK [01:55:06] 
RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 72 ESP OK [01:55:06] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 72 ESP OK [01:55:12] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [01:55:12] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [01:55:16] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [01:55:24] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 64 ESP OK [01:55:24] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [01:55:26] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 72 ESP OK [01:55:28] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [01:55:28] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [01:58:21] !log power-down cp5007 - T216716 [01:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:26] T216716: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 [02:06:20] Operations, ops-eqsin, Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (ayounsi) Swapped A1 with A2 and A4 with A5 [02:06:26] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [02:06:28] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [02:06:28] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [02:06:42] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [02:06:48] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [02:06:56] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [02:06:56] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [02:07:02] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [02:07:02] PROBLEM - IPsec 
on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [02:07:04] PROBLEM - IPsec on cp1075 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [02:07:08] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [02:07:08] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [02:07:10] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [02:07:22] PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [02:07:28] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp5007_v4, cp5007_v6 [02:07:32] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [02:07:32] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [02:07:32] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5007_v4, cp5007_v6 [02:09:50] RECOVERY - IPsec on cp1081 is OK: Strongswan OK - 56 ESP OK [02:09:54] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 56 ESP OK [02:09:58] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK [02:10:00] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 52 ESP OK [02:10:00] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK [02:10:08] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 56 ESP OK [02:10:08] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 52 ESP OK [02:10:08] RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 56 ESP OK [02:10:22] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 52 ESP OK [02:10:29] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK [02:10:36] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 52 ESP OK [02:10:36] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 52 ESP OK [02:10:42] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 52 ESP OK 
[02:10:44] RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 56 ESP OK [02:10:46] RECOVERY - IPsec on cp1075 is OK: Strongswan OK - 56 ESP OK [02:10:48] RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 56 ESP OK [02:10:50] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 52 ESP OK [02:10:52] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 56 ESP OK [02:14:25] Operations, ops-eqsin, Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (ayounsi) a:ayounsi→RobH [02:14:46] Operations, ops-eqsin, Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (ayounsi) a:ayounsi→RobH [02:58:22] Operations, Mail, Patch-For-Review: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (mmodell) @LarsWirzenius maybe this is why you weren't getting phabricator notifications? [03:26:19] !log delete old gr-1/0/0 from cr1-eqsin - T213121 [03:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:22] T213121: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 [04:25:06] Operations, ops-eqiad, ops-eqsin, netops, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi) [04:30:47] Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Nuria) >but we might be able to hack together an rsync pipeline for that in the short term if we have to I rather k... 
[04:33:24] Operations, ops-eqiad, ops-eqsin, netops, Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi) [05:38:06] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (ayounsi) a:ayounsi→Cmjohnson Talked to Chris, the server is actually in B8 [05:57:06] (CR) Giuseppe Lavagetto: [V: +2 C: +2] sessions: add (dummy) key material for session storage cluster [labs/private] - https://gerrit.wikimedia.org/r/492196 (https://phabricator.wikimedia.org/T215883) (owner: Eevans) [06:03:30] Operations, ops-codfw, DBA: Degraded RAID on db2050 - https://phabricator.wikimedia.org/T216670 (Marostegui) Open→Resolved All good now, thank you! ` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) ` [06:04:23] (PS2) Marostegui: toolsdb: Remove the word temporary from comments [puppet] - https://gerrit.wikimedia.org/r/492024 (https://phabricator.wikimedia.org/T216170) (owner: Bstorm) [06:05:07] (CR) Marostegui: [C: +2] toolsdb: Remove the word temporary from comments [puppet] - https://gerrit.wikimedia.org/r/492024 (https://phabricator.wikimedia.org/T216170) (owner: Bstorm) [06:08:56] (Abandoned) Marostegui: db-eqiad.php: Not use db1103,5:3312 on main traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/491901 (https://phabricator.wikimedia.org/T216656) (owner: Marostegui) [06:10:57] (PS1) Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/492236 [06:11:28] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100% [06:12:18] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/492236 (owner: Marostegui) [06:13:28] (Merged) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - 
https://gerrit.wikimedia.org/r/492236 (owner: Marostegui) [06:14:59] (CR) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/492236 (owner: Marostegui) [06:15:54] !log Stop MySQL on db1087 for kernel and mysql upgrade [06:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:34] (PS1) Vgutierrez: mirrors: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/492239 (https://phabricator.wikimedia.org/T207389) [06:16:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1087 for MySQL upgrade (duration: 02m 53s) [06:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:53] (CR) Vgutierrez: [C: +1] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/14775/" [puppet] - https://gerrit.wikimedia.org/r/492239 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [06:24:19] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - https://gerrit.wikimedia.org/r/492241 [06:26:00] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - https://gerrit.wikimedia.org/r/492241 (owner: Marostegui) [06:27:03] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - https://gerrit.wikimedia.org/r/492241 (owner: Marostegui) [06:27:08] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - https://gerrit.wikimedia.org/r/492241 (owner: Marostegui) [06:30:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1087 after MySQL upgrade (duration: 02m 51s) [06:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:38] PROBLEM - puppet last run on cloudvirt1029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:32:54] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/hhvmadm] [06:33:24] (PS1) Vgutierrez: mail: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/492243 (https://phabricator.wikimedia.org/T207389) [06:36:43] (CR) Vgutierrez: [C: +1] "pcc looking as expected: https://puppet-compiler.wmflabs.org/compiler1002/14776/" [puppet] - https://gerrit.wikimedia.org/r/492243 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [06:39:55] Operations, Scap, Patch-For-Review, Release-Engineering-Team (Watching / External): Upgrade scap debian package to 3.9.0-1 - https://phabricator.wikimedia.org/T216666 (greg) Open→Resolved a:fsero Thanks! [06:41:38] (PS1) Vgutierrez: netbox: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/492244 (https://phabricator.wikimedia.org/T207389) [06:41:51] (PS1) Marostegui: db-eqiad.php: Depool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/492245 [06:42:58] (PS2) Marostegui: db-eqiad.php: Depool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/492245 [06:43:48] (CR) Vgutierrez: [C: +1] "pcc shows the expected changes: https://puppet-compiler.wmflabs.org/compiler1002/14777/" [puppet] - https://gerrit.wikimedia.org/r/492244 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [06:44:35] (CR) Marostegui: [C: +2] db-eqiad.php: Depool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/492245 (owner: Marostegui) [06:45:48] (Merged) jenkins-bot: db-eqiad.php: Depool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/492245 (owner: Marostegui) [06:46:46] (PS1) Vgutierrez: tendril: Deploy acme_chief TLS certificates [puppet] - 
https://gerrit.wikimedia.org/r/492246 (https://phabricator.wikimedia.org/T207389) [06:48:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1013 for MySQL upgrade (duration: 02m 50s) [06:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:01] (CR) Vgutierrez: [C: +1] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/14778/" [puppet] - https://gerrit.wikimedia.org/r/492246 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [06:49:06] (CR) jenkins-bot: db-eqiad.php: Depool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/492245 (owner: Marostegui) [06:49:18] !log Stop MySQL on es1013 to upgrade MySQL [06:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:17] Operations, ops-eqiad, serviceops, HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (Marostegui) This host crashed today again: ` ------------------------------------------------------------------------------- Record: 40 Date/Time: 02/22/2019 0... 
[06:51:19] (CR) Vgutierrez: [C: +2] installserver: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490371 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [06:51:27] (PS2) Vgutierrez: installserver: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490371 (https://phabricator.wikimedia.org/T207389) [06:51:38] !log Power cycle mw1272 as it crashed - T211668 [06:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:41] T211668: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 [06:54:44] RECOVERY - Host mw1272 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [06:56:30] (CR) Vgutierrez: [C: +2] archiva: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490374 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [06:56:41] (PS3) Vgutierrez: archiva: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490374 (https://phabricator.wikimedia.org/T207389) [06:57:48] RECOVERY - puppet last run on cloudvirt1029 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:02] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:04] (PS1) Marostegui: db-eqiad.php: Slowly repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/492248 [07:00:52] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/492248 (owner: Marostegui) [07:01:46] (Merged) jenkins-bot: db-eqiad.php: Slowly repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/492248 (owner: Marostegui) [07:02:44] (CR) Vgutierrez: [C: +2] dumps: Deploy acme_chief TLS certificates [puppet] - https://gerrit.wikimedia.org/r/490376 (https://phabricator.wikimedia.org/T207389) (owner: 
10Vgutierrez) [07:02:54] (03PS2) 10Vgutierrez: dumps: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/490376 (https://phabricator.wikimedia.org/T207389) [07:03:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool es1013 after MySQL upgrade (duration: 00m 45s) [07:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:08] (03CR) 10Vgutierrez: [C: 03+2] gerrit: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/490379 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [07:07:18] (03PS2) 10Vgutierrez: gerrit: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/490379 (https://phabricator.wikimedia.org/T207389) [07:10:43] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10ayounsi) Talking to @elukey about that. In eqiad only rows A and C have the cloud-support vlan: https://netbox.wikimedia.org/ipam/vlans/?q=cl... 
[07:11:42] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492248 (owner: 10Marostegui) [07:15:05] <_joe_> !log deactivating mw1272, memory problems [07:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:14] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (10elukey) Had a chat with Manuel, if possible let's move the host in row A so we have 2 labsdb hosts in there and 2 in row C :) [07:18:47] (03PS1) 10Marostegui: db-eqiad.php: More traffic to es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492250 [07:19:44] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492250 (owner: 10Marostegui) [07:20:50] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492250 (owner: 10Marostegui) [07:20:59] (03CR) 10Vgutierrez: [C: 03+1] "pcc shows NOOP in lvs5001 and lvs5002 and the expected change in lvs5003: https://puppet-compiler.wmflabs.org/compiler1002/14779/" [puppet] - 10https://gerrit.wikimedia.org/r/490525 (https://phabricator.wikimedia.org/T213121) (owner: 10Ayounsi) [07:21:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give more traffic to es1013 after MySQL upgrade (duration: 00m 45s) [07:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:08] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492250 (owner: 10Marostegui) [07:23:38] PROBLEM - mediawiki-installation DSH group on mw1272 is CRITICAL: Host mw1272 is not in mediawiki-installation dsh group [07:27:38] (03PS1) 10Marostegui: db-eqiad.php: Fully repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492251 [07:27:40] (03CR) 
10Vgutierrez: [C: 03+2] icinga: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/490380 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [07:27:48] (03PS2) 10Vgutierrez: icinga: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/490380 (https://phabricator.wikimedia.org/T207389) [07:28:02] !log manually delete WANCache:v:metawiki:translate-groups from memcache on mc1022 to test fix for T203786 [07:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:05] T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [07:34:23] (03CR) 10Vgutierrez: [C: 03+2] librenms: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/490381 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [07:34:31] (03PS2) 10Vgutierrez: librenms: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/490381 (https://phabricator.wikimedia.org/T207389) [07:38:19] (03CR) 10Vgutierrez: [C: 03+2] lists: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/490382 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [07:38:28] (03PS2) 10Vgutierrez: lists: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/490382 (https://phabricator.wikimedia.org/T207389) [07:40:03] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10Joe) As far as I can see we're using packages generated by the following source packages: # `php` - this generates the binary packages `php7.2-bcmath php... 
[07:41:37] (03CR) 10Vgutierrez: [C: 03+2] mirrors: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/492239 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [07:41:45] (03PS2) 10Vgutierrez: mirrors: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/492239 (https://phabricator.wikimedia.org/T207389) [07:43:32] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492251 (owner: 10Marostegui) [07:45:05] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492251 (owner: 10Marostegui) [07:45:49] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492251 (owner: 10Marostegui) [07:46:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool es1013 after MySQL upgrade (duration: 00m 46s) [07:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:05] !log installing krb5 updates for jessie [07:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:05] (03CR) 10Hashar: "recheck" [debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/491951 (https://phabricator.wikimedia.org/T215810) (owner: 10Fsero) [07:52:56] (03CR) 10Vgutierrez: [C: 03+2] mail: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/492243 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [07:53:04] (03PS2) 10Vgutierrez: mail: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/492243 (https://phabricator.wikimedia.org/T207389) [07:55:33] (03CR) 10Hashar: "FTBS!" 
[debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/491951 (https://phabricator.wikimedia.org/T215810) (owner: 10Fsero) [07:57:28] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.19; 2019-02-26), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Some awesome news - the test that I did a... [07:57:38] \o/ ---^ [08:02:45] (03CR) 10Vgutierrez: [C: 03+2] netbox: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/492244 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [08:02:56] (03PS2) 10Vgutierrez: netbox: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/492244 (https://phabricator.wikimedia.org/T207389) [08:05:13] (03CR) 10Hashar: "recheck" [debs/tideways-xhprof] - 10https://gerrit.wikimedia.org/r/491515 (owner: 10Giuseppe Lavagetto) [08:07:05] (03CR) 10Vgutierrez: [C: 03+2] tendril: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/492246 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [08:07:15] (03PS2) 10Vgutierrez: tendril: Deploy acme_chief TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/492246 (https://phabricator.wikimedia.org/T207389) [08:09:52] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "It seems this became obsolete by now, or at least it does have a conflict now. Do you want to redo it?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/455188 (owner: 10Aklapper) [08:11:33] (03CR) 10Mathew.onipe: "> Patch Set 7:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [08:12:35] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review: Add CI to all operations/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180330 (10hashar) [08:15:59] (03PS8) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) [08:18:15] !log temporarily stop prometheus global on prometheus2004 to take a snapshot [08:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:01] !log installing uriparser security updates [08:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:17] (03CR) 10Mathew.onipe: cloudelastic: Add cloudelastic configs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [08:24:47] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [08:25:08] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [08:27:45] (03PS1) 10Filippo Giunchedi: mirrors: move to syncproxy2.wna.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/492257 [08:28:03] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) I imported all the cables except the servers' uplinks, see: https://netbox.wikimedia.org/dcim/cables/?page=6 (and previous pages) I ended up using the circuits feature of Netbox instead of the previously mentione... 
[08:29:56] (03PS2) 10Filippo Giunchedi: mirrors: move to syncproxy2.wna.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/492257 [08:32:19] (03PS1) 10Vgutierrez: install_server: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492259 (https://phabricator.wikimedia.org/T207389) [08:32:22] (03PS1) 10Vgutierrez: install_server: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492260 (https://phabricator.wikimedia.org/T207389) [08:35:50] (03CR) 10Vgutierrez: [C: 03+1] "as shown in https://phabricator.wikimedia.org/T207389#4974970 the acme-chief certificates are already in place as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/492259 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [08:38:04] (03CR) 10Vgutierrez: [C: 03+1] "pcc shows the expected changes https://puppet-compiler.wmflabs.org/compiler1002/14780/" [puppet] - 10https://gerrit.wikimedia.org/r/492260 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [08:39:15] (03CR) 10Muehlenhoff: [C: 03+1] mirrors: move to syncproxy2.wna.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/492257 (owner: 10Filippo Giunchedi) [08:39:17] (03CR) 10jenkins-bot: elasticsearch: fix typo (xarg instead of xargs) [software/spicerack] - 10https://gerrit.wikimedia.org/r/491960 (https://phabricator.wikimedia.org/T207920) (owner: 10Gehel) [08:42:42] (03PS1) 10Vgutierrez: archiva: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492264 (https://phabricator.wikimedia.org/T207389) [08:42:44] (03PS1) 10Vgutierrez: archiva: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492265 (https://phabricator.wikimedia.org/T207389) [08:43:25] (03PS2) 10Gehel: elasticsearch: check size of all replicas, not just primary shards [puppet] - 10https://gerrit.wikimedia.org/r/490852 [08:43:45] (03Abandoned) 10Gehel: elasticsearch: check size of all replicas, 
not just primary shards [puppet] - 10https://gerrit.wikimedia.org/r/490852 (owner: 10Gehel) [08:45:33] (03CR) 10Filippo Giunchedi: [C: 03+2] mirrors: move to syncproxy2.wna.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/492257 (owner: 10Filippo Giunchedi) [08:46:54] (03PS1) 10Gehel: elasticsearch: upgrade elasticsearch / cirrus to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/492266 (https://phabricator.wikimedia.org/T215931) [08:50:48] (03CR) 10DCausse: [C: 03+1] elasticsearch: upgrade elasticsearch / cirrus to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/492266 (https://phabricator.wikimedia.org/T215931) (owner: 10Gehel) [08:51:38] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. [08:51:39] (03CR) 10Vgutierrez: [C: 03+1] "as shown in https://phabricator.wikimedia.org/T207389#4974970 acme-chief certificates have been deployed successfully in the affected serv" [puppet] - 10https://gerrit.wikimedia.org/r/492264 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [08:52:03] !log force ftpsync run on sodium after debian mirror update [08:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:42] (03CR) 10Gehel: [C: 03+2] elasticsearch: upgrade elasticsearch / cirrus to 5.6.14 [puppet] - 10https://gerrit.wikimedia.org/r/492266 (https://phabricator.wikimedia.org/T215931) (owner: 10Gehel) [08:58:24] (03CR) 10Vgutierrez: [C: 03+1] "pcc shows the expected changes: https://puppet-compiler.wmflabs.org/compiler1001/14782/" [puppet] - 10https://gerrit.wikimedia.org/r/492265 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [09:04:23] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10jcrespo) I am creating a snapshot right now for testing purposes, will run a dumping process next. 
[09:11:38] 10Operations: Rollout of updated microcode for Westmere-EP CPUs - https://phabricator.wikimedia.org/T216802 (10MoritzMuehlenhoff) [09:13:18] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10MoritzMuehlenhoff) [09:16:34] !log starting rolling upgrade on elasticsearch / cirrus / eqiad - T215931 [09:16:36] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [09:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:37] T215931: Upgrade elasticsearch to 5.6.14 - https://phabricator.wikimedia.org/T215931 [09:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:45] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided) [09:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:59] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 14s) [09:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:13] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=99) [09:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:34] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10hashar) May we update our base Docker container as well? Looking at `docker-registry.wikimedia.org/wikimedia-stretch:latest` (29397cdce9f7): ` base-files/stable 9.9+deb9u8 amd64 [upgradable from: 9.9+deb9u6] gpgv/stab... 
[09:19:32] (03PS1) 10Alexandros Kosiaris: Add initialize_service.sh tool [deployment-charts] - 10https://gerrit.wikimedia.org/r/492269 [09:22:02] !log updated tor packages to 0.3.5.8-1~d90.stretch+1 [09:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:41] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided) [09:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:57] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 16s) [09:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:41] 10Operations, 10Multimedia, 10Thumbor, 10serviceops, and 2 others: Deploy 3d2png to thumbor servers (stretch) - https://phabricator.wikimedia.org/T216494 (10Gilles) This patch works and I successfully deployed it on thumbor2002 and thumbor1004, where 3d2png now works on Stretch. [09:26:48] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [09:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:49] <_joe_> !log set pooled=inactive on mw1272, T211668 [09:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:53] T211668: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 [09:33:27] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [09:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:34] !log installing tor security update on torrelay1001 [09:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:39] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [09:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:26] (03PS1) 10Muehlenhoff: Remove reprepro config for tor on jessie 
[puppet] - 10https://gerrit.wikimedia.org/r/492271 [09:38:50] !log akosiaris@deploy1001 scap-helm citoid install -n staging -f citoid-staging-values.yaml stable/citoid [namespace: citoid, clusters: staging] [09:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:48] (03PS2) 10Muehlenhoff: Remove reprepro config for tor on jessie [puppet] - 10https://gerrit.wikimedia.org/r/492271 [09:46:38] (03PS3) 10Muehlenhoff: Remove reprepro config for tor on jessie [puppet] - 10https://gerrit.wikimedia.org/r/492271 [09:47:47] (03CR) 10Muehlenhoff: [C: 03+2] Remove reprepro config for tor on jessie [puppet] - 10https://gerrit.wikimedia.org/r/492271 (owner: 10Muehlenhoff) [09:51:08] !log fixed package state on mw2167 [09:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:15] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [09:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:58] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10fgiunchedi) >>! In T204567#4972487, @Papaul wrote: > @fgiunchedi is it possible to depool this server for me to do a firmware upgrade before I resolve the task? Yes, a clean shutdown of the host is enough,... 
[09:53:04] RECOVERY - puppet last run on thumbor1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:54:28] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:55:17] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [09:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:33] (03PS1) 10Alexandros Kosiaris: Add citoid, cxserver kubernetes tokens [puppet] - 10https://gerrit.wikimedia.org/r/492273 (https://phabricator.wikimedia.org/T213194) [10:02:28] (03PS1) 10Marostegui: analytics-grants.sql: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/492275 (https://phabricator.wikimedia.org/T216491) [10:10:55] 10Operations, 10ops-eqiad: ms-be1033 down and not powering up - https://phabricator.wikimedia.org/T215998 (10fgiunchedi) Thanks @Cmjohnson for digging into it! AFAICS the host came back fine, I guess we'll wait and see if it happens again. In the meantime I'm fine with stalling/resolving this task if that work... [10:15:29] !log Pooling thumbor1004 after upgrade - T214597 [10:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:33] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 [10:18:14] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10MoritzMuehlenhoff) >>! In T216712#4974896, @Joe wrote: > As far as I can see we're using packages generated by the following source packages: > > # `php`... 
[10:35:55] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [10:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:12] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [10:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:46] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [10:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:23] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10Qgil) [10:49:43] (03PS50) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [10:51:31] (03PS51) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [10:52:10] (03CR) 10Jbond: [C: 03+2] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [10:53:14] (03PS2) 10Marostegui: analytics-grants.sql: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/492275 (https://phabricator.wikimedia.org/T216491) [10:53:50] (03CR) 10Marostegui: [C: 03+2] analytics-grants.sql: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/492275 (https://phabricator.wikimedia.org/T216491) (owner: 10Marostegui) [10:55:27] (03CR) 10Elukey: Monitor stream.wikimedia.org public endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) (owner: 10Ottomata) [10:58:27] (03PS1) 10Vgutierrez: dumps: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492280 
(https://phabricator.wikimedia.org/T207389) [10:58:30] (03PS1) 10Vgutierrez: dumps: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492281 (https://phabricator.wikimedia.org/T207389) [11:01:21] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.19; 2019-02-26), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) It definitely keeps happening: {F28263176} [11:01:23] !log swift eqiad set thumbor write ACLs for wikipedia-meta-local-thumb [11:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:06] (03CR) 10Vgutierrez: [C: 03+1] "As shown in https://phabricator.wikimedia.org/T207389#4974970 the acme-chief certificates have been successfully deployed to the affected " [puppet] - 10https://gerrit.wikimedia.org/r/492280 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:02:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) 05Open→03Resolved MySQL will be stopped the 4th of March as a final part of the deprecation of this host. It has been on read on... 
[11:07:04] (03CR) 10Vgutierrez: [C: 03+1] "pcc shows the expected changes: https://puppet-compiler.wmflabs.org/compiler1002/14784/" [puppet] - 10https://gerrit.wikimedia.org/r/492281 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:18:45] (03PS1) 10Vgutierrez: gerrit: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492283 (https://phabricator.wikimedia.org/T207389) [11:18:47] (03PS1) 10Vgutierrez: gerrit: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492284 (https://phabricator.wikimedia.org/T207389) [11:20:20] !log imported intel-microcode 3.20180807a.2 for jessie-wikimedia (T216802) [11:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:24] T216802: Rollout of updated microcode for Westmere-EP CPUs - https://phabricator.wikimedia.org/T216802 [11:21:01] (03CR) 10jerkins-bot: [V: 04-1] gerrit: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492284 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:21:30] (03CR) 10Vgutierrez: [C: 03+1] "As shown in https://phabricator.wikimedia.org/T207389#4974970 the acme-chief certificates have been successfully deployed in the affected " [puppet] - 10https://gerrit.wikimedia.org/r/492283 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:23:03] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) FYI it crashed again: ` -------------------------------------------------------------------------------- SeqNumber = 481 Message ID = SYS1003 Category = A... 
[11:23:53] (03PS2) 10Vgutierrez: gerrit: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492284 (https://phabricator.wikimedia.org/T207389) [11:25:06] (03PS2) 10Jbond: Add password reset function to ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 [11:29:43] (03CR) 10Vgutierrez: [C: 03+1] "pcc looking as expected: https://puppet-compiler.wmflabs.org/compiler1002/14785/" [puppet] - 10https://gerrit.wikimedia.org/r/492284 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:31:34] (03PS3) 10Jbond: Add password reset function to ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 [11:32:53] !log Pooling thumbor2002 after upgrade - T214597 [11:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:57] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 [11:33:48] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [11:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:01] !log rebooting cp1008 for some microcode test [11:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:09] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [11:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:35] (03CR) 10jerkins-bot: [V: 04-1] Add password reset function to ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 (owner: 10Jbond) [11:36:48] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops: Meta Swift container rights incorrect for thumbor user - https://phabricator.wikimedia.org/T216807 (10jijiki) [11:36:50] (03PS1) 10Vgutierrez: icinga: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492285 (https://phabricator.wikimedia.org/T207389) [11:36:52] (03PS1) 
10Vgutierrez: icinga: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492286 (https://phabricator.wikimedia.org/T207389) [11:37:19] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops: Meta Swift container rights incorrect for thumbor user - https://phabricator.wikimedia.org/T216807 (10jijiki) [11:37:25] (03PS4) 10Jbond: Add password reset function to ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 [11:37:27] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10jijiki) [11:41:17] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [11:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:41] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [11:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:08] (03CR) 10Vgutierrez: [C: 03+1] "As shown in https://phabricator.wikimedia.org/T207389#4974970 icinga2001 got the acme-chief certificates as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/492285 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:45:52] (03CR) 10jerkins-bot: [V: 04-1] Add password reset function to ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 (owner: 10Jbond) [11:46:57] 10Operations, 10Thumbor, 10serviceops: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) p:05Triage→03Normal [11:47:14] (03CR) 10Vgutierrez: [C: 03+1] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/14787/" [puppet] - 10https://gerrit.wikimedia.org/r/492286 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:50:28] (03PS5) 10Jbond: Add password reset function to ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 [11:53:31] (03PS1) 
10Vgutierrez: librenms: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492288 (https://phabricator.wikimedia.org/T207389) [11:53:33] (03PS1) 10Vgutierrez: librenms: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492289 (https://phabricator.wikimedia.org/T207389) [11:54:32] !log various reboots of servers with Westmere-EP CPUs to pick up updated microcode to address SSBD/L1TF [11:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:40] (03CR) 10jerkins-bot: [V: 04-1] Add password reset function to ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/492026 (owner: 10Jbond) [11:55:33] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: Upgrade acme-chief to run in debian buster - https://phabricator.wikimedia.org/T215925 (10Vgutierrez) 05Open→03Resolved acme-chief is running successfully in debian buster :) [11:57:54] (03CR) 10Vgutierrez: [C: 03+1] "As shown in https://phabricator.wikimedia.org/T207389#4974970 the acme-chief certificates have been successfully deployed in the affected " [puppet] - 10https://gerrit.wikimedia.org/r/492288 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [12:04:05] 10Operations, 10Thumbor, 10serviceops: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [12:04:11] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [12:04:16] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [12:04:42] (03CR) 10Vgutierrez: [C: 03+1] "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/14788/" [puppet] - 10https://gerrit.wikimedia.org/r/492289 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [12:08:21] 10Operations, 10Wikimedia-SVG-rendering, 
10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [12:12:39] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [12:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:58] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [12:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:26] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [12:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:27] (03PS1) 10Vgutierrez: lists: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492292 (https://phabricator.wikimedia.org/T207389) [12:16:29] (03PS1) 10Vgutierrez: lists: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492293 (https://phabricator.wikimedia.org/T207389) [12:17:31] !log rebooting tungsten to pick up updated microcode to address SSBD/L1TF [12:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:54] (03CR) 10Vgutierrez: [C: 03+1] "As shown in https://phabricator.wikimedia.org/T207389#4974970 acme-chief certificates have been successfully deployed to the involved serv" [puppet] - 10https://gerrit.wikimedia.org/r/492292 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [12:22:57] (03CR) 10Vgutierrez: [C: 03+1] "pcc shows the expected changes: https://puppet-compiler.wmflabs.org/compiler1002/14789/" [puppet] - 10https://gerrit.wikimedia.org/r/492293 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [12:25:54] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:28:24] RECOVERY - MediaWiki 
exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:28:39] (03PS1) 10Vgutierrez: mirrors: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492295 (https://phabricator.wikimedia.org/T207389) [12:28:41] (03PS1) 10Vgutierrez: mirrors: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492296 (https://phabricator.wikimedia.org/T207389) [12:31:20] what was that spike? [12:31:41] (03CR) 10Vgutierrez: [C: 03+1] "As shown in https://phabricator.wikimedia.org/T207389#4974970 the acme-chief certificates have been successfully deployed to the affected " [puppet] - 10https://gerrit.wikimedia.org/r/492295 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [12:32:26] a rise in 500 [12:32:48] and there is a high rise of 404 since this morning, too [12:33:02] double the normal rate [12:33:31] also a 3x rise in requests [12:34:26] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:34:52] (03CR) 10Vgutierrez: [C: 03+1] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/14790/sodium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/492296 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [12:37:30] ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] amusso spam of changes. No worries. 
https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:40:50] (03PS1) 10Gehel: elasticsearch: allow starting the operations without waiting for green [cookbooks] - 10https://gerrit.wikimedia.org/r/492299 [12:42:36] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: Allow specifying a custom period of time before deploying a newly issued certificate - https://phabricator.wikimedia.org/T213737 (10Krenair) Is this resolved now? Also, duplicate of {T204997} ? [12:43:41] !log rebooting auth1002 for kernel update [12:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:58] 10Operations, 10Acme-chief, 10Traffic, 10Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Vgutierrez) [12:44:04] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: Allow specifying a custom period of time before deploying a newly issued certificate - https://phabricator.wikimedia.org/T213737 (10Vgutierrez) 05Open→03Resolved yes, this has been included as part of the latest release [12:45:20] (03PS1) 10Alexandros Kosiaris: Introduce cxserver helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/492301 (https://phabricator.wikimedia.org/T213195) [12:45:38] 10Operations, 10Acme-chief, 10Traffic: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (10Krenair) Has this been solved by {T213737}? 
[12:48:48] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [12:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:48] (03PS2) 10Gehel: elasticsearch: allow starting the operations without waiting for green [cookbooks] - 10https://gerrit.wikimedia.org/r/492299 [12:51:55] (03CR) 10DCausse: [C: 03+1] elasticsearch: allow starting the operations without waiting for green [cookbooks] - 10https://gerrit.wikimedia.org/r/492299 (owner: 10Gehel) [12:52:25] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: allow starting the operations without waiting for green [cookbooks] - 10https://gerrit.wikimedia.org/r/492299 (owner: 10Gehel) [12:52:58] RECOVERY - Check systemd state on labtestmetal2001 is OK: OK - running: The system is fully operational [12:54:38] (03CR) 10Gehel: [C: 03+2] elasticsearch: allow starting the operations without waiting for green [cookbooks] - 10https://gerrit.wikimedia.org/r/492299 (owner: 10Gehel) [12:54:46] (03CR) 10Gehel: [C: 03+2] elasticsearch: provide a better datetime format example [cookbooks] - 10https://gerrit.wikimedia.org/r/491991 (owner: 10Gehel) [12:56:03] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [12:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:59] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:11] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:03] 10Operations, 10ops-codfw: Degraded RAID on heze-array1 - https://phabricator.wikimedia.org/T206909 (10MoritzMuehlenhoff) JFTR, the disks are now in such a poor state that errors are being thrown by the OS, when installing a software update is was displaying errors like: ` 
/dev/bacula/baculasd2: read fai... [13:04:02] 10Operations, 10ops-codfw: Degraded RAID on heze-array1 - https://phabricator.wikimedia.org/T206909 (10jijiki) 05Resolved→03Open [13:09:10] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:15] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:16] (03PS1) 10Gehel: [WIP] log relocating shards during cluster restart [software/spicerack] - 10https://gerrit.wikimedia.org/r/492307 [13:11:00] 10Operations: Rollout of updated microcode for Westmere-EP CPUs - https://phabricator.wikimedia.org/T216802 (10MoritzMuehlenhoff) These servers are running trusty and they are so old that they will be decomissioned when they are phased out along with Trusty, they won't get updated (plus, there's no intel-microco... [13:16:10] (03CR) 10jerkins-bot: [V: 04-1] [WIP] log relocating shards during cluster restart [software/spicerack] - 10https://gerrit.wikimedia.org/r/492307 (owner: 10Gehel) [13:16:47] (03PS1) 10Gehel: elasticsearch: upgrade rows one after the other [software/spicerack] - 10https://gerrit.wikimedia.org/r/492308 [13:16:53] 10Operations, 10Acme-chief, 10Traffic: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Indeed. 
[13:17:31] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:17] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:36] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:21:44] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10fgiunchedi) >>! In T213976#4974604, @Nuria wrote: >>but we might be able to hack together an rsync pipeline for tha... [13:23:39] (03PS1) 10Jcrespo: mariadb: Remove tls configuration from the client [puppet] - 10https://gerrit.wikimedia.org/r/492309 [13:23:50] (03PS1) 10Vgutierrez: exim: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492310 (https://phabricator.wikimedia.org/T207389) [13:23:53] (03PS1) 10Vgutierrez: mail: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492311 (https://phabricator.wikimedia.org/T207389) [13:24:52] (03PS2) 10Jcrespo: mariadb: Remove tls configuration from the client [puppet] - 10https://gerrit.wikimedia.org/r/492309 [13:25:18] (03CR) 10Vgutierrez: [C: 03+1] "As shown in https://phabricator.wikimedia.org/T207389#4974970 the acme-chief certificates have been successfully deployed in the affected " [puppet] - 10https://gerrit.wikimedia.org/r/492310 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [13:25:24] !log installing wireshark security updates [13:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:57] (03PS1) 
10GTirloni: Revert "ircecho: Convert script to python3" [puppet] - 10https://gerrit.wikimedia.org/r/492312 (https://phabricator.wikimedia.org/T215416) [13:26:09] (03CR) 10Jcrespo: "Controversial patch, which I tried to avoid, but I cave to practicality." [puppet] - 10https://gerrit.wikimedia.org/r/492309 (owner: 10Jcrespo) [13:26:57] (03CR) 10Jcrespo: "It was also technically wrong, as it assumes the client should be using the same certificate than the server, which should not be the case" [puppet] - 10https://gerrit.wikimedia.org/r/492309 (owner: 10Jcrespo) [13:28:00] (03CR) 10Vgutierrez: [C: 03+1] "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/14791/" [puppet] - 10https://gerrit.wikimedia.org/r/492311 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [13:30:59] (03CR) 10GTirloni: [C: 03+2] Revert "ircecho: Convert script to python3" [puppet] - 10https://gerrit.wikimedia.org/r/492312 (https://phabricator.wikimedia.org/T215416) (owner: 10GTirloni) [13:35:35] (03PS2) 10Vgutierrez: lists: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492292 (https://phabricator.wikimedia.org/T207389) [13:35:37] (03PS2) 10Vgutierrez: lists: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492293 (https://phabricator.wikimedia.org/T207389) [13:35:39] (03PS2) 10Vgutierrez: mirrors: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492295 (https://phabricator.wikimedia.org/T207389) [13:35:41] (03PS2) 10Vgutierrez: mirrors: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492296 (https://phabricator.wikimedia.org/T207389) [13:35:43] (03PS2) 10Vgutierrez: exim: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492310 (https://phabricator.wikimedia.org/T207389) [13:35:45] (03PS2) 10Vgutierrez: mail: Get rid of certcentral certificates [puppet] - 
10https://gerrit.wikimedia.org/r/492311 (https://phabricator.wikimedia.org/T207389) [13:36:32] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:36:45] 10Operations, 10Mail, 10Patch-For-Review: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10LarsWirzenius) @mmodell Yes, that is my conclusion as well. My apologies if I hadn't communicated that. [13:38:57] 10Operations, 10Mail, 10Patch-For-Review: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10mmodell) @LarsWirzenius I'm embarassed now, I didn't notice that it was you who filed the task :-o [13:39:31] (03CR) 10Paladox: "The package is in debian stretch though https://packages.debian.org/stretch/python3-irc" [puppet] - 10https://gerrit.wikimedia.org/r/492312 (https://phabricator.wikimedia.org/T215416) (owner: 10GTirloni) [13:40:54] (03PS1) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/492314 [13:43:36] (03PS1) 10Vgutierrez: netbox: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492317 (https://phabricator.wikimedia.org/T207389) [13:43:38] (03PS1) 10Vgutierrez: netbox: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492318 (https://phabricator.wikimedia.org/T207389) [13:43:48] (03PS1) 10Filippo Giunchedi: deployment-prep: use deployment-prometheus02 [puppet] - 10https://gerrit.wikimedia.org/r/492319 [13:44:19] volunteers to review ^ ? 
simple enough / beta-only [13:45:58] (03CR) 10Vgutierrez: [C: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/492310 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [13:46:12] (03CR) 10Vgutierrez: [C: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/492311 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [13:47:12] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:54] (03CR) 10Vgutierrez: [C: 03+1] "As shown in https://phabricator.wikimedia.org/T207389#4974970 the acme-chief certificates have been successfully deployed in the affected " [puppet] - 10https://gerrit.wikimedia.org/r/492318 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [13:48:18] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.119 second response time [13:49:14] (03CR) 10Vgutierrez: [C: 03+1] "As shown in https://phabricator.wikimedia.org/T207389#4974970 the acme-chief certificates have been successfully deployed in the affected " [puppet] - 10https://gerrit.wikimedia.org/r/492317 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [13:50:31] (03PS1) 10Muehlenhoff: mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 [13:50:34] (03CR) 10GTirloni: [C: 04-1] "shinken is running on Jessie. shinken is not available on Stretch." 
[puppet] - 10https://gerrit.wikimedia.org/r/492314 (owner: 10Paladox) [13:51:11] (03CR) 10Vgutierrez: [C: 03+1] "pcc looks happy and shows the expected changes: https://puppet-compiler.wmflabs.org/compiler1002/14792/" [puppet] - 10https://gerrit.wikimedia.org/r/492318 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [13:52:54] (03CR) 10Filippo Giunchedi: [C: 03+2] deployment-prep: use deployment-prometheus02 [puppet] - 10https://gerrit.wikimedia.org/r/492319 (owner: 10Filippo Giunchedi) [13:53:02] (03PS2) 10Filippo Giunchedi: deployment-prep: use deployment-prometheus02 [puppet] - 10https://gerrit.wikimedia.org/r/492319 [13:54:05] (03CR) 10Marostegui: "misc, dbstore_multiinstance etc?" [puppet] - 10https://gerrit.wikimedia.org/r/492309 (owner: 10Jcrespo) [13:54:11] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] deployment-prep: use deployment-prometheus02 [puppet] - 10https://gerrit.wikimedia.org/r/492319 (owner: 10Filippo Giunchedi) [13:54:18] !log reboot helium for kernel/microcode updates [13:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:50] PROBLEM - Host helium is DOWN: PING CRITICAL - Packet loss = 100% [13:56:18] RECOVERY - Host helium is UP: PING WARNING - Packet loss = 66%, RTA = 37.28 ms [13:56:52] (03CR) 10Jcrespo: [C: 03+1] "Looks good, although needs compiler check before deploy" [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [13:57:40] (03CR) 10Jcrespo: "I can do those separately or here, I wanted you hear any opinion first 0:-)" [puppet] - 10https://gerrit.wikimedia.org/r/492309 (owner: 10Jcrespo) [13:58:14] (03CR) 10Marostegui: [C: 03+1] "My opinion is a +1!" 
[puppet] - 10https://gerrit.wikimedia.org/r/492309 (owner: 10Jcrespo) [13:58:21] (03PS1) 10Vgutierrez: tendril: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492326 (https://phabricator.wikimedia.org/T207389) [13:58:23] (03PS1) 10Vgutierrez: tendril: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492327 (https://phabricator.wikimedia.org/T207389) [14:00:17] (03CR) 10Vgutierrez: [C: 03+1] "As shown in https://phabricator.wikimedia.org/T207389#4974970 the acme-chief certificates have been deployed successfully in the affected " [puppet] - 10https://gerrit.wikimedia.org/r/492326 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:01:03] (03PS2) 10Gehel: elasticsearch: upgrade rows one after the other [software/spicerack] - 10https://gerrit.wikimedia.org/r/492308 [14:01:46] (03CR) 10Jcrespo: "I am not going to merge this on a Friday, but do you want me to do the others at the same time or separately?" 
[puppet] - 10https://gerrit.wikimedia.org/r/492309 (owner: 10Jcrespo) [14:02:04] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [14:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:15] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [14:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:47] jouncebot: now [14:02:47] No deployments scheduled for the next 68 hour(s) and 27 minute(s) [14:03:51] !log removed labvirt1008 from debmonitor (T216661) [14:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:55] T216661: cloudVPS: drain and rebuild labvirt1008 as cloudvirt1008 - https://phabricator.wikimedia.org/T216661 [14:05:24] Is there a way I can have something deployed its a really easy patch that shouldnt cause issues my schedule just doesn't allow me to easily make it to a SWAT window [14:05:57] (03CR) 10Vgutierrez: [C: 03+1] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/14793/" [puppet] - 10https://gerrit.wikimedia.org/r/492327 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:08:21] (03PS1) 10Jcrespo: tendril: Use strong cipher for tendril [puppet] - 10https://gerrit.wikimedia.org/r/492329 [14:08:39] (03PS2) 10Jcrespo: tendril: Use strong cipher for tendril [puppet] - 10https://gerrit.wikimedia.org/r/492329 [14:12:46] (03CR) 10Vgutierrez: [C: 03+1] "I'm happy, pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/14794/" [puppet] - 10https://gerrit.wikimedia.org/r/492329 (owner: 10Jcrespo) [14:14:06] (03CR) 10Vgutierrez: [C: 03+2] install_server: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492259 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:14:23] (03PS2) 10Vgutierrez: install_server: Switch from certcentral to acme-chief certificates [puppet] - 
10https://gerrit.wikimedia.org/r/492259 (https://phabricator.wikimedia.org/T207389) [14:15:32] (03CR) 10Jcrespo: [C: 03+2] tendril: Use strong cipher for tendril [puppet] - 10https://gerrit.wikimedia.org/r/492329 (owner: 10Jcrespo) [14:16:00] sigh.. merging stuff it's going to be hard today [14:16:29] vgutierrez: still slow eh? I see test-prio is empty atm at https://integration.wikimedia.org/zuul/ [14:17:12] (03PS3) 10Vgutierrez: install_server: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492259 (https://phabricator.wikimedia.org/T207389) [14:18:43] jynus: do you mind if I merge your tendril patch? [14:18:57] I was on it, do you want to do it? [14:19:10] I got a prompt asking for it along with one of my patches [14:19:11] ;P [14:19:18] go on, then [14:19:25] I will check its application [14:19:53] on dbmonitor, not to be confused with debmonitor [14:19:57] done [14:20:05] even if both will host flask applications [14:20:45] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10MoritzMuehlenhoff) >>! In T216384#4975063, @hashar wrote: > May we update our base Docker container as well? Looking at `docker-registry.wikimedia.org/wikimedia-stretch:latest` (29397cdce9f7): Agreed, that makes sens... [14:20:53] Notice: /Stage[main]/Httpd/Service[apache2]: Triggered 'refresh' from 1 events [14:21:25] so tendril is still ugly, that patch didn't fix that! 
[14:22:52] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [14:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:00] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [14:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:20] jynus: there is nothing wrong with tendril being ugly [14:23:44] (03PS11) 10Eevans: Initial configuration for session storage service [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) [14:23:55] Oh, I think https://dbtree.wikimedia.org/ is also served there [14:24:12] not sure actually [14:24:20] I know the backend is the same [14:24:34] (03CR) 10Vgutierrez: [C: 03+2] install_server: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492260 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:24:44] (03PS2) 10Vgutierrez: install_server: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492260 (https://phabricator.wikimedia.org/T207389) [14:28:07] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) Ok! @bmansurov it sounds like we might be blocking you for a bit more while we solve this problem. I'm... [14:31:06] (03CR) 10Vgutierrez: [C: 03+2] archiva: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492264 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:31:11] (03PS2) 10Vgutierrez: archiva: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492264 (https://phabricator.wikimedia.org/T207389) [14:33:24] (03CR) 10Ottomata: "Sorry about that. I was merging to test in beta, but then beta was in readonly mode." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/492217 (owner: 10EBernhardson) [14:34:05] 10Operations, 10Epic, 10Maps (Kartotherian): Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10Mathew.onipe) [14:34:14] 10Operations, 10Epic, 10Maps (Kartotherian): Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10Mathew.onipe) p:05Triage→03Normal [14:35:09] (03CR) 10Vgutierrez: [C: 03+2] archiva: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492265 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:35:35] (03PS2) 10Vgutierrez: archiva: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492265 (https://phabricator.wikimedia.org/T207389) [14:36:18] (03CR) 10Ottomata: "In IRC volans wrote:" [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) (owner: 10Ottomata) [14:39:32] (03PS4) 10Jcrespo: mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) [14:39:48] (03PS3) 10Ottomata: Monitor stream.wikimedia.org public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) [14:44:29] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10akosiaris) @Cmjohnson Any news on this ? [14:47:22] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10bmansurov) @Ottomata OK, thank you! [14:47:29] 10Operations, 10ops-codfw: Degraded RAID on heze-array1 - https://phabricator.wikimedia.org/T206909 (10akosiaris) 05Open→03Resolved Re-closing per T206909#4734830 backup2001 is actually setup and working fine. 
I was waiting on T196478 to resume working on both of backup2001 and backup1001 together and t... [14:50:26] (03CR) 10KartikMistry: Introduce cxserver helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/492301 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris) [14:53:06] 10Operations, 10Mail, 10Patch-For-Review: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10herron) 05Open→03Resolved a:03herron Looking much better now! ` Received-SPF: pass (google.com: domain of no-reply@phabricator.wikimedia.o... [14:56:39] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [14:58:56] 10Operations, 10Mail, 10Patch-For-Review: gmail considers all Phabricator email to be spam due to missing SPF record - https://phabricator.wikimedia.org/T216714 (10LarsWirzenius) I've not had any Phabricator mail end up in the spam folder since yesterday! So I confirm it seems to work. Thank you! 
[15:03:19] (03CR) 10Elukey: Monitor stream.wikimedia.org public endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) (owner: 10Ottomata) [15:09:40] 10Operations: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832 (10MoritzMuehlenhoff) [15:09:56] 10Operations: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832 (10MoritzMuehlenhoff) p:05Triage→03Low [15:11:06] (03CR) 10Vgutierrez: [C: 03+2] dumps: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492280 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [15:11:12] (03PS2) 10Vgutierrez: dumps: Switch from certcentral to acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/492280 (https://phabricator.wikimedia.org/T207389) [15:15:01] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [15:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:06] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [15:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:58] 10Operations: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832 (10MoritzMuehlenhoff) I did a little code research: In Debian releases up to Stretch, the su binary is built from the shadow source package. But starting with Buster it switched to the su implementati... 
[15:19:07] (03CR) 10Vgutierrez: [C: 03+2] dumps: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492281 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [15:19:16] (03PS2) 10Vgutierrez: dumps: Get rid of certcentral certificates [puppet] - 10https://gerrit.wikimedia.org/r/492281 (https://phabricator.wikimedia.org/T207389) [15:21:41] gehel: o/ - do you still want to get notified when eventbus throws some errors (like message payload too big etc..) ? [15:22:09] (we have a specific grafana based alert) [15:22:42] elukey: I'm keeping an eye on the graph [15:22:57] and making sure it goes down before the next batch of servers [15:23:30] yep yep, I mean the POST 4XX statuses in eventbus (I think you are referring to the consumer lag right?) [15:23:32] elukey: let me know if this is putting problematic load on kafka, but the lag in itself is not a worry from my side [15:23:37] yep [15:23:57] :O [15:23:58] POST 4XX are the messages too large? [15:23:59] nono the only errors that I see now is sometimes POSTs ending up in 4xx, [15:24:03] yeah [15:24:33] nah, erik and david know about those, there are no immediate actions to be taken [15:24:40] ack then :) [15:25:00] I'll need to do a reindex after the upgrade if we care about those updates (spoiler: we probably do) [15:25:24] (03PS1) 10Eevans: sessions: updated key material [labs/private] - 10https://gerrit.wikimedia.org/r/492338 (https://phabricator.wikimedia.org/T215883) [15:29:51] 10Operations, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10Aklapper) [15:30:24] (03PS3) 10Kosta Harlan: GrowthExperiments: Soft launch of help panel on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489729 (https://phabricator.wikimedia.org/T215666) [15:30:47] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - 
https://phabricator.wikimedia.org/T214760 (10RobH) Interesting! So a new CPU2 from Dell is throwing errors, and it replaced another CPU that was throwing errors. [15:32:23] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [15:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:43] (03PS13) 10Ammarpad: Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) [15:33:05] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [15:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:36] (03PS6) 10Ammarpad: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 [15:34:01] (03PS5) 10Ammarpad: Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) [15:35:58] Hi, can anyone run namespaceDupes.php on ckbwiki for T216806 [15:35:59] T216806: Run script to fix inconsistent titles for Kurdish (Sorani) Wikipedia - https://phabricator.wikimedia.org/T216806 [15:36:11] (03PS5) 10Ammarpad: Add new namespaces for th.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491054 (https://phabricator.wikimedia.org/T216322) [15:36:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga2001 is CRITICAL: 1.002e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:37:46] (03PS6) 10Bstorm: wiki replicas: Remove reference to old comment fields [puppet] - 10https://gerrit.wikimedia.org/r/489242 (https://phabricator.wikimedia.org/T212972) (owner: 10Anomie) [15:39:23] (03CR) 10Bstorm: [C: 03+2] wiki replicas: Remove 
reference to old comment fields [puppet] - 10https://gerrit.wikimedia.org/r/489242 (https://phabricator.wikimedia.org/T212972) (owner: 10Anomie) [15:41:18] (03PS1) 10Gehel: elasticsearch: use the admin Reason to get current hostname [software/spicerack] - 10https://gerrit.wikimedia.org/r/492341 [15:42:23] (03PS1) 10Zoranzoki21: Removed empty space on line 376 of maintain-views.yaml [puppet] - 10https://gerrit.wikimedia.org/r/492343 [15:43:06] (03PS2) 10Zoranzoki21: Removed empty space in line 376 of maintain-views.yaml [puppet] - 10https://gerrit.wikimedia.org/r/492343 [15:43:28] (03PS4) 10Ottomata: Monitor stream.wikimedia.org public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) [15:43:41] bstorm_: Can you check and this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/492343/ [15:44:30] (03CR) 10Ottomata: Monitor stream.wikimedia.org public endpoint (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) (owner: 10Ottomata) [15:45:24] That one bugs you, does it :) I can merge that [15:45:35] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: use the admin Reason to get current hostname [software/spicerack] - 10https://gerrit.wikimedia.org/r/492341 (owner: 10Gehel) [15:46:30] (03CR) 10Bstorm: [C: 03+2] Removed empty space in line 376 of maintain-views.yaml [puppet] - 10https://gerrit.wikimedia.org/r/492343 (owner: 10Zoranzoki21) [15:46:57] bstorm_: Loves you <3 [15:47:06] :) [15:47:27] (03PS2) 10Gehel: elasticsearch: use the admin Reason to get current hostname [software/spicerack] - 10https://gerrit.wikimedia.org/r/492341 [15:47:45] bstorm_: Tnx. Do you have access to run maintenance scripts for MediaWiki? [15:49:02] Access, potentially, context making it a good idea, not very likely. 
[15:49:30] (03PS5) 10Ottomata: Monitor stream.wikimedia.org public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) [15:50:03] I just maintain the views on the wiki replicas and some other cloud things like that [15:50:12] (03PS1) 10Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492344 (https://phabricator.wikimedia.org/T191086) [15:50:44] (03PS2) 10Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492344 (https://phabricator.wikimedia.org/T213969) [15:51:52] bstorm_: I think on deploy server.. For T216322 [15:51:53] T216322: Create a new namespaces on thai wikimedia projects - https://phabricator.wikimedia.org/T216322 [15:51:58] Oops wrong task [15:52:06] I think on T216806 [15:52:07] T216806: Run script to fix inconsistent titles for Kurdish (Sorani) Wikipedia - https://phabricator.wikimedia.org/T216806 [15:52:37] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Cmjohnson) That’s not really all that interesting, they send us refurbished parts all the time that don’t work. I will submit another ticket for a new CPU. [15:53:53] Yeah, nope, not it. :) [15:54:28] It's not valid anyway [15:54:29] Access: uncertain, context and such to be able to do things safely: definitely not [15:54:56] Reedy; What is not valid? 
[15:55:00] the task [15:55:04] There's nothing namespaceDupes to fix [15:55:10] https://phabricator.wikimedia.org/T216806#4976309 [15:55:31] Reex [15:55:33] Oops [15:55:34] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:55:36] *Reedy: Ok tnx [15:56:32] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga2001 is OK: (C)1e+05 gt (W)1e+04 gt 8669 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:56:54] (03PS1) 10Bmansurov: Stop collecting data for CitaitonUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492345 (https://phabricator.wikimedia.org/T213969) [15:57:02] (03CR) 10Elukey: [C: 03+1] "All right then, looks good! I'd prefer to have all hiera calls and not profile instantiated but it is only a nit, feel free to merge!" 
[puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) (owner: 10Ottomata) [15:57:26] (03CR) 10Ottomata: "Looks ok now:" [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) (owner: 10Ottomata) [15:58:05] (03CR) 10Ottomata: [C: 03+2] Monitor stream.wikimedia.org public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) (owner: 10Ottomata) [15:58:11] (03PS6) 10Ottomata: Monitor stream.wikimedia.org public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/492199 (https://phabricator.wikimedia.org/T215013) [16:06:57] (03PS1) 10Muehlenhoff: Explicitly install ruby-safe-yaml to fix Puppet Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/492346 (https://phabricator.wikimedia.org/T213546) [16:11:43] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:12:03] (03PS3) 10CRusnov: ganeti: Change ownership of rapi users file to match required ownership [puppet] - 10https://gerrit.wikimedia.org/r/492203 (https://phabricator.wikimedia.org/T215229) [16:18:24] (03CR) 10Elukey: [V: 03+2 C: 03+2] sessions: updated key material [labs/private] - 10https://gerrit.wikimedia.org/r/492338 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [16:18:39] urandom: ---^ [16:20:21] anybody checking the MW exceptions? [16:21:09] seems to be concentrated on one wiki afaics [16:21:59] indeed, jawiki afaics [16:22:04] yep [16:23:29] not sure what the exception means though [16:25:01] elukey: thanks! 
[16:27:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:29:46] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [16:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:53] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [16:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:55] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10Cmjohnson) backup1001 is all connected now, I do notice that the raid card is not picking up any of the disk arrays. [16:37:27] (03CR) 10Thcipriani: [C: 03+2] scap prep: ensure old directories exist, add logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490900 (owner: 10Thcipriani) [16:38:32] (03Merged) 10jenkins-bot: scap prep: ensure old directories exist, add logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490900 (owner: 10Thcipriani) [16:42:54] (03CR) 10jenkins-bot: scap prep: ensure old directories exist, add logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490900 (owner: 10Thcipriani) [16:52:40] (03PS1) 10Ottomata: Use nagios_common::check_command::config to define check_eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/492349 (https://phabricator.wikimedia.org/T215013) [16:56:44] (03CR) 10Ottomata: [C: 03+2] Use nagios_common::check_command::config to define check_eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/492349 (https://phabricator.wikimedia.org/T215013) (owner: 10Ottomata) [17:00:58] (03PS1) 10Ottomata: Use user/group root for icinga check command for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/492350 [17:04:58] (03CR) 10Ottomata: [C: 03+2] Use user/group 
root for icinga check command for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/492350 (owner: 10Ottomata) [17:06:01] PROBLEM - puppet last run on icinga2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/icinga/commands/check_eventstreams.cfg] [17:06:22] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [17:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:24] ^^ me [17:06:27] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [17:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:38] 10Operations, 10ORES, 10Scoring-platform-team: [Discuss] ORES without celery - https://phabricator.wikimedia.org/T216838 (10Halfak) [17:11:11] RECOVERY - puppet last run on icinga2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:13:06] !log cp5006: repooling into service - T216717 [17:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:10] T216717: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 [17:14:09] !log cp5007: repooling into service - T216716 [17:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:13] T216716: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 [17:18:26] (03CR) 10Paladox: [C: 04-2] "Support for this is being merged upstream, ill abandon this change later!" 
[software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/490225 (owner: 10Paladox) [17:23:30] 10Operations, 10ops-eqiad: ms-be1033 down and not powering up - https://phabricator.wikimedia.org/T215998 (10Cmjohnson) 05Open→03Resolved Resolving, feel free to open if the problem returns [17:24:21] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T214778 (10Cmjohnson) @fgiunchedi Let's do this on Monday if you are available and now that ms-be1033 is working again. [17:33:07] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [17:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:12] (03PS1) 10Ottomata: Revert "Revert "Use EventBus multi endpoint configuration for eventbus configs"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492356 [17:33:12] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [17:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:14] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [17:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:20] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-upgrade [17:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:21] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [17:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:52] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: Prepare puppet for Debian buster - https://phabricator.wikimedia.org/T213546 (10GTirloni) [18:00:10] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [18:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:15] !log gehel@cumin2001 END (PASS) - Cookbook 
sre.elasticsearch.force-shard-allocation (exit_code=0) [18:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:43] !log rolling upgrade on elasticsearch / cirrus / eqiad completed - T215931 [18:02:43] (03CR) 10Paladox: [C: 04-2] "test" [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/490225 (owner: 10Paladox) [18:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:46] T215931: Upgrade elasticsearch to 5.6.14 - https://phabricator.wikimedia.org/T215931 [18:29:57] (03PS1) 10Ottomata: Install maven and ivysettings on all hadoop workers and clients [puppet] - 10https://gerrit.wikimedia.org/r/492361 (https://phabricator.wikimedia.org/T216093) [18:32:56] (03PS5) 10Jcrespo: mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) [18:33:18] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [18:34:19] 10Operations, 10serviceops, 10MW-1.33-notes (1.33.0-wmf.19; 2019-02-26), 10Patch-For-Review, 10User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (10Krinkle) @Joe I understand the choice between VCL in Varnish and client-side JS favouring the latter. While I'm no... 
[18:35:11] (03CR) 10Jcrespo: "So I got a full snapshot cycle- It took a long time, with only one failure- the latest backup wasn't archived to archive due to checking i" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [18:37:23] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/14802/" [puppet] - 10https://gerrit.wikimedia.org/r/492361 (https://phabricator.wikimedia.org/T216093) (owner: 10Ottomata) [18:50:42] (03CR) 10Krinkle: [C: 03+1] Never try apt_pkg when parsing control [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487849 (owner: 10Hashar) [18:51:46] (03CR) 10EBernhardson: [C: 03+1] "ivysettings looks good. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/492361 (https://phabricator.wikimedia.org/T216093) (owner: 10Ottomata) [18:52:33] (03PS2) 10Hashar: Never try apt_pkg when parsing control [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487849 (https://phabricator.wikimedia.org/T216836) [18:53:07] (03CR) 10Hashar: [C: 03+1] "Timo had the same concern and filled T216836, so I have merely attached this change to the task :]" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487849 (https://phabricator.wikimedia.org/T216836) (owner: 10Hashar) [19:39:33] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [19:40:23] 10Operations, 10Thumbor, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10jijiki) [19:40:37] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [19:40:39] 10Operations, 10Thumbor, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 
(10jijiki) [19:41:15] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [19:41:18] 10Operations, 10Thumbor, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10jijiki) [19:43:11] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [19:44:17] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [19:44:24] 10Operations, 10Commons, 10Thumbor, 10media-storage, 10Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628 (10jijiki) [19:44:45] 10Operations, 10Commons, 10Thumbor, 10media-storage, 10Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628 (10jijiki) [19:44:47] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [19:45:03] 10Operations, 10Commons, 10Thumbor, 10media-storage, 10Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628 (10jijiki) [19:46:19] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [19:46:24] 10Operations, 10Commons, 10Thumbor, 10media-storage, 10Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628 (10jijiki) [19:47:49] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) 
[19:47:51] 10Operations, 10Commons, 10Thumbor, 10media-storage, 10Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628 (10jijiki) [19:51:13] (03PS1) 10Andrew Bogott: toolforge: remove use of '::imagemagick::install' [puppet] - 10https://gerrit.wikimedia.org/r/492372 [19:51:58] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [19:52:27] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: remove use of '::imagemagick::install' [puppet] - 10https://gerrit.wikimedia.org/r/492372 (owner: 10Andrew Bogott) [19:53:34] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [19:54:43] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops: Meta Swift container rights incorrect for thumbor user - https://phabricator.wikimedia.org/T216807 (10jijiki) 05Open→03Resolved Will reopen if we run into this again, works for now. 
[19:54:47] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10jijiki) [20:07:43] (03PS1) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [20:09:10] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/compiler1002/14803/sessionstore1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [20:11:48] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [20:11:53] (03CR) 10Mobrovac: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [20:19:28] (03PS2) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [20:23:43] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [20:27:33] (03CR) 10Gehel: "I think I disagree with prospector on this one. This is a valid use case for an assert." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [20:39:18] 10Operations, 10Discovery-Search, 10Epic, 10Patch-For-Review: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (10debt) [20:39:21] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Test spicerack elasticsearch module - https://phabricator.wikimedia.org/T207920 (10debt) 05Open→03Resolved [20:45:43] (03PS1) 10Gehel: elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 [20:46:23] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Create new elastic56 component in reprepro and upload elasticsearch and plugins - https://phabricator.wikimedia.org/T216047 (10debt) 05Open→03Resolved [20:49:59] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 (owner: 10Gehel) [20:59:03] (03PS2) 10Gehel: elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 [21:04:40] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 (owner: 10Gehel) [21:14:35] 10Operations, 10Parsoid, 10RESTBase, 10Traffic, and 5 others: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 (10mobrovac) We have logged some VE requests on the RB side, and it turns out we cannot rely on the session ID to be present in the request. VE does send it, but o... 
[21:16:08] (03PS1) 10Herron: WIP: rsyslog: change udp_localhost_compat to define [puppet] - 10https://gerrit.wikimedia.org/r/492390 [21:16:10] (03PS1) 10Herron: WIP: rsyslog: add mwlog udp_localhost_compat config to mwlog_shipper [puppet] - 10https://gerrit.wikimedia.org/r/492391 [21:16:16] 10Operations, 10Proton, 10Core Platform Team Backlog (Watching / External), 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (watching): Proton fails with Chromium 72.0.3626.96 - https://phabricator.wikimedia.org/T216493 (10mobrovac) Perhaps we should consider packaging the fixed Chromium versi... [21:16:46] (03CR) 10jerkins-bot: [V: 04-1] WIP: rsyslog: change udp_localhost_compat to define [puppet] - 10https://gerrit.wikimedia.org/r/492390 (owner: 10Herron) [21:17:02] (03CR) 10jerkins-bot: [V: 04-1] WIP: rsyslog: add mwlog udp_localhost_compat config to mwlog_shipper [puppet] - 10https://gerrit.wikimedia.org/r/492391 (owner: 10Herron) [21:19:15] (03PS2) 10Herron: WIP: rsyslog: change udp_localhost_compat to define [puppet] - 10https://gerrit.wikimedia.org/r/492390 [21:21:01] (03PS2) 10Herron: WIP: rsyslog: add mwlog udp_localhost_compat config to mwlog_shipper [puppet] - 10https://gerrit.wikimedia.org/r/492391 [21:24:36] (03PS3) 10Herron: WIP: rsyslog: change udp_localhost_compat to define [puppet] - 10https://gerrit.wikimedia.org/r/492390 [21:26:28] (03PS4) 10Herron: WIP: rsyslog: change udp_localhost_compat to define [puppet] - 10https://gerrit.wikimedia.org/r/492390 [21:27:03] (03CR) 10jerkins-bot: [V: 04-1] WIP: rsyslog: change udp_localhost_compat to define [puppet] - 10https://gerrit.wikimedia.org/r/492390 (owner: 10Herron) [21:32:32] 10Operations, 10MediaWiki-Debug-Logger, 10Performance-Team: Set up request debug profiling for PHP 7 - https://phabricator.wikimedia.org/T206152 (10Krinkle) 05Open→03Resolved There are some minor differences between how HHVM/XHProf work and how php72-tideways work. For example, HHVM used fake stack frame... 
[21:32:44] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [21:34:01] (03PS5) 10Herron: WIP: rsyslog: change udp_localhost_compat to define [puppet] - 10https://gerrit.wikimedia.org/r/492390 [21:42:57] (03PS6) 10Herron: WIP: rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - 10https://gerrit.wikimedia.org/r/492390 [21:44:50] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/14808/" [puppet] - 10https://gerrit.wikimedia.org/r/492390 (owner: 10Herron) [21:45:16] (03CR) 10Herron: "still wip, but interested in feedback on the approach" [puppet] - 10https://gerrit.wikimedia.org/r/492390 (owner: 10Herron) [22:38:12] (03PS4) 10Kosta Harlan: GrowthExperiments: Soft launch of help panel on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489729 (https://phabricator.wikimedia.org/T215666) [23:08:26] (03PS3) 10Gehel: elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 [23:12:36] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 (owner: 10Gehel) [23:36:18] (03PS5) 10Ppchelko: Switch kafka logging to EventBus logging. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) [23:37:16] (03CR) 10jerkins-bot: [V: 04-1] Switch kafka logging to EventBus logging. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) (owner: 10Ppchelko) [23:43:51] (03PS6) 10Ppchelko: Add eventbus analytics logging alongside with kafka logging. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) [23:44:49] (03CR) 10jerkins-bot: [V: 04-1] Add eventbus analytics logging alongside with kafka logging. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) (owner: 10Ppchelko) [23:45:44] (03PS7) 10Ppchelko: Add eventbus analytics logging alongside with kafka logging. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163)