[00:00:02] <brennen>	 Daimona: holding this window for resolving the current train situation.
[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210122T0000).
[00:00:04] <jouncebot>	 annet and kemayo: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:07] <Jdlrobson>	 brennen: zuul is killing me
[00:00:26] <Kemayo>	 Mmmm, stickers.
[00:00:41] <brennen>	 above cc: annet, Kemayo, Urbanecm.
[00:00:46] <annet>	 I'm here!
[00:01:05] <brennen>	 currently dealing with a train blocker; hopefully we can get that cleared out in time for backport window to proceed.
[00:01:13] <brennen>	 apologies for the delay.
[00:01:26] <Kemayo>	 No worries
[00:01:31] <annet>	 no prob! thanks for the update
[00:01:41] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:02:33] <mutante>	 ^ something went wrong during reimaging.. got it
[00:03:17] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2370.codfw.wmnet with reason: new install on buster
[00:03:17] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2370.codfw.wmnet with reason: new install on buster
[00:03:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:23] <Urbanecm>	 Ack brennen, feel free to ping me if deployment can resume. 
[00:03:39] <brennen>	 will do.
[00:04:27] <Kemayo>	 My stuff won't be greatly hurt if it has to wait for Monday.
[00:06:25] <Urbanecm>	 brennen: is there sth I can do to help train?
[00:06:40] <wikibugs>	 (CR) Bstorm: "Deployed this to toolsbeta and kicked over one of the api server pods just in case. It didn't seem to care. Merging." [puppet] - https://gerrit.wikimedia.org/r/639883 (https://phabricator.wikimedia.org/T263284) (owner: Bstorm)
[00:06:55] <wikibugs>	 (CR) Bstorm: [C: +2] toolforge-k8s: AdmissionsConfiguration is GA after 1.17 [puppet] - https://gerrit.wikimedia.org/r/639883 (https://phabricator.wikimedia.org/T263284) (owner: Bstorm)
[00:07:02] <brennen>	 Urbanecm: i think we're good, thank you though
[00:07:11] <Jdlrobson>	 just waiting on zuul
[00:07:11] <thcipriani>	 are we just staring really hard at https://integration.wikimedia.org/zuul/ at the moment? That mobilefrontend patch?
[00:07:17] <Jdlrobson>	 yep
[00:07:20] <Jdlrobson>	 pretty much
[00:07:27] <Jdlrobson>	 wondering why it takes so long
[00:07:37] <Urbanecm>	 brennen: ack :)
[00:07:46] <brennen>	 thcipriani: correct.
[00:07:47] <Jdlrobson>	 also watching hundreds of thousands of errors coming in on my other tab
[00:08:00] <thcipriani>	 ugh
[00:08:44] <Urbanecm>	 Force merge is a thing if we can't wait
[00:08:50] <wikibugs>	 (CR) Bstorm: "I don't think we need to restart the apiservers just now, but they will be fine when they do restart. This will also keep things sane when" [puppet] - https://gerrit.wikimedia.org/r/639883 (https://phabricator.wikimedia.org/T263284) (owner: Bstorm)
[00:10:31] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2370.codfw.wmnet
[00:10:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:32] <wikibugs>	 (Merged) jenkins-bot: Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638) (owner: Brennen Bearnes)
[00:11:56] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH)
[00:13:00] <Jdlrobson>	 brennen: finally!
[00:13:01] <icinga-wm>	 ACKNOWLEDGEMENT - Host releases2002 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T272555
[00:14:06] <wikibugs>	 (PS2) Dzahn: Remove obsolete role installserver::apt [puppet] - https://gerrit.wikimedia.org/r/657569 (https://phabricator.wikimedia.org/T272559) (owner: Muehlenhoff)
[00:14:08] <wikibugs>	 (PS1) Dzahn: admin: update SSH key for Volker_E [puppet] - https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628)
[00:14:36] <Jdlrobson>	 brennen: ready to test
[00:14:53] <brennen>	 Jdlrobson: syncing, one moment
[00:15:04] <wikibugs>	 (PS1) Daimona Eaytoy: Adjust AF config for ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657706
[00:15:30] <brennen>	 Jdlrobson: should be on mwdebug1002
[00:15:56] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2372.codfw.wmnet'] `  an...
[00:16:05] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2370 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:16:46] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2370.codfw.wmnet'] `  an...
[00:16:47] <Urbanecm>	 Daimona_: I recommend clarifying the commit message if possible and linking that private task in commit message (assuming it's related)
[00:18:23] <Jdlrobson>	 brennen: testing
[00:18:26] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2372.codfw.wmnet
[00:18:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:14] <Daimona_>	 Yeah it is
[00:19:14] <Jdlrobson>	 brennen: please sync!
[00:19:32] <brennen>	 Jdlrobson: thanks, syncing
[00:19:41] <Urbanecm>	 Daimona_: will sync it then once deployment resumes (should happen soon)
[00:19:58] <wikibugs>	 (PS2) Daimona Eaytoy: Adjust AF config for ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657706 (https://phabricator.wikimedia.org/T272330)
[00:20:20] <Daimona_>	 Thank you :)
[00:20:38] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[00:20:42] <logmsgbot>	 !log brennen@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/MobileFrontend: Backport: [[gerrit:657702|Fix toggling storage cleanup (T272638)]] (duration: 01m 07s)
[00:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:47] <stashbot>	 T272638: TypeError: null is not an object (evaluating 't[e.title]')  on mobile domain - https://phabricator.wikimedia.org/T272638
[00:21:16] <wikibugs>	 (PS1) RobH: dhcp entries for new db systems [puppet] - https://gerrit.wikimedia.org/r/657710 (https://phabricator.wikimedia.org/T267043)
[00:22:12] <wikibugs>	 (CR) RobH: [C: +2] dhcp entries for new db systems [puppet] - https://gerrit.wikimedia.org/r/657710 (https://phabricator.wikimedia.org/T267043) (owner: RobH)
[00:23:39] <Jdlrobson>	 brennen  are we still going to roll 26 > 27?
[00:23:50] <wikibugs>	 (CR) Urbanecm: [C: +2] Adjust AF config for ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657706 (https://phabricator.wikimedia.org/T272330) (owner: Daimona Eaytoy)
[00:24:03] <Jdlrobson>	 If not I will need to prepare a  backport for 1.36.0-wmf.26
[00:24:14] <Urbanecm>	 ^^that was approved by brennen in other channel^^
[00:24:17] <brennen>	 Jdlrobson: yeah, a quick config patch going out first then that.
[00:24:32] <Jdlrobson>	 👍
[00:25:24] <wikibugs>	 (Merged) jenkins-bot: Adjust AF config for ukwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657706 (https://phabricator.wikimedia.org/T272330) (owner: Daimona Eaytoy)
[00:25:54] <brennen>	 Jdlrobson: i perhaps wasn't quite thinking clearly about this, but i guess that should be backported to .26 in case a rollback becomes necessary regardless?
[00:26:00] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH)
[00:27:07] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d4f5d6f09977962be1c49471432125a92357ede6: Temporarily amend ukwiki AF configuration (T272330) (duration: 01m 03s)
[00:27:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:11] <Urbanecm>	 brennen: over to you :)
[00:28:03] <brennen>	 Urbanecm: ack, rolling to group2.
[00:28:45] <wikibugs>	 (PS1) Brennen Bearnes: all wikis to 1.36.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/657712
[00:28:47] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] all wikis to 1.36.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/657712 (owner: Brennen Bearnes)
[00:29:31] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.36.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/657712 (owner: Brennen Bearnes)
[00:30:20] <Jdlrobson>	 brennen: nope
[00:30:23] <Jdlrobson>	 provided we roll forward
[00:30:27] <Jdlrobson>	 no backport to .26 needed
[00:30:37] <Jdlrobson>	 the only reason we have an error is we backported after running bad wmf27 code
[00:30:39] <Jdlrobson>	 it's all a bit confusing
[00:31:15] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['db1156.eqiad.wmnet', 'db1157.eqiad.wmnet', 'db1158.eqiad.wmnet',...
[00:31:22] <brennen>	 yeah, i'm very much used to thinking in the other direction.
[00:31:26] <Jdlrobson>	 haha me too
[00:31:30] <Jdlrobson>	 my head is almost exploding
[00:31:37] <logmsgbot>	 !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.27
[00:31:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:29] <brennen>	 ok, there we are.  Urbanecm, i think perhaps we should give it a minute or two, but you should be clear for backports after that.
[00:33:57] <thcipriani>	 \o/ well trained
[00:34:13] <Urbanecm>	 brennen: ack. May I +2 backports to save some zuul waiting?
[00:34:22] <brennen>	 please do
[00:34:48] <Urbanecm>	 annet: Kemayo: your backports will be ready once they merge, fyi :)
[00:35:06] <annet>	 awesome
[00:35:10] <wikibugs>	 (CR) Urbanecm: [C: +2] Distinguish between null continue value and unknown one [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657623 (https://phabricator.wikimedia.org/T272548) (owner: Anne Tomasevich)
[00:35:14] <Kemayo>	 Righteous.
[00:35:20] <Jdlrobson>	 brennen: we're live?
[00:35:34] <wikibugs>	 (CR) Urbanecm: [C: +2] A/B test output when a specific feature is being tested [extensions/DiscussionTools] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657653 (https://phabricator.wikimedia.org/T268191) (owner: DLynch)
[00:35:41] <brennen>	 Jdlrobson: we are.  i... think client errors are slowly dropping?
[00:35:51] <Jdlrobson>	 brennen: im not seeing the error when testing so that's good
[00:35:52] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2370.codfw.wmnet
[00:35:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:55] <Jdlrobson>	 ill keep an eye on the graphs :)
[00:36:13] <Jdlrobson>	 5 - 10 mins of data should show us we're clear
[00:36:18] <brennen>	 cool.
[00:37:10] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2372.codfw.wmnet
[00:37:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:07] <Jdlrobson>	 so far so good :)
[00:38:17] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[00:39:12] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[00:39:41] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2376.codfw.wmnet with reason: REIMAGE
[00:39:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:13] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[00:41:41] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2376.codfw.wmnet with reason: REIMAGE
[00:41:42] <wikibugs>	 (PS1) Urbanecm: Don't return the status of doBlockInternal when processing block actions [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657654
[00:41:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:45] <wikibugs>	 (PS2) Legoktm: Don't return the status of doBlockInternal when processing block actions [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657654 (owner: Urbanecm)
[00:41:56] <legoktm>	 hah
[00:42:00] <Urbanecm>	 eh, double cherry-pick legoktm :)
[00:42:03] <wikibugs>	 (CR) Dzahn: "I amended to also remove the cumin alias. Double checking now" [puppet] - https://gerrit.wikimedia.org/r/657569 (https://phabricator.wikimedia.org/T272559) (owner: Muehlenhoff)
[00:42:14] <wikibugs>	 (CR) Urbanecm: [C: +2] Don't return the status of doBlockInternal when processing block actions [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657654 (owner: Urbanecm)
[00:42:45] <brennen>	 Jdlrobson: looking pretty good i'd say.
[00:43:34] <Jdlrobson>	 thanks brennen 
[00:43:39] <Jdlrobson>	 yeh they are tailing off
[00:43:46] <Jdlrobson>	 with caching might take a while for them to completely disappear
[00:44:12] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1166.eqiad.wmnet with reason: REIMAGE
[00:44:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:15] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE
[00:44:15] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1164.eqiad.wmnet with reason: REIMAGE
[00:44:16] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1168.eqiad.wmnet with reason: REIMAGE
[00:44:16] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1169.eqiad.wmnet with reason: REIMAGE
[00:44:16] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1170.eqiad.wmnet with reason: REIMAGE
[00:44:16] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1160.eqiad.wmnet with reason: REIMAGE
[00:44:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:16] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1165.eqiad.wmnet with reason: REIMAGE
[00:44:17] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: REIMAGE
[00:44:18] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1163.eqiad.wmnet with reason: REIMAGE
[00:44:18] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1167.eqiad.wmnet with reason: REIMAGE
[00:44:18] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: REIMAGE
[00:44:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:18] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1161.eqiad.wmnet with reason: REIMAGE
[00:44:19] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE
[00:44:19] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1174.eqiad.wmnet with reason: REIMAGE
[00:44:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:30] <robh>	 so many pings
[00:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:48] * Jdlrobson shakes his fist and shouts "get off my lawn!"
[00:44:54] <mutante>	 heh
[00:45:14] <robh>	 i love the reimage script anyhow
[00:45:25] * mutante adds more of that - also reimaging :)
[00:45:28] <robh>	 i didnt have to manually image a dozen hosts, its awesome
[00:45:29] <icinga-wm>	 RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:24] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: REIMAGE
[00:46:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:27] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1164.eqiad.wmnet with reason: REIMAGE
[00:46:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:41] <wikibugs>	 SRE, LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (JKatzWMF) Approved, thanks!
[00:47:26] <mutante>	 robh: 250 new Icinga checks now in PENDING, heh
[00:47:45] <robh>	 dbs gonna be happy when they wake up
[00:47:49] <robh>	 all new metal
[00:47:53] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2007824168 and 90 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:02] <robh>	 dbas that is
[00:48:17] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE
[00:48:18] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1169.eqiad.wmnet with reason: REIMAGE
[00:48:18] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1170.eqiad.wmnet with reason: REIMAGE
[00:48:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:18] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1160.eqiad.wmnet with reason: REIMAGE
[00:48:19] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1165.eqiad.wmnet with reason: REIMAGE
[00:48:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 169804832 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:41] <Urbanecm>	 we're probably going to need to extend the B&C for a while, considering zuul says 13 to 20 minutes for the pending backports
[00:49:16] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1168.eqiad.wmnet with reason: REIMAGE
[00:49:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:17] <mutante>	 robh: when doing a lot of mw i would say maybe 1 in 10 or 1 in 20 I had "failed to set downtime" from reimaging script. but if i see it I just "manually" use the downtime cookbook
[00:49:17] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1158.eqiad.wmnet with reason: REIMAGE
[00:49:18] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1163.eqiad.wmnet with reason: REIMAGE
[00:49:18] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1167.eqiad.wmnet with reason: REIMAGE
[00:49:18] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1157.eqiad.wmnet with reason: REIMAGE
[00:49:18] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1161.eqiad.wmnet with reason: REIMAGE
[00:49:18] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1174.eqiad.wmnet with reason: REIMAGE
[00:49:18] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE
[00:49:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:37] <mutante>	 robh: this is using regex in a single command?
[00:49:43] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 202600 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:51] <robh>	 used: sudo -i wmf-auto-reimage -p T267043 --new --force db1156.eqiad.wmnet db1157.eqiad.wmnet db1158.eqiad.wmnet db1159.eqiad.wmnet db1160.eqiad.wmnet db1161.eqiad.wmnet db1162.eqiad.wmnet db1163.eqiad.wmnet db1164.eqiad.wmnet db1165.eqiad.wmnet db1166.eqiad.wmnet db1167.eqiad.wmnet db1168.eqiad.wmnet db1169.eqiad.wmnet db1170.eqiad.wmnet db1171.eqiad.wmnet db1172.eqiad.wmnet db1173.eqiad.wmnet db1174.eqiad.wmnet db1175.eqiad.wmnet 
[00:49:51] <stashbot>	 T267043: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043
[00:49:59] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 392600 and 83 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:00] <mutante>	 gotcha, nod
[00:50:01] <robh>	 i wasnt sure how well regex would work
[00:50:12] <robh>	 and that was super easy to macro out into a text to copy paste 
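
A minimal sketch of generating the host list quoted above instead of typing it out. The `expand_hosts` helper and the `db11[56-75]` bracket syntax are hypothetical, for illustration only; this does not describe what `wmf-auto-reimage` itself accepts. The flags in the printed command are copied from robh's message, and only the host list is generated.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed numeric range, e.g. 'db11[56-58].eqiad.wmnet'."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range present: treat the pattern as a single literal hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding, if any
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


hosts = expand_hosts("db11[56-75].eqiad.wmnet")
# Same flags as the command quoted in the log; only the host list is generated.
print("sudo -i wmf-auto-reimage -p T267043 --new --force " + " ".join(hosts))
```

The expansion happens before the command is built, so the script still receives plain hostnames and nothing depends on whether it understands ranges or regexes.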
[00:51:26] <robh>	 bleh, i hope it all passes i dont wanna work late!
[00:51:41] <robh>	 i dont understand why im seeing cumin puppet runs in output
[00:51:50] <robh>	 00:51:21 | cumin1001.eqiad.wmnet | Puppet run completed 
[00:51:54] <robh>	 that seems... not right.
[00:52:01] <mutante>	 right, i did it slightly different from both options and split my terminal into 4 and run the reimage script multiple times with one host each, in parallel
[00:52:59] <robh>	 so i dont see anything earlier in output but seeing puppet run calls for the cumin host the script runs on isnt something ive noticed in the past
[00:53:16] <robh>	 but if its just puppet runs it wont matter, perhaps its parsing all the log files and includes local, not really sure wtf is up with that
[00:53:51] <robh>	 maybe it needs to run on cumin once host is done to pull puppet host keys... thats likely it but not sure
[00:53:55] <mutante>	 eh.. it is normal that it starts a puppet run on the host, but I don't see that on cumin1001
[00:54:10] <mutante>	 ah
[00:55:20] <mutante>	 00:54:43 | mw2374.codfw.wmnet | Polling until a Puppet sign request appears
[00:55:23] <mutante>	 00:54:47 | mw2374.codfw.wmnet | Signed Puppet cert
[00:55:29] <robh>	 yeah i hadnt noticed it before but it likely was always doing it
[00:55:31] <icinga-wm>	 PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:35] <mutante>	 this is the expected part that I see when it gets to the first run
[00:55:52] <robh>	 oh man this is so many hosts, ill split this in half next time
[00:55:53] <mutante>	 sounds reasonable
[00:56:02] <robh>	 there are too many hosts flying by to monitor errors in real time
[00:56:13] <robh>	 so ill have to parse for failure from report at end.
[00:56:42] <mutante>	 I stuck to 4 at a time for that reason, it seemed to be the right amount to still watch on the side while reasonably quick
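
A rough sketch of the "4 at a time" idea applied to the 20 db hosts from robh's command. The `in_batches` helper is hypothetical, and whether each group is run in one invocation or as parallel single-host runs (as mutante describes) is left to the operator.

```python
from typing import Iterator


def in_batches(hosts: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


# db1156 through db1175, the hosts named in the reimage command in the log.
hosts = [f"db{number}.eqiad.wmnet" for number in range(1156, 1176)]
for batch in in_batches(hosts, 4):
    print(" ".join(batch))
```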
[00:56:50] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2374.codfw.wmnet with reason: REIMAGE
[00:56:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:21] <mutante>	 every once in a while there were special cases, for example remote IPMI needed fix or a bug to report in netbox script
[00:57:41] <mutante>	 you can also look at logs afterwards though
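
One way to check for failures afterwards, sketched under assumptions: the format of the reimage script's own final report is not shown in this log, so this works on the `logmsgbot` `END (PASS)` / `END (FAIL)` lines instead. The `failed_hosts` helper and the regex are illustrative, not part of any Wikimedia tooling; the sample lines are copied from the log above.

```python
import re

# Matches the "END (...)" cookbook lines as they appear in this channel log.
END_LINE = re.compile(
    r"END \((?P<status>PASS|FAIL)\) - Cookbook (?P<cookbook>\S+) "
    r"\(exit_code=(?P<code>\d+)\) for \S+ on (?P<host>\S+) with reason"
)


def failed_hosts(lines):
    """Return (host, exit_code) pairs for every END (FAIL) line."""
    failures = []
    for line in lines:
        match = END_LINE.search(line)
        if match and match.group("status") == "FAIL":
            failures.append((match.group("host"), int(match.group("code"))))
    return failures


sample = [
    "!log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: REIMAGE",
    "!log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1164.eqiad.wmnet with reason: REIMAGE",
    "!log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE",
]
print(failed_hosts(sample))
```

The resulting host list could then be fed to a manual run of the downtime cookbook, as mutante suggests for the "failed to set downtime" case.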
[00:58:13] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1166.eqiad.wmnet', 'db1164.eqiad.wmnet', 'db1170.eqiad.wmnet', 'db1162.eqiad.wmnet', 'db1160.eqiad.wmnet', 'db1...
[00:58:18] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2368.codfw.wmnet with reason: REIMAGE
[00:58:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:31] <mutante>	 well, if it's all brand new then i'd go higher
[00:58:52] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2374.codfw.wmnet with reason: REIMAGE
[00:58:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:12] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2366.codfw.wmnet with reason: REIMAGE
[00:59:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:05] <wikibugs>	 (PS3) Dzahn: Remove obsolete role installserver::apt [puppet] - https://gerrit.wikimedia.org/r/657569 (https://phabricator.wikimedia.org/T272559) (owner: Muehlenhoff)
[01:00:41] <Urbanecm>	 !log Evening B&C still in process, waiting on Zuul
[01:00:43] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1804276424 and 122 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:48] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/27585/" [puppet] - https://gerrit.wikimedia.org/r/657569 (https://phabricator.wikimedia.org/T272559) (owner: Muehlenhoff)
[01:00:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7404324880 and 482 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:01:02] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2368.codfw.wmnet with reason: REIMAGE
[01:01:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:41] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2376.codfw.wmnet'] `  an...
[01:02:27] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2376.codfw.wmnet
[01:02:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:00] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2366.codfw.wmnet with reason: REIMAGE
[01:03:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:38] <wikibugs>	 (Merged) jenkins-bot: Distinguish between null continue value and unknown one [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657623 (https://phabricator.wikimedia.org/T272548) (owner: Anne Tomasevich)
[01:05:40] <wikibugs>	 (Merged) jenkins-bot: A/B test output when a specific feature is being tested [extensions/DiscussionTools] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657653 (https://phabricator.wikimedia.org/T268191) (owner: DLynch)
[01:05:42] <wikibugs>	 (Merged) jenkins-bot: Don't return the status of doBlockInternal when processing block actions [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/657654 (owner: Urbanecm)
[01:05:48] <Urbanecm>	 \o/
[01:05:54] <Urbanecm>	 annet: Kemayo: are you around?
[01:05:58] <annet>	 yep!
[01:05:58] <Kemayo>	 🎉
[01:06:05] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2376.codfw.wmnet
[01:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:14] <Urbanecm>	 ok, gimme a while to pull it all to debug servers :)
[01:06:37] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH)
[01:07:15] <Urbanecm>	 annet: Kemayo: Daimona_: your backports are available at mwdebug1001, please test :)
[01:07:51] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (RobH) > Of which those **FAILED**: > ` > ['db1159.eqiad.wmnet', 'db1171.eqiad.wmnet', 'db1172.eqiad.wmnet', 'db1173.eqiad.wmnet', 'db1175.eqiad.wmnet'] > `  I've updated...
[01:08:02] <Kemayo>	 Urbanecm: Mine'll just be verifying nothing is broken, until the accompanying config patch goes up.
[01:08:16] <Urbanecm>	 ah, I'll pull the config patch as well then, gimme a second
[01:08:38] <annet>	 Urbanecm: testing...
[01:08:42] <wikibugs>	 (CR) Urbanecm: [C: +2] Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191) (owner: DLynch)
[01:08:47] <wikibugs>	 (CR) Urbanecm: Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191) (owner: DLynch)
[01:08:51] <wikibugs>	 (PS2) Urbanecm: Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191) (owner: DLynch)
[01:08:56] <wikibugs>	 (CR) Urbanecm: [C: +2] Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191) (owner: DLynch)
[01:09:17] <Daimona_>	 Testing
[01:09:34] <Kemayo>	 That said, I do confirm that it has no obvious breaking effects *without* the config, which is always worth checking. :D
[01:09:54] <annet>	 Urbanecm: all looks good!
[01:10:04] <Urbanecm>	 thanks annet, will sync
[01:10:11] <Urbanecm>	 Kemayo: great, thanks :)
[01:10:11] <annet>	 thanks :)
[01:10:27] <wikibugs>	 (PS2) Dzahn: admin: update SSH key for Volker_E [puppet] - https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628)
[01:10:53] <wikibugs>	 (Merged) jenkins-bot: Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191) (owner: DLynch)
[01:11:22] <Urbanecm>	 Kemayo: config patch fetched to mwdebug1001 as well
[01:12:05] <Kemayo>	 Urbanecm: Okay, looks good.
[01:12:15] <Urbanecm>	 thanks, will sync Kemayo 
[01:12:38] <Kemayo>	 Took me a minute to hunt down what the Indonesian for "reply" was to verify it, which maybe I should have prepped better for. >_>
[01:12:38] <Urbanecm>	 i assume it should be synced "backport first, config second", or should it be the other way around Kemayo ?
[01:12:38] <Daimona_>	 Urbanecm: Tested, works: https://www.mediawiki.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=64
[01:12:57] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/WikibaseMediaInfo/: 4b0259b761681ca90b3f3039019553ddca40a5fe: Distinguish between null continue value and unknown one (T272548) (duration: 00m 59s)
[01:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:00] <Kemayo>	 Urbanecm: Yeah, backport first, please.
[01:13:01] <stashbot>	 T272548: Adding a filter then changing tabs yields no results - https://phabricator.wikimedia.org/T272548
[01:13:04] <Urbanecm>	 Kemayo: ?uselang=en in your URL should fix that for you 🙂
[01:13:07] <Urbanecm>	 Kemayo: ack
[01:13:11] <Urbanecm>	 annet: should be live :)
[01:13:43] <annet>	 Urbanecm: confirmed, thanks very much!
[01:14:01] <Urbanecm>	 no problem annet :)
[01:14:28] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 70344 and 255 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:14:42] <Kemayo>	 Though really, it's fairly tightly focused, so even if it's out of order it'll only cause an issue for 50% of logged in people on talk pages... if they're looking for something they don't normally expect to be there.
[01:14:52] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/DiscussionTools/: 513a7861bbcf06a8ac5c29e1b9838640cbd7c628: A/B test output when a specific feature is being tested (T268191) (duration: 00m 55s)
[01:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:56] <stashbot>	 T268191: Implement A/B test bucketing - https://phabricator.wikimedia.org/T268191
[01:15:30] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 12696 and 315 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:15:48] <wikibugs>	 (CR) Dzahn: [C: -1] admin: update SSH key for Volker_E (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) (owner: Dzahn)
[01:16:39] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 376cba1b33dd68d40490a1498c59a4d430318ab1: Enroll idwiki in the DiscussionTools a/b test (T268191) (duration: 00m 55s)
[01:16:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:42] <Urbanecm>	 Kemayo: i prefer to test everything that can reasonably be tested, humans make mistakes, and I don't want to...break all of DiscussionTools, in the worst case 🙂
[01:16:49] <Urbanecm>	 anyway, should be live now
[01:17:07] <Kemayo>	 Urbanecm: Yup, looks all good on live. Thanks!
[01:17:49] <Urbanecm>	 no problem :)
[01:18:43] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/AbuseFilter/: 7d8ab70d5b00142e8344e242dd085eb7bfa81145: Dont return the status of doBlockInternal when processing block actions (duration: 00m 59s)
[01:18:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:51] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2374.codfw.wmnet'] `  an...
[01:19:01] <Urbanecm>	 Daimona_: synced yours as well :)
[01:19:20] <Urbanecm>	 so, we should be done
[01:19:25] <Urbanecm>	 !log Evening B&C window finished
[01:19:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:23] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2368.codfw.wmnet'] `  an...
[01:20:25] <Daimona_>	 Hooray! Thank you!
[01:20:45] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment.eqiad.wmnet/Design Style Guide for VolkerE - https://phabricator.wikimedia.org/T272628 (Dzahn) @jcrespo @Muehlenhoff or anyone. This ticket is much easier than it looks. It's existing access and only an update of an existing...
[01:21:29] <brennen>	 thanks all for the assistance earlier.
[01:21:29] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2366.codfw.wmnet'] `  an...
[01:21:51] * brennen steps away from computer.
[01:22:03] <wikibugs>	 (PS3) Dzahn: admin: update SSH key for Volker_E [puppet] - https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628)
[01:22:16] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2374.codfw.wmnet
[01:22:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:22] <wikibugs>	 (CR) DannyS712: [C: +1] mariadb: grant user 'phstats' additional select on phabricator_policy db [puppet] - https://gerrit.wikimedia.org/r/657692 (https://phabricator.wikimedia.org/T272654) (owner: Aklapper)
[01:22:29] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2366.codfw.wmnet
[01:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:37] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2368.codfw.wmnet
[01:22:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:00] <wikibugs>	 (CR) Dzahn: [C: +1] "looks good to me, you should ask marostegui /dba to deploy these though" [puppet] - https://gerrit.wikimedia.org/r/657692 (https://phabricator.wikimedia.org/T272654) (owner: Aklapper)
[01:25:43] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2374.codfw.wmnet
[01:25:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:59] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2366.codfw.wmnet
[01:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:09] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2368.codfw.wmnet
[01:26:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:58] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:42] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:18] <mutante>	 off - no more reimages for now
[02:10:46] <wikibugs>	 SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (AntiCompositeNumber) Amazon has announced plans to fork Elasticsearch and Kibana under the original Apache 2.0 license (...
[02:21:26] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[02:21:48] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[02:27:32] <wikibugs>	 (PS4) CRusnov: interface_automation: Clean up old interfaces on run [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199
[02:27:34] <wikibugs>	 (CR) CRusnov: interface_automation: Clean up old interfaces on run (3 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199 (owner: CRusnov)
[02:29:07] <wikibugs>	 (CR) CRusnov: "Thank you for reviewing as always 😊" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199 (owner: CRusnov)
[02:30:20] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:42] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:44:43] <wikibugs>	 (PS1) Andrew Bogott: Nova vendordata: fix initial apt repo for non-Buster [puppet] - https://gerrit.wikimedia.org/r/657720 (https://phabricator.wikimedia.org/T271273)
[02:45:49] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Nova vendordata: fix initial apt repo for non-Buster [puppet] - https://gerrit.wikimedia.org/r/657720 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott)
[02:51:02] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[02:51:28] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[03:09:16] <wikibugs>	 (PS1) Andrew Bogott: Nova vendordata: move a bunch of file writes from boot script to cloud-init [puppet] - https://gerrit.wikimedia.org/r/657721 (https://phabricator.wikimedia.org/T271273)
[03:20:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:22:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:31:52] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:38:34] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:43] <wikibugs>	 SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (Tgr) Apparently there is a more community-oriented fork in the works, too: https://logz.io/blog/open-source-elasticsearc...
[04:19:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:21:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:31:34] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:12] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:26] <wikibugs>	 (PS1) Andrew Bogott: openldap_clouddev: fix for new acme ca [puppet] - https://gerrit.wikimedia.org/r/657723
[04:48:14] <wikibugs>	 (CR) Andrew Bogott: [C: +2] openldap_clouddev: fix for new acme ca [puppet] - https://gerrit.wikimedia.org/r/657723 (owner: Andrew Bogott)
[04:59:44] <wikibugs>	 (PS2) Andrew Bogott: Nova vendordata: move a bunch of file writes from boot script to cloud-init [puppet] - https://gerrit.wikimedia.org/r/657721 (https://phabricator.wikimedia.org/T271273)
[05:07:32] <wikibugs>	 (PS3) Andrew Bogott: Nova vendordata: move a bunch of file writes from boot script to cloud-init [puppet] - https://gerrit.wikimedia.org/r/657721 (https://phabricator.wikimedia.org/T271273)
[05:31:44] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:24] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db1118 weight', diff saved to https://phabricator.wikimedia.org/P13883 and previous config saved to /var/cache/conftool/dbconfig/20210122-054330-marostegui.json
[05:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:52] <wikibugs>	 SRE, DBA, Phabricator, Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (Marostegui) Open→Resolved a:Marostegui Change has been applied - thanks daniel for working out the patch!
[05:51:56] <wikibugs>	 SRE, DBA, Phabricator, Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (Marostegui) Actually the original patch creator was @Aklapper so thank you too! :)
[05:58:58] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (releases2002), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:00:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2142 into x2 as codfw master T269324', diff saved to https://phabricator.wikimedia.org/P13884 and previous config saved to /var/cache/conftool/dbconfig/20210122-060007-marostegui.json
[06:00:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:11] <stashbot>	 T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324
[06:01:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2143 and db2144 as x2 codfw slaves T269324', diff saved to https://phabricator.wikimedia.org/P13885 and previous config saved to /var/cache/conftool/dbconfig/20210122-060147-marostegui.json
[06:01:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:03] <wikibugs>	 (PS1) Marostegui: db2133,db2078: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/657725 (https://phabricator.wikimedia.org/T272614)
[06:16:25] <marostegui>	 !log Stop MySQL on db1117 db2133 db2078 T272614
[06:16:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:28] <stashbot>	 T272614: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614
[06:17:24] <wikibugs>	 (CR) Marostegui: [C: +2] db2133,db2078: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/657725 (https://phabricator.wikimedia.org/T272614) (owner: Marostegui)
[06:17:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:18:45] <marostegui>	 proxy alerts will arrive to irc
[06:19:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:23:16] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[06:23:22] <marostegui>	 ^ expected
[06:23:32] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 0 down 3 https://wikitech.wikimedia.org/wiki/HAProxy
[06:24:40] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[06:26:38] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui maintenance ongoing https://wikitech.wikimedia.org/wiki/HAProxy
[06:26:38] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 0 down 3 Marostegui maintenance ongoing https://wikitech.wikimedia.org/wiki/HAProxy
[06:26:58] <icinga-wm>	 ACKNOWLEDGEMENT - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui maintenance ongoing https://wikitech.wikimedia.org/wiki/HAProxy
[06:31:34] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:16] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:45:01] <ryankemper>	 !log [wdqs] re-pooled `wdqs1013` (all caught up on lag)
[06:45:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:45] <ryankemper>	 !log [WDQS Deploy] All tests passing on canary `wdqs1003` before WDQS deploy, beginning deploy
[06:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:52] <logmsgbot>	 !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@70f9d37]: 0.3.60
[06:46:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:03] <ryankemper>	 !log [WDQS Deploy] All tests passing on canary `wdqs1003` following canary WDQS deploy, proceeding to rest of fleet
[06:50:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:35] <logmsgbot>	 !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@70f9d37]: 0.3.60 (duration: 10m 43s)
[06:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:30] <ryankemper>	 !log [WDQS Deploy] Initial deploy complete, `query.wikidata.org` handles queries fine, proceeding to post-deploy steps
[06:58:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:19] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[06:59:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:22] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[06:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:27] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[06:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:58] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[07:03:54] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[07:04:53] <wikibugs>	 (CR) Marostegui: "I am ok with this, but up to Luca and the Analytics Team :)" [puppet] - https://gerrit.wikimedia.org/r/655093 (https://phabricator.wikimedia.org/T269211) (owner: Bstorm)
[07:20:06] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:21:36] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:23:59] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (Marostegui) @RobH it looks like db1163 has RAID0 instead of RAID10: ` root@db1163:~# megacli -LdPdInfo -a0  Adapter #0  Number of Virtual Disks: 1 Virtual Drive: 0 (Targe...
[07:26:40] <ryankemper>	 !log [WDQS Deploy] WDQS deploy complete; service is healthy
[07:26:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:24] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:28] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:29] <wikibugs>	 (CR) Elukey: [C: +1] wikireplicas: remove query killer from dedicated replica server [puppet] - https://gerrit.wikimedia.org/r/655093 (https://phabricator.wikimedia.org/T269211) (owner: Bstorm)
[07:30:42] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 2, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:35:01] <wikibugs>	 (PS6) Elukey: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[07:35:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 2, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:36:50] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:36:56] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:37:00] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:32] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment.eqiad.wmnet/Design Style Guide for VolkerE - https://phabricator.wikimedia.org/T272628 (Volker_E) Thanks for observing @Dzahn!
[07:41:26] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:42:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:08] <wikibugs>	 (PS1) Joal: profile::analytics::refinery::job::druid_load Bump jar version [puppet] - https://gerrit.wikimedia.org/r/657764 (https://phabricator.wikimedia.org/T271560)
[07:44:35] <joal>	 elukey: --^
[07:51:15] <wikibugs>	 SRE, Traffic, serviceops: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (JMeybohm) a:JMeybohm Unfortunately, upstream was not very responsive on my question about adding `Cache-Control` (https://github.com/helm/chartmuseum/issues/36...
[07:52:08] <wikibugs>	 (CR) Elukey: "There is a similar profile with "test" in the name, that runs in hadoop test, can you also update it? After that I'll merge :)" [puppet] - https://gerrit.wikimedia.org/r/657764 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[07:53:03] <wikibugs>	 (PS1) Ryan Kemper: wcqs: create wcqs microsite && move gui [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851)
[07:55:44] <wikibugs>	 (PS4) Elukey: Initial configuration of the Hadoop backup cluster [puppet] - https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411)
[07:55:51] <wikibugs>	 (PS1) Joal: Lower druid-public historical datasource backups [puppet] - https://gerrit.wikimedia.org/r/657766 (https://phabricator.wikimedia.org/T272670)
[07:57:26] <wikibugs>	 (CR) Elukey: [C: +2] Initial configuration of the Hadoop backup cluster [puppet] - https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[07:58:53] <wikibugs>	 (PS2) Joal: profile::analytics::refinery::job::druid_load Bump jar version [puppet] - https://gerrit.wikimedia.org/r/657764 (https://phabricator.wikimedia.org/T271560)
[07:59:24] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27588/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: Ryan Kemper)
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210122T0800)
[08:04:50] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:06:28] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:11:47] <wikibugs>	 (03PS2) 10Ryan Kemper: wcqs: create wcqs microsite && move gui [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851)
[08:11:50] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:13:21] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27591/console" [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper)
[08:13:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:13:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wcqs: create wcqs microsite && move gui [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper)
[08:15:02] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for netbox/apache [puppet] - 10https://gerrit.wikimedia.org/r/656187 (https://phabricator.wikimedia.org/T135991)
[08:15:49] <wikibugs>	 (03PS3) 10Ryan Kemper: wcqs: create wcqs microsite && move gui [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851)
[08:16:35] <wikibugs>	 (03CR) 10Ryan Kemper: "One thing I'm not sure of:" [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper)
[08:16:42] <wikibugs>	 (03PS1) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411)
[08:17:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wcqs: create wcqs microsite && move gui [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper)
[08:18:12] <wikibugs>	 (03PS2) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411)
[08:20:08] <wikibugs>	 (03CR) 10Ryan Kemper: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper)
[08:21:01] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27593/console" [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper)
[08:23:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable managed adduser.conf unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/657770 (https://phabricator.wikimedia.org/T235162)
[08:26:31] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] config: allow using ~ for cookbook path [software/spicerack] - 10https://gerrit.wikimedia.org/r/657608 (owner: 10David Caro)
[08:26:33] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] gitignore: add vim swap files [software/spicerack] - 10https://gerrit.wikimedia.org/r/657609 (owner: 10David Caro)
[08:26:51] <wikibugs>	 (03CR) 10David Caro: [V: 03+2 C: 03+2] config: allow using ~ for cookbook path [software/spicerack] - 10https://gerrit.wikimedia.org/r/657608 (owner: 10David Caro)
[08:27:00] <wikibugs>	 (03CR) 10David Caro: [V: 03+2 C: 03+2] gitignore: add vim swap files [software/spicerack] - 10https://gerrit.wikimedia.org/r/657609 (owner: 10David Caro)
[08:27:38] <wikibugs>	 (03PS1) 10Ayounsi: Remove BGP for Zayo transit in ulsfo, eqiad, eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657771 (https://phabricator.wikimedia.org/T264772)
[08:28:18] <wikibugs>	 (03PS1) 10JMeybohm: cache_text: Disable caching for helm-charts.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/657772 (https://phabricator.wikimedia.org/T272633)
[08:28:26] <wikibugs>	 (03CR) 10David Caro: "Hmm... I think I should not have submitted, it seems that jenkins will..." [software/spicerack] - 10https://gerrit.wikimedia.org/r/657609 (owner: 10David Caro)
[08:30:58] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10MoritzMuehlenhoff)
[08:31:07] <wikibugs>	 (03CR) 10Ema: [C: 03+1] cache_text: Disable caching for helm-charts.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/657772 (https://phabricator.wikimedia.org/T272633) (owner: 10JMeybohm)
[08:31:38] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:33:48] <elukey>	 !log update puppet compiler's facts 
[08:33:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:57] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove BGP for Zayo transit in ulsfo, eqiad, eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657771 (https://phabricator.wikimedia.org/T264772) (owner: 10Ayounsi)
[08:35:12] <XioNoX>	 !log Remove BGP for Zayo transit in ulsfo, eqiad, eqord
[08:35:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:46] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:41:31] <wikibugs>	 (03Merged) 10jenkins-bot: Remove BGP for Zayo transit in ulsfo, eqiad, eqord [homer/public] - 10https://gerrit.wikimedia.org/r/657771 (https://phabricator.wikimedia.org/T264772) (owner: 10Ayounsi)
[08:42:57] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27597/console" [puppet] - 10https://gerrit.wikimedia.org/r/657772 (https://phabricator.wikimedia.org/T272633) (owner: 10JMeybohm)
[08:43:31] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] cache_text: Disable caching for helm-charts.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/657772 (https://phabricator.wikimedia.org/T272633) (owner: 10JMeybohm)
[08:44:37] <moritzm>	 !log installing mutt updates for stretch
[08:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:39] <moritzm>	 !log installing PIP security updates for stretch
[08:49:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:43] <wikibugs>	 10SRE, 10serviceops: Update php-xdebug to 2.9.2 in apt.wm.o component/php72 - https://phabricator.wikimedia.org/T244716 (10hashar)
[08:52:45] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "See comments inline" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper)
[08:52:51] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10amy_rc) @jcrespo My internship started on 01.01.2021 and ends on 31.03.2021.
[08:53:20] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::druid_load Bump jar version [puppet] - 10https://gerrit.wikimedia.org/r/657764 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal)
[08:53:20] <wikibugs>	 10SRE, 10FR-MW-Vagrant, 10Fundraising-Backlog, 10MediaWiki-Vagrant, 10Patch-For-Review: Package XDebug 2.9 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (10hashar)
[08:56:48] <wikibugs>	 (03PS7) 10Elukey: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal)
[08:59:02] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] wcqs: create wcqs microsite && move gui (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper)
[09:00:55] <wikibugs>	 (03PS1) 10Gehel: query_service: remove monitoring check for UI. [puppet] - 10https://gerrit.wikimedia.org/r/657773 (https://phabricator.wikimedia.org/T271851)
[09:03:25] <wikibugs>	 (03PS3) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411)
[09:03:27] <wikibugs>	 (03PS1) 10Elukey: Move hiera config for Hadoop Backup to the correct location [puppet] - 10https://gerrit.wikimedia.org/r/657774 (https://phabricator.wikimedia.org/T260411)
[09:04:39] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10jcrespo)
[09:05:24] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10jcrespo) @Volker_E would it be possible to verify your identity and key ownership on a videocall with your manager?
[09:05:37] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10jcrespo) p:05Triage→03High
[09:05:49] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Lower druid-public historical datasource backups [puppet] - 10https://gerrit.wikimedia.org/r/657766 (https://phabricator.wikimedia.org/T272670) (owner: 10Joal)
[09:08:13] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] query_service: remove monitoring check for UI. [puppet] - 10https://gerrit.wikimedia.org/r/657773 (https://phabricator.wikimedia.org/T271851) (owner: 10Gehel)
[09:08:25] <wikibugs>	 (03CR) 10Hashar: "That change broke Puppet on deploy-1002.devtools.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) (owner: 10Ottomata)
[09:08:35] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] query_service: remove monitoring check for UI. [puppet] - 10https://gerrit.wikimedia.org/r/657773 (https://phabricator.wikimedia.org/T271851) (owner: 10Gehel)
[09:13:50] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) Thanks, amy_rc, I will put that as the time to communicate with your manager to review your access status by then :-)
[09:19:46] <wikibugs>	 10SRE, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10Aklapper)
[09:20:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov)
[09:21:47] <wikibugs>	 (03CR) 10Ladsgroup: [C: 04-1] wcqs: create wcqs microsite && move gui (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper)
[09:27:39] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Move hiera config for Hadoop Backup to the correct location [puppet] - 10https://gerrit.wikimedia.org/r/657774 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[09:27:46] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:28:02] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:28:39] <wikibugs>	 (03PS4) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411)
[09:30:48] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:34:50] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:36:36] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:43:37] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Temporarily add db1088 to api group T271106', diff saved to https://phabricator.wikimedia.org/P13887 and previous config saved to /var/cache/conftool/dbconfig/20210122-094337-kormat.json
[09:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:41] <stashbot>	 T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106
[09:44:34] <wikibugs>	 (03PS16) 10MSantos: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949)
[09:44:48] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1093.eqiad.wmnet with reason: Rebooting for T272255
[09:44:48] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1093.eqiad.wmnet with reason: Rebooting for T272255
[09:44:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:53] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1093 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13888 and previous config saved to /var/cache/conftool/dbconfig/20210122-094453-kormat.json
[09:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:59] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13889 and previous config saved to /var/cache/conftool/dbconfig/20210122-095058-kormat.json
[09:51:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:34] <moritzm>	 !log uploaded cairo 1.14.0-2.1+deb8u2+wmf1 to apt.wikimedia.org
[09:52:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:24] <wikibugs>	 (03PS1) 10Ayounsi: Add Lumen transit BGP in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/657777 (https://phabricator.wikimedia.org/T269808)
[09:54:06] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2133,db2078: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657657
[09:55:33] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db2133,db2078: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657657 (owner: 10Marostegui)
[10:02:34] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Temporarily add db1110 to api group T272255', diff saved to https://phabricator.wikimedia.org/P13890 and previous config saved to /var/cache/conftool/dbconfig/20210122-100233-kormat.json
[10:02:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:03] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1130.eqiad.wmnet with reason: Rebooting for T272255
[10:03:03] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1130.eqiad.wmnet with reason: Rebooting for T272255
[10:03:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:08] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1130 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13891 and previous config saved to /var/cache/conftool/dbconfig/20210122-100307-kormat.json
[10:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:35] <wikibugs>	 10SRE, 10vm-requests: eqiad/codfw: 1 VMs requested for puppetboard - https://phabricator.wikimedia.org/T272683 (10MoritzMuehlenhoff)
[10:06:02] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13892 and previous config saved to /var/cache/conftool/dbconfig/20210122-100602-kormat.json
[10:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:35] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13893 and previous config saved to /var/cache/conftool/dbconfig/20210122-100734-kormat.json
[10:07:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:08] <wikibugs>	 (03PS1) 10Elukey: Add fake secrets for the Hadoop backup cluster [labs/private] - 10https://gerrit.wikimedia.org/r/657779
[10:08:11] <wikibugs>	 10SRE, 10DBA, 10Phabricator, 10Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (10Aklapper) 05Resolved→03Open >>! In T272654#6767902, @Marostegui wrote: > Change has been applied  (Thanks everyone.) Hmm, https://gerrit.wikim...
[10:08:43] <wikibugs>	 (03CR) 10ArielGlenn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: 10Hoo man)
[10:08:45] <wikibugs>	 (03PS2) 10Elukey: Add fake secrets for the Hadoop backup cluster [labs/private] - 10https://gerrit.wikimedia.org/r/657779
[10:09:00] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake secrets for the Hadoop backup cluster [labs/private] - 10https://gerrit.wikimedia.org/r/657779 (owner: 10Elukey)
[10:11:27] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Thank you daniel for the patch, I verified on video call Volker's identity through his manager Lucy and he confirmed this is his new key." [puppet] - 10https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) (owner: 10Dzahn)
[10:13:46] <wikibugs>	 (03PS3) 10Aklapper: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) (owner: 10A2569875)
[10:15:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wikireplicas: set up LVS for multiinstance wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm)
[10:16:05] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host archiva1002.wikimedia.org
[10:16:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:12] <wikibugs>	 10SRE, 10DBA, 10Phabricator, 10Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (10Marostegui) 05Open→03Resolved I forgot the m3-slave CNAME uses the hostname directly instead of the proxy, which is not nice but we can fix th...
[10:18:34] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host archiva1002.wikimedia.org
[10:18:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:06] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13894 and previous config saved to /var/cache/conftool/dbconfig/20210122-102105-kormat.json
[10:21:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:20] <wikibugs>	 10SRE, 10DBA, 10Phabricator, 10Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (10Aklapper) Yes, works now! Does that mean https://gerrit.wikimedia.org/r/c/operations/puppet/+/657692/ should be closed or abandoned or so? Thanks! <3
[10:22:38] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13895 and previous config saved to /var/cache/conftool/dbconfig/20210122-102237-kormat.json
[10:22:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:00] <wikibugs>	 (03PS1) 10David Caro: wmcs.enc: added a small cli to be able to use the enc [puppet] - 10https://gerrit.wikimedia.org/r/657780 (https://phabricator.wikimedia.org/T267412)
[10:24:10] <wikibugs>	 10SRE, 10Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (10jcrespo)
[10:27:15] <wikibugs>	 (03PS1) 10Elukey: Add fake keytabs for the new hadoop worker nodes [labs/private] - 10https://gerrit.wikimedia.org/r/657781
[10:27:23] <wikibugs>	 10SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (10jcrespo) I reached this through alerts of backups of releases2002 not working since 2021-01-21. I will disable alerts for this host until fixed (please ping me to re-enable them when fixed).
[10:27:51] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake keytabs for the new hadoop worker nodes [labs/private] - 10https://gerrit.wikimedia.org/r/657781 (owner: 10Elukey)
[10:29:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Adapt proxy setting in debmonitor nginx site for CAS [puppet] - 10https://gerrit.wikimedia.org/r/657782
[10:29:29] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27611/console" [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[10:30:35] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:30:38] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Ignore releases2002 backup errors until vm issues are fixed [puppet] - 10https://gerrit.wikimedia.org/r/657783 (https://phabricator.wikimedia.org/T272555)
[10:30:59] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27612/console" [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[10:32:00] <wikibugs>	 (03PS4) 10Jcrespo: admin: update SSH key for Volker_E [puppet] - 10https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) (owner: 10Dzahn)
[10:33:39] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] admin: update SSH key for Volker_E [puppet] - 10https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) (owner: 10Dzahn)
[10:36:01] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:36:05] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10jcrespo)
[10:36:09] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13897 and previous config saved to /var/cache/conftool/dbconfig/20210122-103609-kormat.json
[10:36:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:41] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13898 and previous config saved to /var/cache/conftool/dbconfig/20210122-103741-kormat.json
[10:37:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:03] <wikibugs>	 (03PS5) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411)
[10:38:05] <wikibugs>	 (03PS1) 10Elukey: profile::hadoop::worker: make client tools optional [puppet] - 10https://gerrit.wikimedia.org/r/657784 (https://phabricator.wikimedia.org/T260411)
[10:39:48] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10jcrespo) 05Open→03Resolved a:03jcrespo @Volker_E The new key has just been deployed, it will take up to 30 minutes more or less to be deployed everywhere. Ple...
[10:40:48] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27613/console" [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[10:42:02] <wikibugs>	 (03PS2) 10Elukey: profile::hadoop::worker: make client tools optional [puppet] - 10https://gerrit.wikimedia.org/r/657784 (https://phabricator.wikimedia.org/T260411)
[10:42:04] <wikibugs>	 (03PS6) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411)
[10:42:06] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Ignore releases2002 backup errors until vm issues are fixed [puppet] - 10https://gerrit.wikimedia.org/r/657783 (https://phabricator.wikimedia.org/T272555)
[10:43:15] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27614/console" [puppet] - 10https://gerrit.wikimedia.org/r/657784 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[10:44:14] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::hadoop::worker: make client tools optional [puppet] - 10https://gerrit.wikimedia.org/r/657784 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[10:44:34] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] bacula: Ignore releases2002 backup errors until vm issues are fixed [puppet] - 10https://gerrit.wikimedia.org/r/657783 (https://phabricator.wikimedia.org/T272555) (owner: 10Jcrespo)
[10:45:23] <jynus>	 elukey, not sure if we merged at the same time or your change went through?
[10:45:40] <jynus>	 it merged already at prod, so no issue
[10:46:11] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff)
[10:46:17] <elukey>	 jynus: I think so, I saw only my change and it didn't lead to errors afaics
[10:46:23] <jynus>	 indeed :-)
[10:48:06] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, one minor comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657780 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro)
[10:48:51] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:49:04] <jynus>	 yay
[10:49:32] <jynus>	 if you can't fix a problem, just hide it ^ (I am joking)
[10:50:29] <wikibugs>	 (03PS1) 10Jcrespo: Revert "bacula: Ignore releases2002 backup errors until vm issues are fixed" [puppet] - 10https://gerrit.wikimedia.org/r/657659
[10:51:26] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "To be deployed before closing T272555" [puppet] - 10https://gerrit.wikimedia.org/r/657659 (owner: 10Jcrespo)
[10:52:45] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13899 and previous config saved to /var/cache/conftool/dbconfig/20210122-105244-kormat.json
[10:52:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:45] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db1088 from api group T271106', diff saved to https://phabricator.wikimedia.org/P13900 and previous config saved to /var/cache/conftool/dbconfig/20210122-105345-kormat.json
[10:53:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:49] <stashbot>	 T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106
[10:54:35] <wikibugs>	 (03PS7) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411)
[10:56:31] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1134.eqiad.wmnet with reason: Rebooting for T272255
[10:56:32] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1134.eqiad.wmnet with reason: Rebooting for T272255
[10:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:37] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1134 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13901 and previous config saved to /var/cache/conftool/dbconfig/20210122-105636-kormat.json
[10:56:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:05] <wikibugs>	 (03PS1) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785
[10:57:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman)
[10:58:47] <wikibugs>	 (03PS2) 10David Caro: wmcs.enc: added a small cli to be able to use the enc [puppet] - 10https://gerrit.wikimedia.org/r/657780 (https://phabricator.wikimedia.org/T267412)
[10:59:21] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Temporarily add db1127 to api group T272255', diff saved to https://phabricator.wikimedia.org/P13902 and previous config saved to /var/cache/conftool/dbconfig/20210122-105921-kormat.json
[10:59:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:47] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1136.eqiad.wmnet with reason: Rebooting for T272255
[10:59:47] <wikibugs>	 (03CR) 10David Caro: wmcs.enc: added a small cli to be able to use the enc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657780 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro)
[10:59:47] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1136.eqiad.wmnet with reason: Rebooting for T272255
[10:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:52] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1136 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13903 and previous config saved to /var/cache/conftool/dbconfig/20210122-105952-kormat.json
[10:59:54] <wikibugs>	 (03PS2) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785
[10:59:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] role::ml-serve: Add ml-serve machine role [puppet] - https://gerrit.wikimedia.org/r/657785 (owner: Klausman)
[11:00:25] <wikibugs>	 (CR) David Caro: [C: +2] wmcs.enc: added a small cli to be able to use the enc [puppet] - https://gerrit.wikimedia.org/r/657780 (https://phabricator.wikimedia.org/T267412) (owner: David Caro)
[11:01:12] <wikibugs>	 (PS3) Klausman: role::ml-serve: Add ml-serve machine role [puppet] - https://gerrit.wikimedia.org/r/657785
[11:01:32] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13904 and previous config saved to /var/cache/conftool/dbconfig/20210122-110132-kormat.json
[11:01:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] role::ml-serve: Add ml-serve machine role [puppet] - https://gerrit.wikimedia.org/r/657785 (owner: Klausman)
[11:01:55] <wikibugs>	 (PS4) Klausman: role::ml-serve: Add ml-serve machine role [puppet] - https://gerrit.wikimedia.org/r/657785
[11:02:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] role::ml-serve: Add ml-serve machine role [puppet] - https://gerrit.wikimedia.org/r/657785 (owner: Klausman)
[11:02:25] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1141.eqiad.wmnet with reason: Rebooting for T272255
[11:02:25] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1141.eqiad.wmnet with reason: Rebooting for T272255
[11:02:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:30] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1141 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13905 and previous config saved to /var/cache/conftool/dbconfig/20210122-110229-kormat.json
[11:02:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:55] <jbond42>	 !log deploy cairo updates to jessie
[11:05:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:03] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13906 and previous config saved to /var/cache/conftool/dbconfig/20210122-110603-kormat.json
[11:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:36] <wikibugs>	 (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27615/" [puppet] - https://gerrit.wikimedia.org/r/657782 (owner: Muehlenhoff)
[11:09:03] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader2001.wikimedia.org
[11:09:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:45] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2001.wikimedia.org
[11:11:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:47] <wikibugs>	 (PS5) Klausman: role::ml-serve: Add ml-serve machine role [puppet] - https://gerrit.wikimedia.org/r/657785
[11:12:55] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[11:13:07] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27616/console" [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[11:13:24] <vgutierrez>	 sigh... that should be silent :)
[11:15:01] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13908 and previous config saved to /var/cache/conftool/dbconfig/20210122-111500-kormat.json
[11:15:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:06] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader1001.wikimedia.org
[11:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:36] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13909 and previous config saved to /var/cache/conftool/dbconfig/20210122-111635-kormat.json
[11:16:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:50] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1001.wikimedia.org
[11:17:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:13] <wikibugs>	 (PS8) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102
[11:19:25] <wikibugs>	 (CR) Jbond: sre: convert the generic reboot functions to the cookbook class API (13 comments) [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[11:19:59] <wikibugs>	 (CR) Elukey: [C: +2] Add roles to the Hadoop Backup cluster nodes [puppet] - https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[11:20:23] <wikibugs>	 (PS9) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102
[11:20:36] <wikibugs>	 (PS11) Jbond: cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139
[11:21:07] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13910 and previous config saved to /var/cache/conftool/dbconfig/20210122-112106-kormat.json
[11:21:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:18] <wikibugs>	 (PS1) Muehlenhoff: Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/657786
[11:23:43] <wikibugs>	 (PS12) Jbond: cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139
[11:24:15] <hnowlan>	 !log joining restbase2009-a to cluster
[11:24:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:24] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'es1023 depooling: enable report_host T271106', diff saved to https://phabricator.wikimedia.org/P13911 and previous config saved to /var/cache/conftool/dbconfig/20210122-112424-kormat.json
[11:24:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:28] <stashbot>	 T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106
[11:24:43] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-a valid until 2022-06-15 10:34:42 +0000 (expires in 508 days) https://phabricator.wikimedia.org/T120662
[11:24:49] <icinga-wm>	 RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:26:29] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mwdebug1003.eqiad.wmnet
[11:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:04] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13912 and previous config saved to /var/cache/conftool/dbconfig/20210122-113004-kormat.json
[11:30:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:47] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.117 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[11:31:07] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:39] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13913 and previous config saved to /var/cache/conftool/dbconfig/20210122-113139-kormat.json
[11:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:10] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13914 and previous config saved to /var/cache/conftool/dbconfig/20210122-113610-kormat.json
[11:36:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:19] <wikibugs>	 SRE, Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (LSobanski) p:Triage→Medium Sounds like a good idea. Is this to address a specific concern that came up? One thing that comes to mind is the amount, re...
[11:36:34] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on es1023.eqiad.wmnet with reason: Reboot for T272121
[11:36:35] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1023.eqiad.wmnet with reason: Reboot for T272121
[11:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:58] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica IO: es5 #page on es1023 is CRITICAL: CRITICAL slave_io_state could not connect Kormat Forgot to DT host first... https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:36:59] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica Lag: es5 #page on es1023 is CRITICAL: CRITICAL slave_sql_lag could not connect Kormat Forgot to DT host first... https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:37:00] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica SQL: es5 #page on es1023 is CRITICAL: CRITICAL slave_sql_state could not connect Kormat Forgot to DT host first... https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:37:00] <icinga-wm>	 ACKNOWLEDGEMENT - mysqld processes #page on es1023 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Kormat Forgot to DT host first... https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[11:37:41] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:49] <wikibugs>	 (PS1) Hnowlan: similar-users: release new version [deployment-charts] - https://gerrit.wikimedia.org/r/657789
[11:40:00] <wikibugs>	 SRE, Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (jcrespo) > Is this to address a specific concern that came up?  ^@mark
[11:40:08] <wikibugs>	 (CR) Vgutierrez: [C: -1] "considering that you need to expose dbproxy1018 and dbproxy1019 I'd stick with two services, wikireplicas-web and wikireplicas-analytics" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[11:40:17] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1003.eqiad.wmnet
[11:40:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:43] <wikibugs>	 (PS1) Elukey: role::analytics_backup_cluster::hadoop::master: remove analytics keytab [puppet] - https://gerrit.wikimedia.org/r/657790
[11:42:24] <revi>	 can someone give me this week's train ticket link?
[11:43:08] <wikibugs>	 (CR) Elukey: [C: +2] role::analytics_backup_cluster::hadoop::master: remove analytics keytab [puppet] - https://gerrit.wikimedia.org/r/657790 (owner: Elukey)
[11:43:10] <wikibugs>	 SRE, Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (mark) It's purely an idea I've had for a long time, to make it immediately obvious to anyone logging in what is backed up, and what isn't. That should help to...
[11:44:51] <revi>	 nvm.
[11:45:08] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13915 and previous config saved to /var/cache/conftool/dbconfig/20210122-114507-kormat.json
[11:45:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:43] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13916 and previous config saved to /var/cache/conftool/dbconfig/20210122-114642-kormat.json
[11:46:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:03] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01032 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[11:49:05] <wikibugs>	 SRE, Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (jcrespo) 2 notices: * This could only cover direct bacula backups (things that are indirectly backed up, like puppet or gerrit repos) or database and media ba...
[11:50:09] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on es1023.eqiad.wmnet with reason: Extended reboot for T272121
[11:50:10] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on es1023.eqiad.wmnet with reason: Extended reboot for T272121
[11:50:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:31] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:50:37] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:05] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:07] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:14] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13917 and previous config saved to /var/cache/conftool/dbconfig/20210122-115113-kormat.json
[11:51:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:27] <wikibugs>	 (CR) Hnowlan: [C: +2] similar-users: release new version [deployment-charts] - https://gerrit.wikimedia.org/r/657789 (owner: Hnowlan)
[11:53:01] <wikibugs>	 (Merged) jenkins-bot: similar-users: release new version [deployment-charts] - https://gerrit.wikimedia.org/r/657789 (owner: Hnowlan)
[11:54:31] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[11:54:31] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[11:54:53] <icinga-wm>	 RECOVERY - Host releases2002 is UP: PING OK - Packet loss = 0%, RTA = 32.31 ms
[11:54:55] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[11:54:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:37] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[11:56:49] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1127 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:49] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:51] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[11:56:51] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[11:58:25] <elukey>	 sorrryyy this is me
[11:58:28] <elukey>	 new cluster, just downtimed
[11:59:03] <wikibugs>	 (PS1) Elukey: profile::hadoop::worker: add spark2 back [puppet] - https://gerrit.wikimedia.org/r/657791
[11:59:18] <wikibugs>	 SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (akosiaris) I think the following explains it:  ` root@releases2002:~# ip addr ls 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000     link/loopback 00:00:00:00:00:00...
[12:00:11] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13918 and previous config saved to /var/cache/conftool/dbconfig/20210122-120011-kormat.json
[12:00:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:40] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27617/console" [puppet] - https://gerrit.wikimedia.org/r/657791 (owner: Elukey)
[12:00:54] <wikibugs>	 SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (akosiaris) Open→Resolved a:akosiaris Anyway, s/ens5/ens6/ in /etc/network/interfaces and the issue has been fixed. I was wondering whether it makes sense to invest time to "fix" this but having me...
[12:02:10] <wikibugs>	 (Abandoned) Elukey: profile::hadoop::worker: add spark2 back [puppet] - https://gerrit.wikimedia.org/r/657791 (owner: Elukey)
[12:02:58] <wikibugs>	 SRE, LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (Aklapper) >>! In T272489#6766201, @jcrespo wrote: > on the Engineering's handbook To avoid misunderstandings and to allow me to get a better overview of onboarding docs, which "Engineering's handbook...
[12:03:56] <wikibugs>	 (PS1) Muehlenhoff: debmonitor: Also allow localhost [puppet] - https://gerrit.wikimedia.org/r/657793
[12:05:17] <wikibugs>	 SRE, LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (jcrespo) @Aklapper I pinned down the source of poorly documented request to https://office.wikimedia.org/w/index.php?title=Guide_for_new_engineering_staff&diff=prev&oldid=284690 and tried to fix it t...
[12:06:40] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man)
[12:08:48] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1135,1137].eqiad.wmnet
[12:08:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:29] <wikibugs>	 (PS10) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102
[12:10:39] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1135,1137].eqiad.wmnet
[12:10:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:06] <wikibugs>	 SRE, vm-requests: eqiad/codfw: 1 VMs requested for puppetboard - https://phabricator.wikimedia.org/T272683 (jbond) Looks good to me
[12:11:47] <wikibugs>	 (PS13) Jbond: cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139
[12:12:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[12:12:51] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[12:14:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139 (owner: Jbond)
[12:16:33] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:39] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:19:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:26:42] <wikibugs>	 (PS1) Muehlenhoff: debmonitor: Don't include debmonitor_static for the internal listener [puppet] - https://gerrit.wikimedia.org/r/657795
[12:30:44] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:33:03] <wikibugs>	 SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (jcrespo) Resolved→Open Let me reopen for reenabling backup monitoring (even if main issue has been fixed).  Thanks @akosiaris for the help here.
[12:33:24] <wikibugs>	 SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (jcrespo) a:akosiaris→jcrespo
[12:33:39] <logmsgbot>	 !log volker-e@deploy1001 Started deploy [design/style-guide@9a811b8]: Deploy design/style-guide: 9a811b8 Add Language selectors to component overview Sketch document (#424)
[12:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:46] <logmsgbot>	 !log volker-e@deploy1001 Finished deploy [design/style-guide@9a811b8]: Deploy design/style-guide: 9a811b8 Add Language selectors to component overview Sketch document (#424) (duration: 00m 07s)
[12:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:56] <wikibugs>	 (PS11) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102
[12:35:18] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 25%: Reboot T272121', diff saved to https://phabricator.wikimedia.org/P13919 and previous config saved to /var/cache/conftool/dbconfig/20210122-123518-kormat.json
[12:35:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:56] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:36:19] <wikibugs>	 (PS14) Jbond: cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139
[12:38:31] <wikibugs>	 (PS2) Jcrespo: Revert "bacula: Ignore releases2002 backup errors until vm issues are fixed" [puppet] - https://gerrit.wikimedia.org/r/657659
[12:38:33] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db1127 from api group T272255', diff saved to https://phabricator.wikimedia.org/P13920 and previous config saved to /var/cache/conftool/dbconfig/20210122-123832-kormat.json
[12:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:47] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host puppetboard1002.eqiad.wmnet
[12:38:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:50] <wikibugs>	 (PS1) Elukey: profile::hadoop::backup::namenode: add dep to analytics-admin [puppet] - https://gerrit.wikimedia.org/r/657797
[12:39:42] <wikibugs>	 (PS15) Jbond: cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139
[12:40:17] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "bacula: Ignore releases2002 backup errors until vm issues are fixed" [puppet] - https://gerrit.wikimedia.org/r/657659 (owner: Jcrespo)
[12:41:16] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27619/console" [puppet] - https://gerrit.wikimedia.org/r/657791 (owner: Elukey)
[12:41:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[12:42:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139 (owner: Jbond)
[12:43:11] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db1110 from api group T272255', diff saved to https://phabricator.wikimedia.org/P13921 and previous config saved to /var/cache/conftool/dbconfig/20210122-124310-kormat.json
[12:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:03] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27620/console" [puppet] - https://gerrit.wikimedia.org/r/657797 (owner: Elukey)
[12:46:44] <wikibugs>	 (PS12) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102
[12:46:51] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] profile::hadoop::backup::namenode: add dep to analytics-admin [puppet] - https://gerrit.wikimedia.org/r/657797 (owner: Elukey)
[12:47:09] <wikibugs>	 SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (jcrespo) Open→Resolved a:jcrespo→akosiaris Backing up 0 bytes was unsurprisingly fast :-) Thanks again to both of you.
[12:47:43] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1149.eqiad.wmnet with reason: Rebooting for T272255
[12:47:44] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1149.eqiad.wmnet with reason: Rebooting for T272255
[12:47:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:49] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13922 and previous config saved to /var/cache/conftool/dbconfig/20210122-124748-kormat.json
[12:47:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[12:50:06] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.429 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[12:50:22] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 50%: Reboot T272121', diff saved to https://phabricator.wikimedia.org/P13923 and previous config saved to /var/cache/conftool/dbconfig/20210122-125021-kormat.json
[12:50:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:00] <wikibugs>	 SRE, LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (Aklapper) Ah, thanks! <3
[12:52:24] <wikibugs>	 (CR) Muehlenhoff: role::ml-serve: Add ml-serve machine role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657785 (owner: Klausman)
[12:53:05] <wikibugs>	 (PS1) Elukey: Use profile::standard::admin_groups in Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/657800
[12:53:06] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host puppetboard1002.eqiad.wmnet
[12:53:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:54] <wikibugs>	 (CR) Elukey: [C: +2] Use profile::standard::admin_groups in Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/657800 (owner: Elukey)
[12:54:29] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host puppetboard2002.codfw.wmnet
[12:54:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:33] <wikibugs>	 (PS1) Jcrespo: mariadb-backups: Add docs for logical backups grants throughout production dbs [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T146149)
[12:56:13] <wikibugs>	 (PS1) Jbond: sre.misc-clusters.scb: create batch action cook book for scb [cookbooks] - https://gerrit.wikimedia.org/r/657802
[12:56:46] <wikibugs>	 (CR) Jcrespo: "🙈" [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo)
[12:59:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre.misc-clusters.scb: create batch action cook book for scb [cookbooks] - https://gerrit.wikimedia.org/r/657802 (owner: Jbond)
[12:59:33] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13924 and previous config saved to /var/cache/conftool/dbconfig/20210122-125932-kormat.json
[12:59:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:25] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 75%: Reboot T272121', diff saved to https://phabricator.wikimedia.org/P13925 and previous config saved to /var/cache/conftool/dbconfig/20210122-130525-kormat.json
[13:05:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:28] <wikibugs>	 (03PS1) 10Elukey: hadoop: make Yarn Spark Shuffle optional [puppet] - 10https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411)
[13:06:47] <wikibugs>	 (03Abandoned) 10Jbond: sre.misc-clusters.scb: create batch action cook book for scb [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond)
[13:07:30] <wikibugs>	 (03Restored) 10Jbond: sre.misc-clusters.scb: create batch action cook book for scb [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond)
[13:07:37] <wikibugs>	 (03PS2) 10Elukey: hadoop: make Yarn Spark Shuffle optional [puppet] - 10https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411)
[13:08:08] <wikibugs>	 (CR) Jbond: sre.misc-clusters.scb: create batch action cook book for scb (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/657802 (owner: Jbond)
[13:09:04] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27621/console" [puppet] - 10https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[13:10:06] <wikibugs>	 (03PS2) 10Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - 10https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T146149)
[13:11:34] <wikibugs>	 (03PS13) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102
[13:11:37] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27622/console" [puppet] - 10https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[13:11:39] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: grant user 'phstats' additional select on phabricator_policy db [puppet] - 10https://gerrit.wikimedia.org/r/657692 (https://phabricator.wikimedia.org/T272654) (owner: 10Aklapper)
[13:12:02] <wikibugs>	 10SRE, 10DBA, 10Phabricator, 10Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (10Marostegui) Just merged it! :)
[13:13:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[13:14:36] <wikibugs>	 (03PS3) 10Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - 10https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929)
[13:14:36] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13926 and previous config saved to /var/cache/conftool/dbconfig/20210122-131436-kormat.json
[13:14:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:19] <wikibugs>	 (CR) Elukey: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[13:19:53] <wikibugs>	 (03Abandoned) 10Jcrespo: mariadb-backups: Setup x2 production backups [puppet] - 10https://gerrit.wikimedia.org/r/649820 (https://phabricator.wikimedia.org/T269324) (owner: 10Jcrespo)
[13:20:29] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 100%: Reboot T272121', diff saved to https://phabricator.wikimedia.org/P13927 and previous config saved to /var/cache/conftool/dbconfig/20210122-132028-kormat.json
[13:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:16] <wikibugs>	 (03PS3) 10Elukey: hadoop: make Yarn Spark Shuffle optional [puppet] - 10https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411)
[13:21:32] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host puppetboard2002.codfw.wmnet
[13:21:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:43] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27623/console" [puppet] - 10https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[13:24:54] <wikibugs>	 (03PS4) 10Elukey: hadoop: make Yarn Spark Shuffle optional [puppet] - 10https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411)
[13:25:06] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:27:24] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27624/console" [puppet] - 10https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[13:28:17] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] hadoop: make Yarn Spark Shuffle optional [puppet] - 10https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[13:29:41] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13929 and previous config saved to /var/cache/conftool/dbconfig/20210122-132939-kormat.json
[13:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:02] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:30:04] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:30:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P13930 and previous config saved to /var/cache/conftool/dbconfig/20210122-133044-marostegui.json
[13:30:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:00] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:06] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:06] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:15] <marostegui>	 !log Stop replication on db1121
[13:31:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:20] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:30] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:30] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:02] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004848 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:33:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121', diff saved to https://phabricator.wikimedia.org/P13931 and previous config saved to /var/cache/conftool/dbconfig/20210122-133341-marostegui.json
[13:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:14] <wikibugs>	 (03PS1) 10Elukey: profile::prometheus::analytics: add metrics for the Backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/657810 (https://phabricator.wikimedia.org/T260411)
[13:35:40] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.925 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[13:36:30] <wikibugs>	 (03PS1) 10Muehlenhoff: Add puppetboard[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/657812 (https://phabricator.wikimedia.org/T264276)
[13:37:38] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:40:22] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::prometheus::analytics: add metrics for the Backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/657810 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[13:40:35] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6736006, @Gilles wrote: > Awesome, glad to see that the bisecting paid off! Still 361 commits between thos...
[13:44:45] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13932 and previous config saved to /var/cache/conftool/dbconfig/20210122-134444-kormat.json
[13:44:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:17] <wikibugs>	 (03PS6) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785
[13:45:54] <wikibugs>	 (CR) Klausman: role::ml-serve: Add ml-serve machine role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657785 (owner: Klausman)
[13:53:19] <wikibugs>	 (CR) Elukey: "Is role::ml_serve.pp already in puppet? Also left some comments :)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/657785 (owner: Klausman)
[13:57:16] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add puppetboard[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/657812 (https://phabricator.wikimedia.org/T264276) (owner: 10Muehlenhoff)
[14:04:07] <wikibugs>	 (03PS1) 10Elukey: role::analytics_backup_cluster::hadoop::master: use kerberos [puppet] - 10https://gerrit.wikimedia.org/r/657816
[14:05:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_backup_cluster::hadoop::master: use kerberos [puppet] - 10https://gerrit.wikimedia.org/r/657816 (owner: 10Elukey)
[14:09:12] <wikibugs>	 (03PS1) 10Ottomata: Revert "Render kafka cluster connection info in helmfile-defaults/general-*.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/657827
[14:10:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "Render kafka cluster connection info in helmfile-defaults/general-*.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/657827 (owner: 10Ottomata)
[14:15:22] <wikibugs>	 (03PS2) 10Ottomata: Revert "Render kafka cluster connection info in helmfile-defaults/general-*.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/657827
[14:16:00] <wikibugs>	 (03PS3) 10Ottomata: Revert "Render kafka cluster connection info in helmfile-defaults..." [puppet] - 10https://gerrit.wikimedia.org/r/657827
[14:17:09] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Revert "Render kafka cluster connection info in helmfile-defaults..." [puppet] - 10https://gerrit.wikimedia.org/r/657827 (owner: 10Ottomata)
[14:17:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "Render kafka cluster connection info in helmfile-defaults..." [puppet] - 10https://gerrit.wikimedia.org/r/657827 (owner: 10Ottomata)
[14:18:45] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Remove mariadb module mysqld_safe [puppet] - 10https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559)
[14:19:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:20:20] <wikibugs>	 (03PS7) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785
[14:20:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman)
[14:20:56] <wikibugs>	 (CR) Klausman: "> Patch Set 6:" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/657785 (owner: Klausman)
[14:21:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:24:19] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Remove mariadb module mylvmbackup [puppet] - 10https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559)
[14:26:30] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jcrespo)
[14:27:51] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jcrespo)
[14:29:01] <wikibugs>	 (03PS4) 10Ottomata: Revert "Render kafka cluster connection info in helmfile-defaults..." [puppet] - 10https://gerrit.wikimedia.org/r/657827
[14:29:29] <wikibugs>	 (CR) Jbond: "updated thanks" (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond)
[14:30:16] <wikibugs>	 (03PS8) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785
[14:30:25] <wikibugs>	 (03PS2) 10Jbond: icinga: add wait_for_optimal function [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385
[14:30:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/657786 (owner: 10Muehlenhoff)
[14:30:38] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:36:56] <wikibugs>	 10SRE, 10Traffic, 10serviceops: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (10CDanis) >>! In T272633#6768020, @JMeybohm wrote: > Unfortunately, upstream was not very responsive on my question about adding `Cache-Control` (https://github.com/...
[14:37:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga: add wait_for_optimal function [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond)
[14:37:22] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:43:28] <wikibugs>	 (03PS4) 10Jbond: dns:  update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141
[14:43:35] <wikibugs>	 (CR) Jbond: "updated thanks" (4 comments) [software/pywmflib] - https://gerrit.wikimedia.org/r/656141 (owner: Jbond)
[14:45:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dns:  update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond)
[14:46:45] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Revert "Render kafka cluster connection info in helmfile-defaults..." [puppet] - 10https://gerrit.wikimedia.org/r/657827 (owner: 10Ottomata)
[14:58:30] <wikibugs>	 (03PS4) 10Andrew Bogott: Nova vendordata: move a bunch of file writes from boot script to cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657721 (https://phabricator.wikimedia.org/T271273)
[14:59:22] <wikibugs>	 (03PS1) 10Gehel: query_service: fix failing WDQS SPARQL icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713)
[15:03:22] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "As initial set up looks good to me. I had a chat with Tobias about having a different namespace for the cluster, like 'machine_learning' e" [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman)
[15:07:06] <wikibugs>	 (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: ensure HDFS directories [puppet] - 10https://gerrit.wikimedia.org/r/657850
[15:09:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::coordinator: ensure HDFS directories [puppet] - 10https://gerrit.wikimedia.org/r/657850 (owner: 10Elukey)
[15:11:04] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.033 second response time on 10.192.48.54 port 9042 https://phabricator.wikimedia.org/T93886
[15:11:46] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-b valid until 2022-06-15 10:34:44 +0000 (expires in 508 days) https://phabricator.wikimedia.org/T120662
[15:11:48] <wikibugs>	 (03PS1) 10Ottomata: eventstreams - move client_ip_connection_limit setting to helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/657852 (https://phabricator.wikimedia.org/T269160)
[15:11:52] <icinga-wm>	 RECOVERY - cassandra-b service on restbase2009 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:12:19] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman)
[15:13:14] <wikibugs>	 (03CR) 10Elukey: "didn't add it in time but https://puppet-compiler.wmflabs.org/compiler1003/27626/ looks good as well :)" [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman)
[15:16:35] <wikibugs>	 (CR) Gehel: query_service: fix failing WDQS SPARQL icinga check. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) (owner: Gehel)
[15:18:08] <wikibugs>	 (03CR) 10ZPapierski: [C: 03+1] query_service: fix failing WDQS SPARQL icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) (owner: 10Gehel)
[15:23:29] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventstreams - move client_ip_connection_limit setting to helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/657852 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata)
[15:23:41] <wikibugs>	 (CR) DCausse: [C: +1] query_service: fix failing WDQS SPARQL icinga check. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) (owner: Gehel)
[15:24:21] <moritzm>	 !log installing puppetboard2002
[15:24:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:46] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add a linkrecommendation-external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603)
[15:25:00] <wikibugs>	 (CR) Gehel: query_service: fix failing WDQS SPARQL icinga check. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) (owner: Gehel)
[15:25:53] <wikibugs>	 (03PS2) 10Gehel: query_service: fix failing WDQS SPARQL icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713)
[15:26:01] <wikibugs>	 (03PS3) 10Gehel: query_service: fix failing WDQS SPARQL icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713)
[15:29:00] <wikibugs>	 (03PS1) 10Elukey: role::ml_serve: fix hiera filename [puppet] - 10https://gerrit.wikimedia.org/r/657857
[15:29:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::ml_serve: fix hiera filename [puppet] - 10https://gerrit.wikimedia.org/r/657857 (owner: 10Elukey)
[15:29:48] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] query_service: fix failing WDQS SPARQL icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) (owner: 10Gehel)
[15:30:23] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:02] <wikibugs>	 (03PS2) 10Bstorm: wikireplicas: remove query killer from dedicated replica server [puppet] - 10https://gerrit.wikimedia.org/r/655093 (https://phabricator.wikimedia.org/T269211)
[15:31:40] <wikibugs>	 (03CR) 10Jbond: "recheck" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond)
[15:33:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Adding the team for input on the approach:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris)
[15:34:14] <wikibugs>	 (03PS1) 10Elukey: cumin: add aliases for the ml-serve cluster [puppet] - 10https://gerrit.wikimedia.org/r/657858
[15:34:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dns:  update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond)
[15:36:13] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:50] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:14] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1003 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:14] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1008 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:14] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1011 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:15] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:16] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2005 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:17] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2006 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:18] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2008 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:19] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:20] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:21] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:22] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:23] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:23] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) Adding https://metallb.universe.tf/ as a potential solution as well.
[15:37:24] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2001 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:25] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2002 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:26] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:37:32] <gehel>	 WDQS SPARQL errors are me, checking for a typo
[15:38:18] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] "This probably won't *remove* anything from the existing setup, but when things are rebuild multi-instance, it won't get installed on it." [puppet] - 10https://gerrit.wikimedia.org/r/655093 (https://phabricator.wikimedia.org/T269211) (owner: 10Bstorm)
[15:38:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, picking a canary won't hurt, though?" [puppet] - 10https://gerrit.wikimedia.org/r/657858 (owner: 10Elukey)
[15:38:51] <wikibugs>	 10SRE, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10Dwisehaupt) This has been pushed to the frack puppet instance and will be rolling across hosts in the next few minutes.
[15:39:47] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/656187 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:40:08] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] interface_automation: Clean up old interfaces on run [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov)
[15:40:36] <wikibugs>	 (03CR) 10Elukey: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/657858 (owner: 10Elukey)
[15:40:52] <moritzm>	 !log installing puppetboard1002
[15:40:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:15] <wikibugs>	 (03PS1) 10Andrew Bogott: nova vendordata: fix issues with the cloud-init apt config [puppet] - 10https://gerrit.wikimedia.org/r/657859 (https://phabricator.wikimedia.org/T271273)
[15:41:17] <wikibugs>	 (CR) Ema: [C: -1] varnish: Set debug=1 in X-Analytics header (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: Effie Mouzeli)
[15:41:25] <wikibugs>	 (03PS4) 10CRusnov: interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207
[15:41:30] <wikibugs>	 (03PS2) 10Elukey: cumin: add aliases for the ml-serve cluster [puppet] - 10https://gerrit.wikimedia.org/r/657858
[15:41:36] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Nova vendordata: move a bunch of file writes from boot script to cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657721 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[15:42:19] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: fix issues with the cloud-init apt config [puppet] - 10https://gerrit.wikimedia.org/r/657859 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott)
[15:42:26] <wikibugs>	 (03PS2) 10Andrew Bogott: nova vendordata: fix issues with the cloud-init apt config [puppet] - 10https://gerrit.wikimedia.org/r/657859 (https://phabricator.wikimedia.org/T271273)
[15:43:15] <wikibugs>	 (03PS1) 10Ottomata: eventstreams-internal - only 1 replica needed in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/657860 (https://phabricator.wikimedia.org/T269160)
[15:44:21] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] cumin: add aliases for the ml-serve cluster [puppet] - 10https://gerrit.wikimedia.org/r/657858 (owner: 10Elukey)
[15:45:59] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventstreams-internal - only 1 replica needed in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/657860 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata)
[15:51:15] <wikibugs>	 (03PS1) 10Gehel: query_service: use a SPARQL query that is agnostic of the updater. [puppet] - 10https://gerrit.wikimedia.org/r/657861 (https://phabricator.wikimedia.org/T272713)
[15:54:17] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
[15:54:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:03] <wikibugs>	 (CR) DCausse: [C: +1] query_service: use a SPARQL query that is agnostic of the updater. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657861 (https://phabricator.wikimedia.org/T272713) (owner: Gehel)
[15:56:13] <wikibugs>	 (03PS2) 10Gehel: query_service: use a SPARQL query that is agnostic of the updater. [puppet] - 10https://gerrit.wikimedia.org/r/657861 (https://phabricator.wikimedia.org/T272713)
[15:58:19] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] query_service: use a SPARQL query that is agnostic of the updater. [puppet] - 10https://gerrit.wikimedia.org/r/657861 (https://phabricator.wikimedia.org/T272713) (owner: 10Gehel)
[16:03:02] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Bstorm)
[16:04:25] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] eventlogging: Remove multiple unused modules [puppet] - 10https://gerrit.wikimedia.org/r/657538 (https://phabricator.wikimedia.org/T272559) (owner: 10Ladsgroup)
[16:09:53] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207 (owner: 10CRusnov)
[16:11:23] <wikibugs>	 (03PS2) 10Jbond: sre.misc-clusters.scb: create batch action cook book for scb [cookbooks] - 10https://gerrit.wikimedia.org/r/657802
[16:13:05] <wikibugs>	 (03PS3) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802
[16:13:17] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Bstorm)
[16:13:36] <wikibugs>	 (03PS1) 10Klausman: admin: Let the ml-team admins become root on ml-serve* [puppet] - 10https://gerrit.wikimedia.org/r/657864
[16:16:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond)
[16:19:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:19:34] <jynus>	 !log restart of backup source hosts on codfw T271913
[16:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:52] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:12] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.069 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:12] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.117 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:12] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:13] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.227 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:14] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:15] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:16] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.193 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:17] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:44] <wikibugs>	 (PS1) Andrew Bogott: Update eqiad1 designate to version 'train' [puppet] - https://gerrit.wikimedia.org/r/657866 (https://phabricator.wikimedia.org/T261135)
[16:26:31] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Update eqiad1 designate to version 'train' [puppet] - https://gerrit.wikimedia.org/r/657866 (https://phabricator.wikimedia.org/T261135) (owner: Andrew Bogott)
[16:29:49] <wikibugs>	 SRE, vm-requests: eqiad: 1 of VMs requested for Cumin - https://phabricator.wikimedia.org/T272349 (MoritzMuehlenhoff) Open→Resolved This has been created
[16:30:03] <wikibugs>	 SRE, vm-requests: eqiad/codfw: 1 VMs requested for puppetboard - https://phabricator.wikimedia.org/T272683 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff These have been created.
[16:30:23] <wikibugs>	 (CR) Razzi: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[16:30:47] <wikibugs>	 (PS4) Razzi: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596)
[16:30:58] <wikibugs>	 (PS1) David Caro: wmcs.wmcs-enc-cli: Format data before sending [puppet] - https://gerrit.wikimedia.org/r/657869
[16:31:07] <wikibugs>	 (PS1) Andrew Bogott: Move eqiad1/horizon to openstack train [puppet] - https://gerrit.wikimedia.org/r/657870 (https://phabricator.wikimedia.org/T261134)
[16:31:32] <wikibugs>	 (CR) Andrew Bogott: [C: +1] wmcs.wmcs-enc-cli: Format data before sending [puppet] - https://gerrit.wikimedia.org/r/657869 (owner: David Caro)
[16:31:47] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Move eqiad1/horizon to openstack train [puppet] - https://gerrit.wikimedia.org/r/657870 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[16:33:26] <wikibugs>	 SRE, Phatality, observability, Developer Productivity: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes - https://phabricator.wikimedia.org/T237706 (colewhite) Open→Invalid We have upgraded to Kibana 7 which renders this task invalid.  There is still...
[16:33:44] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops: frdev1001 ILO inaccessible - https://phabricator.wikimedia.org/T267969 (Cmjohnson) Open→Resolved It turns out all the ILO network settings were reset, fixed on the console and ILO is accessible.
[16:40:52] <cmjohnson1>	 !log replacing optics/fiber  pfw3a-eqiad:xe-0/0/17 and fasw-c1a-eqiad:xe-0/2/0 T271295
[16:40:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:58] <stashbot>	 T271295: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295
[16:42:22] <wikibugs>	 (CR) Elukey: "This seems good to me, but I think that it would be better if we track this in a task (tagged with SRE-Access-Request) so people are aware" [puppet] - https://gerrit.wikimedia.org/r/657864 (owner: Klausman)
[16:42:59] <wikibugs>	 (CR) Bstorm: "> Patch Set 6: Code-Review-1" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[16:44:08] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@e54225d]: T270411 T270415 T270281 T270277
[16:44:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:20] <stashbot>	 T270277: Add diqwiktionary to RESTBase - https://phabricator.wikimedia.org/T270277
[16:44:20] <stashbot>	 T270415: Add niawiki to RESTBase - https://phabricator.wikimedia.org/T270415
[16:44:20] <stashbot>	 T270411: Add niawiktionary to RESTBase - https://phabricator.wikimedia.org/T270411
[16:44:21] <stashbot>	 T270281: Add bclwiktionary to RESTBase - https://phabricator.wikimedia.org/T270281
[16:44:34] <wikibugs>	 (CR) Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[16:47:10] <wikibugs>	 (CR) David Caro: [C: +2] wmcs.wmcs-enc-cli: Format data before sending [puppet] - https://gerrit.wikimedia.org/r/657869 (owner: David Caro)
[16:48:48] <wikibugs>	 (PS2) Klausman: admin: Let the ml-team admins become root on ml-serve* [puppet] - https://gerrit.wikimedia.org/r/657864 (https://phabricator.wikimedia.org/T272687)
[16:49:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:49:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] admin: Let the ml-team admins become root on ml-serve* [puppet] - https://gerrit.wikimedia.org/r/657864 (https://phabricator.wikimedia.org/T272687) (owner: Klausman)
[16:49:40] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:49:49] <wikibugs>	 (CR) Klausman: "> This seems good to me, but I think that it would be better if we track this in a task (tagged with SRE-Access-Request) so people are awa" [puppet] - https://gerrit.wikimedia.org/r/657864 (https://phabricator.wikimedia.org/T272687) (owner: Klausman)
[16:50:23] <wikibugs>	 (PS3) Klausman: admin: Let the ml-team admins become root on ml-serve* [puppet] - https://gerrit.wikimedia.org/r/657864 (https://phabricator.wikimedia.org/T272687)
[16:51:00] <wikibugs>	 (CR) Klausman: [C: +2] admin: Let the ml-team admins become root on ml-serve* [puppet] - https://gerrit.wikimedia.org/r/657864 (https://phabricator.wikimedia.org/T272687) (owner: Klausman)
[16:51:52] <wikibugs>	 SRE, ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (Cmjohnson) Open→Resolved Everything is replaced, kept the same cable number if that matters.
[16:57:41] <wikibugs>	 (PS7) Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476)
[17:01:43] <wikibugs>	 (PS1) Mforns: analytics:refinery: Bump up Refine/DruidLoad jar versions to 0.0.145 [puppet] - https://gerrit.wikimedia.org/r/657875
[17:02:40] <wikibugs>	 (CR) Bstorm: "> Patch Set 6:" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[17:05:59] <wikibugs>	 (CR) Ottomata: [C: +2] analytics:refinery: Bump up Refine/DruidLoad jar versions to 0.0.145 [puppet] - https://gerrit.wikimedia.org/r/657875 (owner: Mforns)
[17:12:28] <wikibugs>	 (PS1) Jbond: sre.apt.audit: add script to audit manually installed package [cookbooks] - https://gerrit.wikimedia.org/r/657877
[17:13:04] <logmsgbot>	 !log mforns@deploy1001 Started deploy [analytics/refinery@eea071d]: Extra bug-fix train [analytics/refinery@eea071def90a8a856b1e04dda23b77a850134253]
[17:13:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:31] <wikibugs>	 (CR) Vgutierrez: "> Patch Set 7:" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[17:19:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:19:32] <wikibugs>	 SRE, SRE-Access-Requests: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (Dzahn) @jcrespo Thank you :)
[17:19:50] <wikibugs>	 SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (Dzahn) @akosiaris Thank you!:)
[17:19:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:21:51] <wikibugs>	 (PS2) Jbond: sre.apt.audit: add script to audit manually installed package [cookbooks] - https://gerrit.wikimedia.org/r/657877
[17:22:36] <wikibugs>	 (CR) Dzahn: "thanks for the fix .. and the revert" [puppet] - https://gerrit.wikimedia.org/r/657783 (https://phabricator.wikimedia.org/T272555) (owner: Jcrespo)
[17:23:07] <logmsgbot>	 !log mforns@deploy1001 Finished deploy [analytics/refinery@eea071d]: Extra bug-fix train [analytics/refinery@eea071def90a8a856b1e04dda23b77a850134253] (duration: 10m 03s)
[17:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:27] <icinga-wm>	 PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:24:25] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:25:20] <wikibugs>	 (PS3) Jbond: sre.apt.audit: add script to audit manually installed package [cookbooks] - https://gerrit.wikimedia.org/r/657877
[17:25:45] <icinga-wm>	 RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:28:39] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:29:40] <logmsgbot>	 !log mforns@deploy1001 Started deploy [analytics/refinery@eea071d] (thin): Extra bug-fix train THIN [analytics/refinery@eea071def90a8a856b1e04dda23b77a850134253]
[17:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:48] <logmsgbot>	 !log mforns@deploy1001 Finished deploy [analytics/refinery@eea071d] (thin): Extra bug-fix train THIN [analytics/refinery@eea071def90a8a856b1e04dda23b77a850134253] (duration: 00m 07s)
[17:29:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:31] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:30:46] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:31:14] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:31:32] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:32:52] <wikibugs>	 (CR) Bstorm: "> Patch Set 7:" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[17:35:19] <wikibugs>	 (CR) Vgutierrez: wikireplicas: set up LVS for multiinstance wikireplicas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[17:38:47] <wikibugs>	 (CR) Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[17:49:32] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2358.codfw.wmnet with reason: REIMAGE
[17:49:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:45] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@e54225d]: T270411 T270415 T270281 T270277 (duration: 65m 37s)
[17:49:47] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2360.codfw.wmnet with reason: REIMAGE
[17:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:50] <stashbot>	 T270277: Add diqwiktionary to RESTBase - https://phabricator.wikimedia.org/T270277
[17:49:51] <stashbot>	 T270415: Add niawiki to RESTBase - https://phabricator.wikimedia.org/T270415
[17:49:51] <stashbot>	 T270411: Add niawiktionary to RESTBase - https://phabricator.wikimedia.org/T270411
[17:49:51] <stashbot>	 T270281: Add bclwiktionary to RESTBase - https://phabricator.wikimedia.org/T270281
[17:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:14] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2362.codfw.wmnet with reason: REIMAGE
[17:50:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:35] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2364.codfw.wmnet with reason: REIMAGE
[17:50:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:42] <wikibugs>	 (Abandoned) Ottomata: eventgate-{main,analytics,logging-external} - bump to 2020-12-02-151648-production [deployment-charts] - https://gerrit.wikimedia.org/r/644922 (https://phabricator.wikimedia.org/T266573) (owner: Ottomata)
[17:52:07] <mutante>	 !log releases2001 - create new partition table with fdisk, make ext4 filesystem on /dev/vdb1
[17:52:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:12] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2362.codfw.wmnet with reason: REIMAGE
[17:52:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:23] <wikibugs>	 (PS1) Ottomata: eventgate-main - bump to 2021-01-22-173634-production [deployment-charts] - https://gerrit.wikimedia.org/r/657885 (https://phabricator.wikimedia.org/T262226)
[17:52:38] <wikibugs>	 (CR) Ottomata: "To deploy on Monday." [deployment-charts] - https://gerrit.wikimedia.org/r/657885 (https://phabricator.wikimedia.org/T262226) (owner: Ottomata)
[17:53:59] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2358.codfw.wmnet with reason: REIMAGE
[17:54:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:47] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2360.codfw.wmnet with reason: REIMAGE
[17:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:35] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2364.codfw.wmnet with reason: REIMAGE
[17:55:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:10] <wikibugs>	 (PS4) Jbond: (WIP) sre.apt.audit: produce a report of manually packages [cookbooks] - https://gerrit.wikimedia.org/r/657877
[17:57:43] <mutante>	 !log releases1002 (releases.wm.org active backend) - rebooting - hopefully it does not run into T272555 but if it does now it's known how to fix
[17:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:47] <stashbot>	 T272555: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555
[17:58:51] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:59:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] (WIP) sre.apt.audit: produce a report of manually packages [cookbooks] - https://gerrit.wikimedia.org/r/657877 (owner: Jbond)
[17:59:12] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2360.codfw.wmnet with reason: new install on buster
[17:59:13] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2360.codfw.wmnet with reason: new install on buster
[17:59:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:05] <icinga-wm>	 PROBLEM - Host releases1002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:01:02] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on releases1002.eqiad.wmnet with reason: fixing networking - added disk
[18:01:02] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on releases1002.eqiad.wmnet with reason: fixing networking - added disk
[18:01:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:37] <marxarelli>	 mutante: same issue as with releases2002?
[18:02:19] <icinga-wm>	 RECOVERY - Host releases1002 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[18:02:39] <mutante>	 marxarelli: yes, same issue. but thanks to alex I know the fix and applied it
[18:03:06] <marxarelli>	 ack
[18:03:07] <mutante>	 !log releases1002 - replaced ens5 with ens6 in /etc/network/interfaces and rebooted again
[18:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:19] <mutante>	 marxarelli: releases2002 already has the new disk mounted at /srv/docker
[18:05:09] <mutante>	 and ... same now on releases1002. created ext4 fs and mounted
[18:05:11] <marxarelli>	 got it. that works. i'll add the right `profile::docker::settings` then
[18:05:26] <marxarelli>	 ty!
[18:05:55] <mutante>	 I think when we replace these VMs, we will just make larger disks, so we don't need to worry more about it
[18:06:05] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2364 is CRITICAL: Host mw2364 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[18:06:06] <mutante>	 except adding to fstab so they don't disappear after reboot.. doing that
[18:06:44] <wikibugs>	 SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (calbon) kbazira needs access too
[18:07:40] <mutante>	 it looks like 147 / 140 free after i told Ganeti i want 150G, it's 1000 vs 1024
[18:07:53] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is OK: TCP OK - 0.034 second response time on 10.192.48.55 port 9042 https://phabricator.wikimedia.org/T93886
[18:08:17] <icinga-wm>	 PROBLEM - Apache HTTP on mw2364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[18:09:12] <RhinosF1>	 mutante: does that want downtiming if it's you doing buster ^
[18:10:21] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2364 is CRITICAL: Host mw2364 is not in mediawiki-installation dsh group daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[18:10:21] <icinga-wm>	 ACKNOWLEDGEMENT - Apache HTTP on mw2364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Application_servers
[18:10:21] <icinga-wm>	 ACKNOWLEDGEMENT - Host mw2364 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reimaging
[18:10:45] <icinga-wm>	 RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:10:52] <mutante>	 RhinosF1: yea, the thing is that maybe 1 out of 10 cases setting the downtime fails. fixed!
[18:10:59] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2022-06-15 10:34:47 +0000 (expires in 508 days) https://phabricator.wikimedia.org/T120662
[18:11:06] <RhinosF1>	 mutante: np!
[18:11:29] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[18:11:31] <icinga-wm>	 RECOVERY - Apache HTTP on mw2364 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[18:11:42] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2358.codfw.wmnet'] `  an...
[18:11:58] <mutante>	 talking about icinga alerts.. because I am looking at the web UI.. what about all the wdqs alerts about SPARQL 
[18:12:00] <marxarelli>	 mutante: 140G is great :) thank you
[18:12:11] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2360.codfw.wmnet'] `  an...
[18:12:20] <mutante>	 marxarelli: you're welcome, i'll give the ticket back in a moment
[18:12:20] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2362.codfw.wmnet'] `  an...
[18:12:48] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2364.codfw.wmnet'] `  an...
[18:17:45] <mutante>	 !log releases2002 - rebooting to confirm works now and also new disk gets auto-mounted
[18:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:02] <wikibugs>	 (PS5) Dduvall: releases: Provide docker to PipelineLib based jobs [puppet] - https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477)
[18:20:30] <wikibugs>	 (PS1) Jbond: enable-puppet: allow fall back to enable puppet disabled by root [puppet] - https://gerrit.wikimedia.org/r/657889 (https://phabricator.wikimedia.org/T272539)
[18:20:38] <mutante>	 marxarelli: confirmed releases2002 survives reboot and the new disk gets mounted automatically. done
[18:20:47] <marxarelli>	 \o/
[18:21:04] <marxarelli>	 i've tweaked the puppet patch to set `data-dir: /srv/docker`
[18:25:42] <mutante>	 great! *nod*
[18:28:03] <wikibugs>	 SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (Dzahn) releases1002 had the exact same issue.. so confirmed it was caused by adding the new disk.  The same fix (ens5->ens6) also resolved it again.
[18:28:59] <wikibugs>	 (PS1) Bstorm: data-services: apply user variances to future creations [puppet] - https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399)
[18:29:41] <wikibugs>	 Puppet, SRE, Patch-For-Review: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (jbond) i sent a quick patch for this.  in an ideal world it would be good to make it so only cumin* can disable puppet as root and when it does so it also appends the user string (simi...
[18:33:04] <wikibugs>	 (CR) Dzahn: [C: +2] releases: Provide docker to PipelineLib based jobs [puppet] - https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: Dduvall)
[18:34:02] <wikibugs>	 (CR) Dzahn: [C: +2] "I created /srv/docker." [puppet] - https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: Dduvall)
[18:35:40] <mutante>	 marxarelli: oops, Function lookup() did not find a value for the name 'profile::docker::engine::declare_service'
[18:36:05] <marxarelli>	 mutante: oh shoot. ok
[18:36:06] * marxarelli looks
[18:37:24] <mutante>	 looks like i touched that last, heh
[18:37:34] <mutante>	 but just hiera->lookup replacement
[18:38:10] <marxarelli>	 let's see what it is for contint1001...
[18:38:19] <mutante>	 marxarelli: i guess it should be false
[18:38:21] <mutante>	 ok
[18:38:26] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/pe
[18:38:26] <icinga-wm>	 t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:38:36] <mutante>	 "# We want this to be on if we want to use a different docker systemd service (with flannel support, for eg.)"
[18:38:38] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:38:53] <marxarelli>	 that seems right
[18:39:02] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[18:39:14] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikime
[18:39:14] <icinga-wm>	 ews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:39:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:39:56] <wikibugs>	 (PS1) Jgreen: replace check_swap with check_memory globally in nsca_frack.cfg.erb [puppet] - https://gerrit.wikimedia.org/r/657893
[18:39:59] <mutante>	 marxarelli: somehow this is not set on contint and i just see it on kubernetes
[18:40:20] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:40:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:40:44] <marxarelli>	 i see it declared in `hieradata/role/common/builder.yaml` too
[18:40:58] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:41:03] <marxarelli>	 but i don't think that applies to contints
[18:41:11] <mutante>	 contint is not using the docker::engine class  
[18:41:17] <mutante>	 apparently
[18:41:23] <marxarelli>	 ok
[18:41:32] <marxarelli>	 i'll just set it to false for releases
[18:41:38] <marxarelli>	 that should be fine
[18:41:39] <mutante>	 yea, let's do that
[18:41:45] <mutante>	 are you making a patch?
[18:42:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:42:33] <marxarelli>	 yep!
[18:43:03] <mutante>	 cool, will merge, be back in a minute
[18:43:29] <wikibugs>	 (PS1) Andrew Bogott: Designate monitoring: make regexp for designate-api detection more lenient [puppet] - https://gerrit.wikimedia.org/r/657894
[18:44:09] <wikibugs>	 (CR) Jgreen: [C: +2] replace check_swap with check_memory globally in nsca_frack.cfg.erb [puppet] - https://gerrit.wikimedia.org/r/657893 (owner: Jgreen)
[18:44:18] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Designate monitoring: make regexp for designate-api detection more lenient [puppet] - https://gerrit.wikimedia.org/r/657894 (owner: Andrew Bogott)
[18:44:39] <wikibugs>	 (PS1) Dduvall: releases: Set declare_service: false for docker [puppet] - https://gerrit.wikimedia.org/r/657895 (https://phabricator.wikimedia.org/T271477)
[18:45:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:45:28] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[18:45:28] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikime
[18:45:28] <icinga-wm>	 ews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:45:54] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:45:55] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2358.codfw.wmnet
[18:45:55] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2360.codfw.wmnet
[18:45:55] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2364.codfw.wmnet
[18:45:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:16] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2362.codfw.wmnet
[18:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:36] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[18:46:46] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2358.codfw.wmnet
[18:46:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:02] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2360.codfw.wmnet
[18:47:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:26] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:47:26] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[18:47:36] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2362.codfw.wmnet
[18:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:43] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2364.codfw.wmnet
[18:47:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:48] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:49:09] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:50:03] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:51:13] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:51:57] <wikibugs>	 (CR) Dzahn: [C: +2] releases: Set declare_service: false for docker [puppet] - https://gerrit.wikimedia.org/r/657895 (https://phabricator.wikimedia.org/T271477) (owner: Dduvall)
[18:52:36] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/pe
[18:52:36] <icinga-wm>	 t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:53:10] <mutante>	 Docker::Configuration/File[/etc/docker]/ensure: created
[18:53:46] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:55:27] <marxarelli>	 mutante: great! and looks like the jenkins agent has access to the socket so we're good to go
[18:57:10] <mutante>	 marxarelli: can confirm there is now dockerd and docker-containerd running on both releases*
[18:57:54] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikime
[18:57:54] <icinga-wm>	 ews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[18:58:03] <elukey>	 we are checking AQS --^
[18:58:20] <mutante>	 elukey: thank you
[18:58:25] <mutante>	 marxarelli: claimed T208529 is resolved
[18:58:26] <stashbot>	 T208529: Install docker on releases-jenkins - https://phabricator.wikimedia.org/T208529
[18:58:36] <wikibugs>	 SRE, SRE-tools: Use static PHIDs instead of fragile Phab project names in in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (jcrespo) p:Triage→Low Marking it as low, hopefully to be done at some point, but I couldn't find the time this week, and it is unlikely to...
[18:59:01] <marxarelli>	 mutante: ack. ty!
[19:00:54] <icinga-wm>	 PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:42] <mutante>	 oh.. let's see ^
[19:02:20] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.w
[19:02:20] <icinga-wm>	 /Services/Monitoring/aqs
[19:02:30] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:03:36] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:03:37] <mutante>	 marxarelli: grbml...  ifup@ens6.service loaded failed failed ifup for ens6
[19:03:40] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:03:49] <mutante>	 that is related to the fix with the renamed interface 
[19:03:50] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[19:04:00] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 12.85 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[19:04:03] <mutante>	 and shows up now after docker was added.. hrmm
[19:05:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:05:40] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is CRITICAL: connect to address 10.64.48.148 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[19:06:32] <icinga-wm>	 PROBLEM - cassandra-a service on aqs1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:06:48] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2364 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:07:04] <icinga-wm>	 PROBLEM - Check systemd state on aqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:07:08] <marxarelli>	 mutante: that's odd. and releases2002 seems fine
[19:07:18] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:07:49] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2356.codfw.wmnet with reason: REIMAGE
[19:07:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:09] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2354.codfw.wmnet with reason: REIMAGE
[19:08:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:43] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (KFrancis) @jcrespo I am confirming the NDA is fully executed.  Please proceed with the access request.  Thanks!
[19:09:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2352.codfw.wmnet with reason: REIMAGE
[19:09:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:50] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2356.codfw.wmnet with reason: REIMAGE
[19:09:54] <mutante>	 !log releases1002 systemctl reset-failed
[19:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:16] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2350.codfw.wmnet with reason: REIMAGE
[19:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:46] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2354.codfw.wmnet with reason: REIMAGE
[19:11:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:48] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 10.86 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[19:13:33] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2352.codfw.wmnet with reason: REIMAGE
[19:13:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:16] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2350.codfw.wmnet with reason: REIMAGE
[19:15:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:19:06] <icinga-wm>	 PROBLEM - Memcached on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached
[19:19:58] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:22:50] <icinga-wm>	 ACKNOWLEDGEMENT - Memcached on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Memcached
[19:29:36] <icinga-wm>	 RECOVERY - cassandra-a service on aqs1006 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:30:22] <icinga-wm>	 RECOVERY - Check systemd state on aqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:30:26] <icinga-wm>	 RECOVERY - Memcached on mw2350 is OK: TCP OK - 0.034 second response time on 10.192.32.200 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[19:30:43] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2354.codfw.wmnet'] `  an...
[19:31:15] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2356.codfw.wmnet'] `  an...
[19:31:32] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2352.codfw.wmnet'] `  an...
[19:32:20] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2350.codfw.wmnet'] `  an...
[19:34:43] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2356.codfw.wmnet
[19:34:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2354.codfw.wmnet
[19:35:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:31] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2350.codfw.wmnet
[19:35:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:46] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is OK: TCP OK - 0.000 second response time on 10.64.48.148 port 9042 https://phabricator.wikimedia.org/T93886
[19:35:49] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2352.codfw.wmnet
[19:35:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:16] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2350.codfw.wmnet
[19:36:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:06] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:30] <wikibugs>	 Puppet, SRE, Patch-For-Review: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (jbond) >>! In T272539#6769671, @jbond wrote: > i sent a quick patch for this.  in an ideal world it would be good to make it so only cumin* can disable puppet as root and when it does...
[19:38:17] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2352.codfw.wmnet
[19:38:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:27] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2354.codfw.wmnet
[19:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:59] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2356.codfw.wmnet
[19:39:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:42] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:41:00] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:41:08] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:41:27] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:41:32] <icinga-wm>	 RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:43:35] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:45:04] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:49:30] <wikibugs>	 SRE, observability: Grafana error: "parse error at char 1: unexpected character: '\\ufeff'" when copy-pasting metric names - https://phabricator.wikimedia.org/T263624 (colewhite) I was able to reproduce this issue with select+middle click.  Upstream has an issue on file: https://github.com/grafana/grafan...
[19:53:10] <wikibugs>	 (PS2) Legoktm: docker_registry_ha: Add timestamp to build-homepage output [puppet] - https://gerrit.wikimedia.org/r/657678 (https://phabricator.wikimedia.org/T179696)
[19:55:22] <wikibugs>	 (CR) Legoktm: [C: +2] docker_registry_ha: Add timestamp to build-homepage output [puppet] - https://gerrit.wikimedia.org/r/657678 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[19:55:36] <wikibugs>	 (CR) Ayounsi: [C: +2] Add Lumen transit BGP in codfw [homer/public] - https://gerrit.wikimedia.org/r/657777 (https://phabricator.wikimedia.org/T269808) (owner: Ayounsi)
[19:56:27] <wikibugs>	 (Merged) jenkins-bot: Add Lumen transit BGP in codfw [homer/public] - https://gerrit.wikimedia.org/r/657777 (https://phabricator.wikimedia.org/T269808) (owner: Ayounsi)
[19:57:50] <wikibugs>	 SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (Dzahn) >>! In T272555#6768676, @akosiaris wrote: > let's document this.  added this  https://wikitech.wikimedia.org/w/index.php?title=Ganeti&type=revision&diff=1894909&oldid=1893790
[19:58:26] <wikibugs>	 (PS2) Legoktm: libraryupgrader: Update celery systemd units [puppet] - https://gerrit.wikimedia.org/r/657459
[19:59:44] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2334.codfw.wmnet with reason: REIMAGE
[19:59:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:00] <wikibugs>	 (CR) Legoktm: [C: +2] libraryupgrader: Update celery systemd units [puppet] - https://gerrit.wikimedia.org/r/657459 (owner: Legoktm)
[20:00:01] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2332.codfw.wmnet with reason: REIMAGE
[20:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:11] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2330.codfw.wmnet with reason: REIMAGE
[20:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:29] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2328.codfw.wmnet with reason: REIMAGE
[20:00:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:59] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1413.eqiad.wmnet with reason: REIMAGE
[20:01:00] <wikibugs>	 (CR) ArielGlenn: "> Patch Set 2: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man)
[20:01:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:47] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2334.codfw.wmnet with reason: REIMAGE
[20:01:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:51] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1413.eqiad.wmnet with reason: REIMAGE
[20:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:46] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2328.codfw.wmnet with reason: REIMAGE
[20:03:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:01] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2332.codfw.wmnet with reason: REIMAGE
[20:05:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:11] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2330.codfw.wmnet with reason: REIMAGE
[20:05:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:42] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1268.eqiad.wmnet with reason: REIMAGE
[20:05:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:24] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2330 is CRITICAL: Host mw2330 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[20:06:30] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:06:34] <icinga-wm>	 PROBLEM - Memcached on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached
[20:07:44] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1268.eqiad.wmnet with reason: REIMAGE
[20:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw2330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:12:30] <icinga-wm>	 ACKNOWLEDGEMENT - Memcached on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Memcached
[20:12:30] <icinga-wm>	 ACKNOWLEDGEMENT - PHP7 rendering on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:12:30] <icinga-wm>	 ACKNOWLEDGEMENT - Apache HTTP on mw2330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Application_servers
[20:12:30] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2330 is CRITICAL: Host mw2330 is not in mediawiki-installation dsh group daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[20:12:30] <icinga-wm>	 ACKNOWLEDGEMENT - PHP7 rendering on mw2332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:13:17] <wikibugs>	 (PS2) Legoktm: scap: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655795 (https://phabricator.wikimedia.org/T266479)
[20:20:05] <icinga-wm>	 PROBLEM - Host mw2332 is DOWN: PING CRITICAL - Packet loss = 100%
[20:21:11] <icinga-wm>	 RECOVERY - Host mw2332 is UP: PING OK - Packet loss = 0%, RTA = 31.87 ms
[20:21:31] <icinga-wm>	 PROBLEM - Host mw1413 is DOWN: PING CRITICAL - Packet loss = 100%
[20:21:57] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2334.codfw.wmnet'] `  an...
[20:22:27] <icinga-wm>	 RECOVERY - Host mw1413 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[20:22:29] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2330.codfw.wmnet'] `  an...
[20:22:36] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2328.codfw.wmnet'] `  an...
[20:23:07] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2332.codfw.wmnet'] `  an...
[20:23:51] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1413.eqiad.wmnet'] `  an...
[20:25:33] <icinga-wm>	 RECOVERY - Memcached on mw1413 is OK: TCP OK - 0.000 second response time on 10.64.32.132 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[20:25:53] <wikibugs>	 (PS1) Legoktm: threedtopng: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657903 (https://phabricator.wikimedia.org/T266479)
[20:26:55] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27630/console" [puppet] - https://gerrit.wikimedia.org/r/657903 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[20:28:13] <wikibugs>	 (CR) Legoktm: [C: +2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27629/" [puppet] - https://gerrit.wikimedia.org/r/655795 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[20:32:15] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2332 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:33:41] <wikibugs>	 (PS1) Legoktm: udp2log: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657904 (https://phabricator.wikimedia.org/T266479)
[20:34:29] <icinga-wm>	 RECOVERY - Apache HTTP on mw2330 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:35:16] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27631/console" [puppet] - https://gerrit.wikimedia.org/r/657904 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[20:37:23] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:38:03] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.033 second response time on 10.192.48.56 port 9042 https://phabricator.wikimedia.org/T93886
[20:49:31] <wikibugs>	 (PS1) Ottomata: [WIP] eventgate -  Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237)
[20:50:51] <wikibugs>	 (PS2) Ottomata: [WIP] eventgate -  Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237)
[20:55:48] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1268.eqiad.wmnet'] `  an...
[20:59:17] <wikibugs>	 SRE, fundraising-tech-ops, netops, Patch-For-Review: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (Dwisehaupt) Open→Resolved
[21:05:44] <wikibugs>	 (PS1) Ryan Kemper: wdqs: use envoy listener for wdqs-internal [mediawiki-config] - https://gerrit.wikimedia.org/r/657910
[21:07:02] <wikibugs>	 (CR) jerkins-bot: [V: -1] wdqs: use envoy listener for wdqs-internal [mediawiki-config] - https://gerrit.wikimedia.org/r/657910 (owner: Ryan Kemper)
[21:12:45] <wikibugs>	 (CR) Urbanecm: Create Contact page for Ombuds commission at Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) (owner: Luke081515)
[21:13:20] <wikibugs>	 (CR) Urbanecm: [C: +1] "sounds good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) (owner: Luke081515)
[21:25:14] <wikibugs>	 (Abandoned) Ryan Kemper: wdqs: use envoy listener for wdqs-internal [mediawiki-config] - https://gerrit.wikimedia.org/r/657910 (owner: Ryan Kemper)
[21:46:05] <wikibugs>	 (PS1) Ryan Kemper: wdqs: use envoy for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/657913
[21:56:36] <wikibugs>	 (PS1) Urbanecm: Revert "Revert "[enwiki] Update celebration logo to "option A""" [mediawiki-config] - https://gerrit.wikimedia.org/r/657914 (https://phabricator.wikimedia.org/T272747)
[21:58:14] <wikibugs>	 (PS2) Urbanecm: Revert "Revert "[enwiki] Update celebration logo to "option A""" [mediawiki-config] - https://gerrit.wikimedia.org/r/657914 (https://phabricator.wikimedia.org/T272526)
[22:05:39] <wikibugs>	 (PS1) Legoktm: snapshot: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657916 (https://phabricator.wikimedia.org/T266479)
[22:08:07] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2334 is CRITICAL: Host mw2334 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:08:23] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27634/console" [puppet] - https://gerrit.wikimedia.org/r/657916 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[22:13:49] <wikibugs>	 (PS1) Legoktm: superset: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657917 (https://phabricator.wikimedia.org/T266479)
[22:14:40] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27635/console" [puppet] - https://gerrit.wikimedia.org/r/657917 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[22:21:01] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Wikidata, serviceops, and 3 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (Addshore)
[22:26:41] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1268 is CRITICAL: Host mw1268 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:27:21] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1413 is CRITICAL: Host mw1413 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:30:41] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:37:19] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:38:39] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2332 is CRITICAL: Host mw2332 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:40:20] <wikibugs>	 SRE, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (thcipriani) Open→Resolved
[22:40:24] <wikibugs>	 SRE, MW-on-K8s, Shellbox, serviceops, and 3 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (thcipriani)
[22:41:39] <logmsgbot>	 !log reedy@deploy1001 Synchronized invalid.json: (no justification provided) (duration: 00m 58s)
[22:41:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:47] * Reedy looks at legoktm
[22:41:56] <legoktm>	 uhoh
[22:42:08] <legoktm>	 using sync-file?
[22:42:11] <Reedy>	 yeah
[22:42:28] <Reedy>	 I'm not gonna run scap all of the things
[22:42:36] <legoktm>	 right
[22:42:42] <legoktm>	 but that's a bad regression
[22:42:53] <Reedy>	 one more for luck
[22:43:22] <Reedy>	 legoktm: definitely works for PHP
[22:43:31] <Reedy>	 CalledProcessError: Command 'find -O2 '/srv/mediawiki-staging/invalid.php' -not -type d -name '*.php' -not -name 'autoload_static.php'  -or -name '*.inc' | xargs -n1 -P30 -exec php -l >/dev/null 2>&1' returned non-zero exit status 124
[22:43:31] <Reedy>	 22:43:16 sync-file failed: <CalledProcessError> Command 'find -O2 '/srv/mediawiki-staging/invalid.php' -not -type d -name '*.php' -not -name 'autoload_static.php'  -or -name '*.inc' | xargs -n1 -P30 -exec php -l >/dev/null 2>&1' returned non-zero exit status 124
[22:43:38] <Reedy>	 crap error message, sure, but bails
[22:44:03] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2328 is CRITICAL: Host mw2328 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[22:47:52] <legoktm>	 >>> list(os.walk('invalid.json'))
[22:47:52] <legoktm>	 []
[22:51:12] <Reedy>	 que?
[22:51:17] <legoktm>	 patch incoming
[22:54:14] <Reedy>	 bugsbugsbugs
[23:02:46] <legoktm>	 https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/657921/
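The empty list legoktm pasted at 22:47:52 follows from documented `os.walk` behaviour: when the top path is a regular file rather than a directory, the generator yields nothing. A file collector built only on `os.walk` would therefore silently skip a single-file target, which would be consistent with invalid.json syncing without being checked. The sketch below reproduces that and shows one way to guard against it. The helper name `collect_lintable` and its suffix list are hypothetical, chosen for illustration; this is not the scap code and not necessarily what the linked patch does.

```python
import os
import tempfile


def collect_lintable(path, suffixes=(".php", ".inc", ".json")):
    """Hypothetical helper: list files under `path` that should be linted.

    os.walk() yields nothing when `path` is a regular file, so the
    single-file case is handled explicitly before walking.
    """
    if os.path.isfile(path):
        return [path] if path.endswith(suffixes) else []
    found = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            if name.endswith(suffixes):
                found.append(os.path.join(root, name))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "invalid.json")
    with open(target, "w") as fh:
        fh.write("{ not valid json")

    # Same call as in the log: walking a file path produces no entries.
    walked = list(os.walk(target))
    # With the explicit isfile() branch the single file is still collected.
    collected = collect_lintable(target)
```

Checking `os.path.isfile` first keeps the directory walk unchanged while making the single-file case explicit, so a caller can tell "nothing to lint" apart from "the target was never looked at".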
[23:31:21] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:37:14] <wikibugs>	 (Abandoned) Urbanecm: Revert "Revert "[enwiki] Update celebration logo to "option A""" [mediawiki-config] - https://gerrit.wikimedia.org/r/657914 (https://phabricator.wikimedia.org/T272526) (owner: Urbanecm)
[23:38:11] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:47:47] <icinga-wm>	 RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:53:05] <wikibugs>	 SRE, Wikidata, wdwb-tech-focus: entity/Q64 was indexed in Google, should it have been? - https://phabricator.wikimedia.org/T227246 (Addshore)
[23:54:45] <icinga-wm>	 PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state