[00:00:02] Daimona: holding this window for resolving the current train situation. [00:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210122T0000). [00:00:04] annet and kemayo: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:07] brennen: zuul is killing me [00:00:26] Mmmm, stickers. [00:00:41] above cc: annet, Kemayo, Urbanecm. [00:00:46] I'm here! [00:01:05] currently dealing with a train blocker; hopefully we can get that cleared out in time for backport window to proceed. [00:01:13] apologies for the delay. [00:01:26] No worries [00:01:31] no prob! thanks for the update [00:01:41] PROBLEM - PHP7 rendering on mw2370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:02:33] ^ something went wrong during reimaging.. got it [00:03:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2370.codfw.wmnet with reason: new install on buster [00:03:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2370.codfw.wmnet with reason: new install on buster [00:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:23] Ack brennen, feel free to ping me if deployment can resume. [00:03:39] will do. [00:04:27] My stuff won't be greatly hurt if it has to wait for Monday. [00:06:25] brennen: is there sth I can do to help train? [00:06:40] (03CR) 10Bstorm: "Deployed this to toolsbeta and kicked over one of the api server pods just in case. It didn't seem to care. Merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/639883 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm) [00:06:55] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: AdmissionsConfiguration is GA after 1.17 [puppet] - 10https://gerrit.wikimedia.org/r/639883 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm) [00:07:02] Urbanecm: i think we're good, thank you though [00:07:11] just waiting on zuul [00:07:11] are we just staring really hard at https://integration.wikimedia.org/zuul/ at the moment? That mobilefrontend patch? [00:07:17] yep [00:07:20] pretty much [00:07:27] wondering why it takes so long [00:07:37] brennen: ack :) [00:07:46] thcipriani: correct. [00:07:47] also watching hundreds thousands of errors coming in on my other tab [00:08:00] ugh [00:08:44] Force merge is a thing if we can't wait [00:08:50] (03CR) 10Bstorm: "I don't think we need to restart the apiservers just now, but they will be fine when they do restart. This will also keep things sane when" [puppet] - 10https://gerrit.wikimedia.org/r/639883 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm) [00:10:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2370.codfw.wmnet [00:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:32] (03Merged) 10jenkins-bot: Fix toggling storage cleanup [extensions/MobileFrontend] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657652 (https://phabricator.wikimedia.org/T272638) (owner: 10Brennen Bearnes) [00:11:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [00:13:00] brennen: finally! 
[00:13:01] ACKNOWLEDGEMENT - Host releases2002 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T272555 [00:14:06] (03PS2) 10Dzahn: Remove obsolete role installserver::apt [puppet] - 10https://gerrit.wikimedia.org/r/657569 (https://phabricator.wikimedia.org/T272559) (owner: 10Muehlenhoff) [00:14:08] (03PS1) 10Dzahn: admin: update SSH key for Volker_E [puppet] - 10https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) [00:14:36] brennen: ready to test [00:14:53] Jdlrobson: syncing, one moment [00:15:04] (03PS1) 10Daimona Eaytoy: Adjust AF config for ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657706 [00:15:30] Jdlrobson: should be on mwdebug1002 [00:15:56] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2372.codfw.wmnet'] ` an... [00:16:05] RECOVERY - PHP7 rendering on mw2370 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [00:16:46] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2370.codfw.wmnet'] ` an... [00:16:47] Daimona_: I recommend clarifying the commit message if possible and linking that private task in commit message (assuming it's related) [00:18:23] brennen: testing [00:18:26] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2372.codfw.wmnet [00:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:14] Yeah it is [00:19:14] brennen: please sync! 
[00:19:32] Jdlrobson: thanks, syncing [00:19:41] Daimona_: will sync it then once deployment resumes (should happen soon) [00:19:58] (03PS2) 10Daimona Eaytoy: Adjust AF config for ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657706 (https://phabricator.wikimedia.org/T272330) [00:20:20] Thank you :) [00:20:38] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [00:20:42] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/MobileFrontend: Backport: [[gerrit:657702|Fix toggling storage cleanup (T272638)]] (duration: 01m 07s) [00:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:47] T272638: TypeError: null is not an object (evaluating 't[e.title]') on mobile domain - https://phabricator.wikimedia.org/T272638 [00:21:16] (03PS1) 10RobH: dhcp entries for new db systems [puppet] - 10https://gerrit.wikimedia.org/r/657710 (https://phabricator.wikimedia.org/T267043) [00:22:12] (03CR) 10RobH: [C: 03+2] dhcp entries for new db systems [puppet] - 10https://gerrit.wikimedia.org/r/657710 (https://phabricator.wikimedia.org/T267043) (owner: 10RobH) [00:23:39] brennen are we still going to roll 26 > 27? [00:23:50] (03CR) 10Urbanecm: [C: 03+2] Adjust AF config for ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657706 (https://phabricator.wikimedia.org/T272330) (owner: 10Daimona Eaytoy) [00:24:03] If not I will need to prepare a backport for 1.36.0-wmf.26 [00:24:14] ^^that was approved by brennen in other channel^^ [00:24:17] Jdlrobson: yeah, a quick config patch going out first then that. 
[00:24:32] 👍 [00:25:24] (03Merged) 10jenkins-bot: Adjust AF config for ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657706 (https://phabricator.wikimedia.org/T272330) (owner: 10Daimona Eaytoy) [00:25:54] Jdlrobson: i perhaps wasn't quite thinking clearly about this, but i guess that should be backported to .26 in case a rollback becomes necessary regardless? [00:26:00] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [00:27:07] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d4f5d6f09977962be1c49471432125a92357ede6: Temporarily amend ukwiki AF configuration (T272330) (duration: 01m 03s) [00:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:11] brennen: over to you :) [00:28:03] Urbanecm: ack, rolling to group2. [00:28:45] (03PS1) 10Brennen Bearnes: all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657712 [00:28:47] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657712 (owner: 10Brennen Bearnes) [00:29:31] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657712 (owner: 10Brennen Bearnes) [00:30:20] brennen: nope [00:30:23] provided we roll forward [00:30:27] no backport to .26 needed [00:30:37] the only reason we have an error is we backported after running bad wmf27 code [00:30:39] it's all a bit confusing [00:31:15] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['db1156.eqiad.wmnet', 'db1157.eqiad.wmnet', 'db1158.eqiad.wmnet',... [00:31:22] yeah, i'm very much used to thinking in the other direction. 
[00:31:26] haha me too [00:31:30] my head is almost exploding [00:31:37] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.27 [00:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:29] ok, there we are. Urbanecm, i think perhaps we should give it a minute or two, but you should be clear for backports after that. [00:33:57] \o/ well trained [00:34:13] brennen: ack. May I +2 backports to save some zuul waiting? [00:34:22] please do [00:34:48] annet: Kemayo: your backports will be ready once they merge, fyi :) [00:35:06] awesome [00:35:10] (03CR) 10Urbanecm: [C: 03+2] Distinguish between null continue value and unknown one [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657623 (https://phabricator.wikimedia.org/T272548) (owner: 10Anne Tomasevich) [00:35:14] Righteous. [00:35:20] brennen: we're live? [00:35:34] (03CR) 10Urbanecm: [C: 03+2] A/B test output when a specific feature is being tested [extensions/DiscussionTools] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657653 (https://phabricator.wikimedia.org/T268191) (owner: 10DLynch) [00:35:41] Jdlrobson: we are. i... think client errors are slowly dropping? [00:35:51] brennen: im not seeing the error when testing so that's good [00:35:52] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2370.codfw.wmnet [00:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:55] ill keep an eye on the graphs :) [00:36:13] 5 - 10 mins of data should show us we're clear [00:36:18] cool. 
[00:37:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2372.codfw.wmnet [00:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:07] so far so good :) [00:38:17] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [00:39:12] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [00:39:41] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2376.codfw.wmnet with reason: REIMAGE [00:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:13] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[00:41:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2376.codfw.wmnet with reason: REIMAGE [00:41:42] (03PS1) 10Urbanecm: Don't return the status of doBlockInternal when processing block actions [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657654 [00:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:45] (03PS2) 10Legoktm: Don't return the status of doBlockInternal when processing block actions [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657654 (owner: 10Urbanecm) [00:41:56] hah [00:42:00] eh, double cherry-pick legoktm :) [00:42:03] (03CR) 10Dzahn: "I amended to also remove the cumin alias. Double checking now" [puppet] - 10https://gerrit.wikimedia.org/r/657569 (https://phabricator.wikimedia.org/T272559) (owner: 10Muehlenhoff) [00:42:14] (03CR) 10Urbanecm: [C: 03+2] Don't return the status of doBlockInternal when processing block actions [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657654 (owner: 10Urbanecm) [00:42:45] Jdlrobson: looking pretty good i'd say. 
[00:43:34] thanks brennen [00:43:39] yeh they are tailing off [00:43:46] with caching might take a while for them to completely disappear [00:44:12] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1166.eqiad.wmnet with reason: REIMAGE [00:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:15] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE [00:44:15] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1164.eqiad.wmnet with reason: REIMAGE [00:44:16] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1168.eqiad.wmnet with reason: REIMAGE [00:44:16] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1169.eqiad.wmnet with reason: REIMAGE [00:44:16] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1170.eqiad.wmnet with reason: REIMAGE [00:44:16] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1160.eqiad.wmnet with reason: REIMAGE [00:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:16] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1165.eqiad.wmnet with reason: REIMAGE [00:44:17] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: REIMAGE [00:44:18] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1163.eqiad.wmnet with reason: REIMAGE [00:44:18] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1167.eqiad.wmnet with reason: REIMAGE [00:44:18] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: REIMAGE [00:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:18] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1161.eqiad.wmnet with reason: REIMAGE [00:44:19] !log 
robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE [00:44:19] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1174.eqiad.wmnet with reason: REIMAGE [00:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:30] so many pings [00:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:48] * Jdlrobson shakes his fist and shouts "get off my lawn!" 
[00:44:54] heh [00:45:14] i love the reimage script anyhow [00:45:25] * mutante adds more of that - also reimaging :) [00:45:28] i didnt have to manually image a dozen hosts, its awesome [00:45:29] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:24] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: REIMAGE [00:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:27] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1164.eqiad.wmnet with reason: REIMAGE [00:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:41] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10JKatzWMF) Approved, thanks! [00:47:26] robh: 250 new Icinga checks now in PENDING, heh [00:47:45] dbs gonna be happy when they wake up [00:47:49] all new metal [00:47:53] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2007824168 and 90 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:48:02] dbas that is [00:48:17] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE [00:48:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1169.eqiad.wmnet with reason: REIMAGE [00:48:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1170.eqiad.wmnet with reason: REIMAGE [00:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1160.eqiad.wmnet with 
reason: REIMAGE [00:48:19] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1165.eqiad.wmnet with reason: REIMAGE [00:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:39] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 169804832 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:48:41] we're probably going to need to extend the B&C for a while, considering zuul says 13 to 20 minutes for the pending backports [00:49:16] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1168.eqiad.wmnet with reason: REIMAGE [00:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:17] robh: when doing a lot of mw i would say maybe 1 in 10 or 1 in 20 I had "failed to set downtime" from reimaging script. 
but if i see it I just "manually" use the downtime cookbook [00:49:17] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1158.eqiad.wmnet with reason: REIMAGE [00:49:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1163.eqiad.wmnet with reason: REIMAGE [00:49:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1167.eqiad.wmnet with reason: REIMAGE [00:49:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1157.eqiad.wmnet with reason: REIMAGE [00:49:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1161.eqiad.wmnet with reason: REIMAGE [00:49:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1174.eqiad.wmnet with reason: REIMAGE [00:49:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE [00:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:37] robh: this is using regex in a single command? 
[00:49:43] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 202600 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:49:51] used: sudo -i wmf-auto-reimage -p T267043 --new --force db1156.eqiad.wmnet db1157.eqiad.wmnet db1158.eqiad.wmnet db1159.eqiad.wmnet db1160.eqiad.wmnet db1161.eqiad.wmnet db1162.eqiad.wmnet db1163.eqiad.wmnet db1164.eqiad.wmnet db1165.eqiad.wmnet db1166.eqiad.wmnet db1167.eqiad.wmnet db1168.eqiad.wmnet db1169.eqiad.wmnet db1170.eqiad.wmnet db1171.eqiad.wmnet db1172.eqiad.wmnet db1173.eqiad.wmnet db1174.eqiad.wmnet db1175.eqiad.wmnet [00:49:51] T267043: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 [00:49:59] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 392600 and 83 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:00] gotcha, nod [00:50:01] i wasnt sure how well regex would work [00:50:12] and that was super easy to macro out into a text to copy paste [00:51:26] bleh, i hope it all passes i dont wanna work late! [00:51:41] i dont understand why im seeing cumin puppet runs in output [00:51:50] 00:51:21 | cumin1001.eqiad.wmnet | Puppet run completed [00:51:54] that seems... not right. [00:52:01] right, i did it slightly different from both options and split my terminal into 4 and run the reimage script multiple times with one host each, in parallel [00:52:59] so i dont see anything earlier in output but seeing puppet run calls for the cumin host the script runs on isnt something ive noticed in the past [00:53:16] but if its just puppet runs it wont matter, perhaps its parsing all the log files and includes local, not really sure wtf is up with that [00:53:51] maybe it needs to run on cumin once host is done to pull puppet host keys... thats likely it but not sure [00:53:55] eh.. 
it is normal that it starts a puppet run on the host, but I don't see that on cumin1001 [00:54:10] ah [00:55:20] 00:54:43 | mw2374.codfw.wmnet | Polling until a Puppet sign request appears [00:55:23] 00:54:47 | mw2374.codfw.wmnet | Signed Puppet cert [00:55:29] yeah i hadnt noticed it before but it likely was always doing it [00:55:31] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:35] this is the expected part that I see when it gets to the first run [00:55:52] oh man this is so many hosts, ill split this in half next time [00:55:53] sounds reasonable [00:56:02] there are too many hosts flying by to monitor errors in real time [00:56:13] so ill have to parse for failure from report at end. [00:56:42] I stuck to 4 at a time for that reason, it seemed to be the right amount to still watch on the side while reasonably quick [00:56:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2374.codfw.wmnet with reason: REIMAGE [00:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:57:21] every once in a while there were special cases, for example remote IPMI needed fix or a bug to report in netbox script [00:57:41] you can also look at logs afterwards though [00:58:13] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1166.eqiad.wmnet', 'db1164.eqiad.wmnet', 'db1170.eqiad.wmnet', 'db1162.eqiad.wmnet', 'db1160.eqiad.wmnet', 'db1... 
[00:58:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2368.codfw.wmnet with reason: REIMAGE [00:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:31] well, if it's all brandnew then i'd go higher [00:58:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2374.codfw.wmnet with reason: REIMAGE [00:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2366.codfw.wmnet with reason: REIMAGE [00:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:05] (03PS3) 10Dzahn: Remove obsolete role installserver::apt [puppet] - 10https://gerrit.wikimedia.org/r/657569 (https://phabricator.wikimedia.org/T272559) (owner: 10Muehlenhoff) [01:00:41] !log Evening B&C still in process, waiting on Zuul [01:00:43] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1804276424 and 122 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:48] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/27585/" [puppet] - 10https://gerrit.wikimedia.org/r/657569 (https://phabricator.wikimedia.org/T272559) (owner: 10Muehlenhoff) [01:00:57] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7404324880 and 482 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:01:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2368.codfw.wmnet with reason: REIMAGE [01:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:41] 10SRE, 10serviceops, 10Release-Engineering-Team 
(Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2376.codfw.wmnet'] ` an... [01:02:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2376.codfw.wmnet [01:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2366.codfw.wmnet with reason: REIMAGE [01:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:38] (03Merged) 10jenkins-bot: Distinguish between null continue value and unknown one [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657623 (https://phabricator.wikimedia.org/T272548) (owner: 10Anne Tomasevich) [01:05:40] (03Merged) 10jenkins-bot: A/B test output when a specific feature is being tested [extensions/DiscussionTools] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657653 (https://phabricator.wikimedia.org/T268191) (owner: 10DLynch) [01:05:42] (03Merged) 10jenkins-bot: Don't return the status of doBlockInternal when processing block actions [extensions/AbuseFilter] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657654 (owner: 10Urbanecm) [01:05:48] \o/ [01:05:54] annet: Kemayo: are you around? [01:05:58] yep! 
[01:05:58] 🎉 [01:06:05] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2376.codfw.wmnet [01:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:14] ok, gimme a while to pull it all to debug servers :) [01:06:37] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) [01:07:15] annet: Kemayo: Daimona_: your backports are available at mwdebug1001, please test :) [01:07:51] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10RobH) > Of which those **FAILED**: > ` > ['db1159.eqiad.wmnet', 'db1171.eqiad.wmnet', 'db1172.eqiad.wmnet', 'db1173.eqiad.wmnet', 'db1175.eqiad.wmnet'] > ` I've updated... [01:08:02] Urbanecm: Mine'll just be verifying nothing is broken, until the accompanying config patch goes up. [01:08:16] ah, I'll pull the config patch as well then, gimme a second [01:08:38] Urbanecm: testing... 
[01:08:42] (03CR) 10Urbanecm: [C: 03+2] Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191) (owner: 10DLynch) [01:08:47] (03CR) 10Urbanecm: Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191) (owner: 10DLynch) [01:08:51] (03PS2) 10Urbanecm: Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191) (owner: 10DLynch) [01:08:56] (03CR) 10Urbanecm: [C: 03+2] Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191) (owner: 10DLynch) [01:09:17] Testing [01:09:34] That said, I do confirm that it has no obvious breaking effects *without* the config, which is always worth checking. :D [01:09:54] Urbanecm: all looks good! [01:10:04] thanks annet, will sync [01:10:11] Kemayo: great, thanks :) [01:10:11] thanks :) [01:10:27] (03PS2) 10Dzahn: admin: update SSH key for Volker_E [puppet] - 10https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) [01:10:53] (03Merged) 10jenkins-bot: Enroll idwiki in the DiscussionTools a/b test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657691 (https://phabricator.wikimedia.org/T268191) (owner: 10DLynch) [01:11:22] Kemayo: config patch fetched to mwdebug1001 as well [01:12:05] Urbanecm: Okay, looks good. [01:12:15] thanks, will sync Kemayo [01:12:38] Took me a minute to hunt down what the Indonesian for "reply" was to verify it, which maybe I should have prepped better for. >_> [01:12:38] i assume it should be synced "backport first, config second", or should it be the other way around Kemayo ? 
[01:12:38] Urbanecm: Tested, works: https://www.mediawiki.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=64 [01:12:57] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/WikibaseMediaInfo/: 4b0259b761681ca90b3f3039019553ddca40a5fe: Distinguish between null continue value and unknown one (T272548) (duration: 00m 59s) [01:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:00] Urbanecm: Yeah, backport first, please. [01:13:01] T272548: Adding a filter then changing tabs yields no results - https://phabricator.wikimedia.org/T272548 [01:13:04] Kemayo: ?uselang=en in your URL should fix that for you 🙂 [01:13:07] Kemayo: ack [01:13:11] annet: should be live :) [01:13:43] Urbanecm: confirmed, thanks very much! [01:14:01] no problem annet :) [01:14:28] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 70344 and 255 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:14:42] Though really, it's fairly tightly focused, so even if it's out of order it'll only cause an issue for 50% of logged in people on talk pages... if they're looking for something they don't normally expect to be there. 
[01:14:52] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/DiscussionTools/: 513a7861bbcf06a8ac5c29e1b9838640cbd7c628: A/B test output when a specific feature is being tested (T268191) (duration: 00m 55s) [01:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:56] T268191: Implement A/B test bucketing - https://phabricator.wikimedia.org/T268191 [01:15:30] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 12696 and 315 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:15:48] (CR) Dzahn: [C: -1] admin: update SSH key for Volker_E (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) (owner: Dzahn) [01:16:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 376cba1b33dd68d40490a1498c59a4d430318ab1: Enroll idwiki in the DiscussionTools a/b test (T268191) (duration: 00m 55s) [01:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:42] Kemayo: i prefer to test everything that can reasonably be tested, humans make mistakes, and I don't want to...break all of DiscussionTools, in the worst case 🙂 [01:16:49] anyway, should be live now [01:17:07] Urbanecm: Yup, looks all good on live. Thanks!
[01:17:49] no problem :) [01:18:43] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/AbuseFilter/: 7d8ab70d5b00142e8344e242dd085eb7bfa81145: Dont return the status of doBlockInternal when processing block actions (duration: 00m 59s) [01:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:51] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2374.codfw.wmnet'] ` an... [01:19:01] Daimona_: synced yours as well :) [01:19:20] so, we should be done [01:19:25] !log Evening B&C window finished [01:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:23] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2368.codfw.wmnet'] ` an... [01:20:25] Hooray! Thank you! [01:20:45] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment.eqiad.wmnet/Design Style Guide for VolkerE - https://phabricator.wikimedia.org/T272628 (Dzahn) @jcrespo @Muehlenhoff or anyone. This ticket is much easier than it looks. It's existing access and only an update of an existing... [01:21:29] thanks all for the assistance earlier. [01:21:29] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2366.codfw.wmnet'] ` an... [01:21:51] * brennen steps away from computer.
[01:22:03] (PS3) Dzahn: admin: update SSH key for Volker_E [puppet] - https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) [01:22:16] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2374.codfw.wmnet [01:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:22] (CR) DannyS712: [C: +1] mariadb: grant user 'phstats' additional select on phabricator_policy db [puppet] - https://gerrit.wikimedia.org/r/657692 (https://phabricator.wikimedia.org/T272654) (owner: Aklapper) [01:22:29] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2366.codfw.wmnet [01:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:37] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2368.codfw.wmnet [01:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:00] (CR) Dzahn: [C: +1] "looks good to me, you should ask marostegui /dba to deploy these though" [puppet] - https://gerrit.wikimedia.org/r/657692 (https://phabricator.wikimedia.org/T272654) (owner: Aklapper) [01:25:43] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2374.codfw.wmnet [01:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2366.codfw.wmnet [01:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:09] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2368.codfw.wmnet [01:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:58] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:35:42] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The
system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:36:18] off - no more reimages for now [02:10:46] SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (AntiCompositeNumber) Amazon has announced plans to fork Elasticsearch and Kibana under the original Apache 2.0 license (... [02:21:26] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [02:21:48] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [02:27:32] (PS4) CRusnov: interface_automation: Clean up old interfaces on run [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199 [02:27:34] (CR) CRusnov: interface_automation: Clean up old interfaces on run (3 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199 (owner: CRusnov) [02:29:07] (CR) CRusnov: "Thank you for reviewing as always 😊" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199 (owner: CRusnov) [02:30:20] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:36:42] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational
but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:44:43] (PS1) Andrew Bogott: Nova vendordata: fix initial apt repo for non-Buster [puppet] - https://gerrit.wikimedia.org/r/657720 (https://phabricator.wikimedia.org/T271273) [02:45:49] (CR) Andrew Bogott: [C: +2] Nova vendordata: fix initial apt repo for non-Buster [puppet] - https://gerrit.wikimedia.org/r/657720 (https://phabricator.wikimedia.org/T271273) (owner: Andrew Bogott) [02:51:02] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [02:51:28] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [03:09:16] (PS1) Andrew Bogott: Nova vendordata: move a bunch of file writes from boot script to cloud-init [puppet] - https://gerrit.wikimedia.org/r/657721 (https://phabricator.wikimedia.org/T271273) [03:20:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:22:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:31:52] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:38:34] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:11:43] SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (Tgr) Apparently there is a more community-oriented fork in the works, too: https://logz.io/blog/open-source-elasticsearc... [04:19:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:21:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:31:34] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:38:12] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:46:26] (PS1) Andrew Bogott: openldap_clouddev: fix for new acme ca [puppet] - https://gerrit.wikimedia.org/r/657723 [04:48:14] (CR) Andrew Bogott: [C: +2] openldap_clouddev: fix for new acme ca [puppet] - https://gerrit.wikimedia.org/r/657723 (owner: Andrew Bogott) [04:59:44] (PS2) Andrew Bogott: Nova vendordata: move a bunch of file writes from boot script to cloud-init [puppet] - https://gerrit.wikimedia.org/r/657721 (https://phabricator.wikimedia.org/T271273) [05:07:32] (PS3) Andrew Bogott: Nova vendordata: move a bunch of file writes from boot script to cloud-init [puppet] - https://gerrit.wikimedia.org/r/657721 (https://phabricator.wikimedia.org/T271273) [05:31:44] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:38:24] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:43:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db1118 weight', diff saved to https://phabricator.wikimedia.org/P13883 and previous config saved to /var/cache/conftool/dbconfig/20210122-054330-marostegui.json [05:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:52] SRE, DBA, Phabricator, Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (Marostegui) Open→Resolved a:Marostegui Change has been applied - thanks daniel for working out the patch!
[05:51:56] SRE, DBA, Phabricator, Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (Marostegui) Actually the original patch creator was @Aklapper so thank you too! :) [05:58:58] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (releases2002), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:00:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2142 into x2 as codfw master T269324', diff saved to https://phabricator.wikimedia.org/P13884 and previous config saved to /var/cache/conftool/dbconfig/20210122-060007-marostegui.json [06:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:11] T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324 [06:01:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db2143 and db2144 as x2 codfw slaves T269324', diff saved to https://phabricator.wikimedia.org/P13885 and previous config saved to /var/cache/conftool/dbconfig/20210122-060147-marostegui.json [06:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:03] (PS1) Marostegui: db2133,db2078: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/657725 (https://phabricator.wikimedia.org/T272614) [06:16:25] !log Stop MySQL on db1117 db2133 db2078 T272614 [06:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:28] T272614: m2 codfw master crashed - https://phabricator.wikimedia.org/T272614 [06:17:24] (CR) Marostegui: [C: +2] db2133,db2078: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/657725 (https://phabricator.wikimedia.org/T272614) (owner: Marostegui) [06:17:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:18:45] proxy alerts will arrive to irc [06:19:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:23:16] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [06:23:22] ^ expected [06:23:32] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 0 down 3 https://wikitech.wikimedia.org/wiki/HAProxy [06:24:40] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [06:26:38] ACKNOWLEDGEMENT - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui maintenance ongoing https://wikitech.wikimedia.org/wiki/HAProxy [06:26:38] ACKNOWLEDGEMENT - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 0 down 3 Marostegui maintenance ongoing https://wikitech.wikimedia.org/wiki/HAProxy [06:26:58] ACKNOWLEDGEMENT - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui maintenance ongoing https://wikitech.wikimedia.org/wiki/HAProxy [06:31:34] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:16] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:01] !log [wdqs] re-pooled `wdqs1013` (all caught up on lag) [06:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:45] !log [WDQS Deploy] All tests passing on canary `wdqs1003` before WDQS deploy, beginning deploy [06:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:52] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@70f9d37]: 0.3.60 [06:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:03] !log [WDQS Deploy] All tests passing on canary `wdqs1003` following canary WDQS deploy, proceeding to rest of fleet [06:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:35] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@70f9d37]: 0.3.60 (duration: 10m 43s) [06:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:30] !log [WDQS Deploy] Initial deploy complete, `query.wikidata.org` handles queries fine, proceeding to post-deploy steps [06:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:19] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [06:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:22] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [06:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:27] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [06:59:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:58] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [07:03:54] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [07:04:53] (CR) Marostegui: "I am ok with this, but up to Luca and the Analytics Team :)" [puppet] - https://gerrit.wikimedia.org/r/655093 (https://phabricator.wikimedia.org/T269211) (owner: Bstorm) [07:20:06] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:21:36] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:23:59] SRE, ops-eqiad, DBA, DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (Marostegui) @RobH it looks like db1163 has RAID0 instead of RAID10: ` root@db1163:~# megacli -LdPdInfo -a0 Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Targe...
[07:26:40] !log [WDQS Deploy] WDQS deploy complete; service is healthy [07:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:18] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:30:24] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:28] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:30:29] (CR) Elukey: [C: +1] wikireplicas: remove query killer from dedicated replica server [puppet] - https://gerrit.wikimedia.org/r/655093 (https://phabricator.wikimedia.org/T269211) (owner: Bstorm) [07:30:42] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 2, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:35:01] (PS6) Elukey: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: Joal) [07:35:06] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 2, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:36:50] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:36:56] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL -
degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:37:00] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:32] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment.eqiad.wmnet/Design Style Guide for VolkerE - https://phabricator.wikimedia.org/T272628 (Volker_E) Thanks for observing @Dzahn! [07:41:26] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:42:08] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:44:08] (PS1) Joal: profile::analytics::refinery::job::druid_load Bump jar version [puppet] - https://gerrit.wikimedia.org/r/657764 (https://phabricator.wikimedia.org/T271560) [07:44:35] elukey: --^ [07:51:15] SRE, Traffic, serviceops: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (JMeybohm) a:JMeybohm Unfortunately, upstream was not very responsive on my question about adding `Cache-Control` (https://github.com/helm/chartmuseum/issues/36... [07:52:08] (CR) Elukey: "There is a similar profile with "test" in the name, that runs in hadoop test, can you also update it?
After that I'll merge :)" [puppet] - https://gerrit.wikimedia.org/r/657764 (https://phabricator.wikimedia.org/T271560) (owner: Joal) [07:53:03] (PS1) Ryan Kemper: wcqs: create wcqs microsite && move gui [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) [07:55:44] (PS4) Elukey: Initial configuration of the Hadoop backup cluster [puppet] - https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411) [07:55:51] (PS1) Joal: Lower druid-public historical datasource backups [puppet] - https://gerrit.wikimedia.org/r/657766 (https://phabricator.wikimedia.org/T272670) [07:57:26] (CR) Elukey: [C: +2] Initial configuration of the Hadoop backup cluster [puppet] - https://gerrit.wikimedia.org/r/635751 (https://phabricator.wikimedia.org/T260411) (owner: Elukey) [07:58:53] (PS2) Joal: profile::analytics::refinery::job::druid_load Bump jar version [puppet] - https://gerrit.wikimedia.org/r/657764 (https://phabricator.wikimedia.org/T271560) [07:59:24] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27588/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: Ryan Kemper) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210122T0800) [08:04:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:06:28] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:11:47] (PS2) Ryan Kemper: wcqs: create wcqs microsite && move gui [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) [08:11:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:13:21] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27591/console" [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: Ryan Kemper) [08:13:22] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:13:28] (CR) jerkins-bot: [V: -1] wcqs: create wcqs microsite && move gui [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: Ryan Kemper) [08:15:02] (PS2) Muehlenhoff: Enable base::service_auto_restart for netbox/apache [puppet] - https://gerrit.wikimedia.org/r/656187 (https://phabricator.wikimedia.org/T135991) [08:15:49] (PS3) Ryan Kemper: wcqs: create wcqs microsite && move gui [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) [08:16:35] (CR) Ryan Kemper: "One thing I'm not sure of:" [puppet] - https://gerrit.wikimedia.org/r/657765
(https://phabricator.wikimedia.org/T271851) (owner: Ryan Kemper) [08:16:42] (PS1) Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) [08:17:37] (CR) jerkins-bot: [V: -1] wcqs: create wcqs microsite && move gui [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: Ryan Kemper) [08:18:12] (PS2) Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) [08:20:08] (CR) Ryan Kemper: "recheck" [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: Ryan Kemper) [08:21:01] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27593/console" [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: Ryan Kemper) [08:23:10] (PS1) Muehlenhoff: Enable managed adduser.conf unconditionally [puppet] - https://gerrit.wikimedia.org/r/657770 (https://phabricator.wikimedia.org/T235162) [08:26:31] (CR) David Caro: [C: +2] config: allow using ~ for cookbook path [software/spicerack] - https://gerrit.wikimedia.org/r/657608 (owner: David Caro) [08:26:33] (CR) David Caro: [C: +2] gitignore: add vim swap files [software/spicerack] - https://gerrit.wikimedia.org/r/657609 (owner: David Caro) [08:26:51] (CR) David Caro: [V: +2 C: +2] config: allow using ~ for cookbook path [software/spicerack] - https://gerrit.wikimedia.org/r/657608 (owner: David Caro) [08:27:00] (CR) David Caro: [V: +2 C: +2] gitignore: add vim swap files [software/spicerack] - https://gerrit.wikimedia.org/r/657609 (owner: David Caro) [08:27:38] (PS1) Ayounsi: Remove BGP for Zayo transit in ulsfo, eqiad, eqord [homer/public] -
https://gerrit.wikimedia.org/r/657771 (https://phabricator.wikimedia.org/T264772) [08:28:18] (PS1) JMeybohm: cache_text: Disable caching for helm-charts.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/657772 (https://phabricator.wikimedia.org/T272633) [08:28:26] (CR) David Caro: "Hmm... I think I should not have submitted, it seems that jenkins will..." [software/spicerack] - https://gerrit.wikimedia.org/r/657609 (owner: David Caro) [08:30:58] Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (MoritzMuehlenhoff) [08:31:07] (CR) Ema: [C: +1] cache_text: Disable caching for helm-charts.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/657772 (https://phabricator.wikimedia.org/T272633) (owner: JMeybohm) [08:31:38] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:48] !log update puppet compiler's facts [08:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:57] (CR) Ayounsi: [C: +2] Remove BGP for Zayo transit in ulsfo, eqiad, eqord [homer/public] - https://gerrit.wikimedia.org/r/657771 (https://phabricator.wikimedia.org/T264772) (owner: Ayounsi) [08:35:12] !log Remove BGP for Zayo transit in ulsfo, eqiad, eqord [08:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:46] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:41:31] (Merged) jenkins-bot: Remove BGP for Zayo transit in ulsfo, eqiad, eqord [homer/public] - https://gerrit.wikimedia.org/r/657771 (https://phabricator.wikimedia.org/T264772) (owner: Ayounsi) [08:42:57] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27597/console" [puppet] - https://gerrit.wikimedia.org/r/657772 (https://phabricator.wikimedia.org/T272633) (owner: JMeybohm) [08:43:31] (CR) JMeybohm: [V: +1 C: +2] cache_text: Disable caching for helm-charts.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/657772 (https://phabricator.wikimedia.org/T272633) (owner: JMeybohm) [08:44:37] !log installing mutt updates for stretch [08:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:39] !log installing PIP security updates for stretch [08:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:43] SRE, serviceops: Update php-xdebug to 2.9.2 in apt.wm.o component/php72 - https://phabricator.wikimedia.org/T244716 (hashar) [08:52:45] (CR) Gehel: [C: -1] "See comments inline" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: Ryan Kemper) [08:52:51] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (amy_rc) @jcrespo My internship started on 01.01.2021 and ends on 31.03.2021.
[08:53:20] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::druid_load Bump jar version [puppet] - 10https://gerrit.wikimedia.org/r/657764 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [08:53:20] 10SRE, 10FR-MW-Vagrant, 10Fundraising-Backlog, 10MediaWiki-Vagrant, 10Patch-For-Review: Package XDebug 2.9 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (10hashar) [08:56:48] (03PS7) 10Elukey: profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - 10https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: 10Joal) [08:59:02] (03CR) 10Gehel: [C: 04-1] wcqs: create wcqs microsite && move gui (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper) [09:00:55] (03PS1) 10Gehel: query_service: remove monitoring check for UI. [puppet] - 10https://gerrit.wikimedia.org/r/657773 (https://phabricator.wikimedia.org/T271851) [09:03:25] (03PS3) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) [09:03:27] (03PS1) 10Elukey: Move hiera config for Hadoop Backup to the correct location [puppet] - 10https://gerrit.wikimedia.org/r/657774 (https://phabricator.wikimedia.org/T260411) [09:04:39] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10jcrespo) [09:05:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10jcrespo) @Volker_E would it be possible to verify your identity and key ownership on a videocall with your manager? 
[09:05:37] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10jcrespo) p:05Triage→03High [09:05:49] (03CR) 10Elukey: [C: 03+2] Lower druid-public historical datasource backups [puppet] - 10https://gerrit.wikimedia.org/r/657766 (https://phabricator.wikimedia.org/T272670) (owner: 10Joal) [09:08:13] (03CR) 10DCausse: [C: 03+1] query_service: remove monitoring check for UI. [puppet] - 10https://gerrit.wikimedia.org/r/657773 (https://phabricator.wikimedia.org/T271851) (owner: 10Gehel) [09:08:25] (03CR) 10Hashar: "That change broke Puppet on deploy-1002.devtools.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) (owner: 10Ottomata) [09:08:35] (03CR) 10Gehel: [C: 03+2] query_service: remove monitoring check for UI. [puppet] - 10https://gerrit.wikimedia.org/r/657773 (https://phabricator.wikimedia.org/T271851) (owner: 10Gehel) [09:13:50] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) Thanks, amy_rc, I will put that as the time to communicate with your manager to review your access status by then :-) [09:19:46] 10SRE, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10Aklapper) [09:20:53] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov) [09:21:47] (03CR) 10Ladsgroup: [C: 04-1] wcqs: create wcqs microsite && move gui (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657765 (https://phabricator.wikimedia.org/T271851) (owner: 10Ryan Kemper) [09:27:39] (03CR) 10Elukey: [C: 03+2] Move hiera config for Hadoop Backup to the correct location [puppet] - 10https://gerrit.wikimedia.org/r/657774 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [09:27:46] RECOVERY - haproxy 
failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:28:02] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:28:39] (03PS4) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) [09:30:48] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:50] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:36:36] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:37] !log kormat@cumin1001 dbctl commit (dc=all): 'Temporarily add db1088 to api group T271106', diff saved to https://phabricator.wikimedia.org/P13887 and previous config saved to /var/cache/conftool/dbconfig/20210122-094337-kormat.json [09:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:41] T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106 [09:44:34] (03PS16) 10MSantos: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) [09:44:48] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1093.eqiad.wmnet with reason: Rebooting for T272255 [09:44:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1093.eqiad.wmnet with reason: Rebooting for T272255 [09:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:53] !log kormat@cumin1001 dbctl commit (dc=all): 'db1093 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13888 and previous config saved to /var/cache/conftool/dbconfig/20210122-094453-kormat.json [09:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:59] !log kormat@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13889 and previous config saved to /var/cache/conftool/dbconfig/20210122-095058-kormat.json [09:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:34] !log uploaded cairo 1.14.0-2.1+deb8u2+wmf1 to apt.wikimedia.org [09:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:24] (03PS1) 10Ayounsi: Add Lumen transit BGP in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/657777 (https://phabricator.wikimedia.org/T269808) [09:54:06] (03PS1) 10Marostegui: Revert "db2133,db2078: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657657 [09:55:33] (03CR) 10Marostegui: [C: 03+2] Revert "db2133,db2078: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/657657 (owner: 10Marostegui) [10:02:34] !log kormat@cumin1001 dbctl commit (dc=all): 'Temporarily add db1110 to api group T272255', diff saved to https://phabricator.wikimedia.org/P13890 and previous config saved to /var/cache/conftool/dbconfig/20210122-100233-kormat.json [10:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1130.eqiad.wmnet with reason: Rebooting for T272255 [10:03:03] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1130.eqiad.wmnet with reason: Rebooting for T272255 [10:03:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db1130 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13891 and previous config saved to /var/cache/conftool/dbconfig/20210122-100307-kormat.json [10:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:35] 10SRE, 10vm-requests: eqiad/codfw: 1 VMs requested for puppetboard - https://phabricator.wikimedia.org/T272683 (10MoritzMuehlenhoff) [10:06:02] !log kormat@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13892 and previous config saved to /var/cache/conftool/dbconfig/20210122-100602-kormat.json [10:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:35] !log kormat@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13893 and previous config saved to /var/cache/conftool/dbconfig/20210122-100734-kormat.json [10:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:08] (03PS1) 10Elukey: Add fake secrets for the Hadoop backup cluster [labs/private] - 10https://gerrit.wikimedia.org/r/657779 [10:08:11] 10SRE, 10DBA, 10Phabricator, 10Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (10Aklapper) 05Resolved→03Open >>! In T272654#6767902, @Marostegui wrote: > Change has been applied (Thanks everyone.) Hmm, https://gerrit.wikim... 
[10:08:43] (03CR) 10ArielGlenn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: 10Hoo man) [10:08:45] (03PS2) 10Elukey: Add fake secrets for the Hadoop backup cluster [labs/private] - 10https://gerrit.wikimedia.org/r/657779 [10:09:00] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake secrets for the Hadoop backup cluster [labs/private] - 10https://gerrit.wikimedia.org/r/657779 (owner: 10Elukey) [10:11:27] (03CR) 10Jcrespo: [C: 03+1] "Thank you daniel for the patch, I verified on video call Volker's identity through his manager Lucy and he confirmed this is his new key." [puppet] - 10https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) (owner: 10Dzahn) [10:13:46] (03PS3) 10Aklapper: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) (owner: 10A2569875) [10:15:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wikireplicas: set up LVS for multiinstance wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [10:16:05] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host archiva1002.wikimedia.org [10:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:12] 10SRE, 10DBA, 10Phabricator, 10Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (10Marostegui) 05Open→03Resolved I forgot the m3-slave CNAME uses the hostname directly instead of the proxy, which is not nice but we can fix th... 
[10:18:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host archiva1002.wikimedia.org [10:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:06] !log kormat@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13894 and previous config saved to /var/cache/conftool/dbconfig/20210122-102105-kormat.json [10:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:20] 10SRE, 10DBA, 10Phabricator, 10Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (10Aklapper) Yes, works now! Does that mean https://gerrit.wikimedia.org/r/c/operations/puppet/+/657692/ should be closed or abandoned or so? Thanks! <3 [10:22:38] !log kormat@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13895 and previous config saved to /var/cache/conftool/dbconfig/20210122-102237-kormat.json [10:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:00] (03PS1) 10David Caro: wmcs.enc: added a small cli to be able to use the enc [puppet] - 10https://gerrit.wikimedia.org/r/657780 (https://phabricator.wikimedia.org/T267412) [10:24:10] 10SRE, 10Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (10jcrespo) [10:27:15] (03PS1) 10Elukey: Add fake keytabs for the new hadoop worker nodes [labs/private] - 10https://gerrit.wikimedia.org/r/657781 [10:27:23] 10SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (10jcrespo) I reached this through alerts of backups of releases2002 not working since 2021-01-21. I will disable alerts for this host until it is fixed (please ping me to re-enable them when fixed).
[10:27:51] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake keytabs for the new hadoop worker nodes [labs/private] - 10https://gerrit.wikimedia.org/r/657781 (owner: 10Elukey) [10:29:23] (03PS1) 10Muehlenhoff: Adapt proxy setting in debmonitor nginx site for CAS [puppet] - 10https://gerrit.wikimedia.org/r/657782 [10:29:29] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27611/console" [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [10:30:35] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:38] (03PS1) 10Jcrespo: bacula: Ignore releases2002 backup errors until vm issues are fixed [puppet] - 10https://gerrit.wikimedia.org/r/657783 (https://phabricator.wikimedia.org/T272555) [10:30:59] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27612/console" [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [10:32:00] (03PS4) 10Jcrespo: admin: update SSH key for Volker_E [puppet] - 10https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) (owner: 10Dzahn) [10:33:39] (03CR) 10Jcrespo: [C: 03+2] admin: update SSH key for Volker_E [puppet] - 10https://gerrit.wikimedia.org/r/657705 (https://phabricator.wikimedia.org/T272628) (owner: 10Dzahn) [10:36:01] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10jcrespo) [10:36:09] !log kormat@cumin1001 dbctl commit (dc=all): 'db1093 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13897 and previous config saved to /var/cache/conftool/dbconfig/20210122-103609-kormat.json [10:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:41] !log kormat@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13898 and previous config saved to /var/cache/conftool/dbconfig/20210122-103741-kormat.json [10:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:03] (03PS5) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) [10:38:05] (03PS1) 10Elukey: profile::hadoop::worker: make client tools optional [puppet] - 10https://gerrit.wikimedia.org/r/657784 (https://phabricator.wikimedia.org/T260411) [10:39:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10jcrespo) 05Open→03Resolved a:03jcrespo @Volker_E The new key has just been deployed; it will take up to 30 minutes, more or less, to be deployed everywhere. Ple...
[10:40:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27613/console" [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [10:42:02] (03PS2) 10Elukey: profile::hadoop::worker: make client tools optional [puppet] - 10https://gerrit.wikimedia.org/r/657784 (https://phabricator.wikimedia.org/T260411) [10:42:04] (03PS6) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) [10:42:06] (03PS2) 10Jcrespo: bacula: Ignore releases2002 backup errors until vm issues are fixed [puppet] - 10https://gerrit.wikimedia.org/r/657783 (https://phabricator.wikimedia.org/T272555) [10:43:15] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27614/console" [puppet] - 10https://gerrit.wikimedia.org/r/657784 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [10:44:14] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::hadoop::worker: make client tools optional [puppet] - 10https://gerrit.wikimedia.org/r/657784 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [10:44:34] (03CR) 10Jcrespo: [C: 03+2] bacula: Ignore releases2002 backup errors until vm issues are fixed [puppet] - 10https://gerrit.wikimedia.org/r/657783 (https://phabricator.wikimedia.org/T272555) (owner: 10Jcrespo) [10:45:23] elukey, not sure if we merged at the same time or your change went through? [10:45:40] it merged already at prod, so no issue [10:46:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff) [10:46:17] jynus: I think so, I saw only my change and it didn't lead to errors afaics [10:46:23] indeed :-) [10:48:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, one minor comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657780 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [10:48:51] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:49:04] yay [10:49:32] if you can't fix a problem, just hide it ^ (I am joking) [10:50:29] (03PS1) 10Jcrespo: Revert "bacula: Ignore releases2002 backup errors until vm issues are fixed" [puppet] - 10https://gerrit.wikimedia.org/r/657659 [10:51:26] (03CR) 10Jcrespo: [C: 04-1] "To be deployed before closing T272555" [puppet] - 10https://gerrit.wikimedia.org/r/657659 (owner: 10Jcrespo) [10:52:45] !log kormat@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13899 and previous config saved to /var/cache/conftool/dbconfig/20210122-105244-kormat.json [10:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:45] !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db1088 from api group T271106', diff saved to https://phabricator.wikimedia.org/P13900 and previous config saved to /var/cache/conftool/dbconfig/20210122-105345-kormat.json [10:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:49] T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106 [10:54:35] (03PS7) 10Elukey: Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) [10:56:31] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1134.eqiad.wmnet with reason: Rebooting for T272255 [10:56:32] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1134.eqiad.wmnet with reason: Rebooting for T272255 [10:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:35] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:37] !log kormat@cumin1001 dbctl commit (dc=all): 'db1134 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13901 and previous config saved to /var/cache/conftool/dbconfig/20210122-105636-kormat.json [10:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:05] (03PS1) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 [10:57:34] (03CR) 10jerkins-bot: [V: 04-1] role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman) [10:58:47] (03PS2) 10David Caro: wmcs.enc: added a small cli to be able to use the enc [puppet] - 10https://gerrit.wikimedia.org/r/657780 (https://phabricator.wikimedia.org/T267412) [10:59:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Temporarily add db1127 to api group T272255', diff saved to https://phabricator.wikimedia.org/P13902 and previous config saved to /var/cache/conftool/dbconfig/20210122-105921-kormat.json [10:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1136.eqiad.wmnet with reason: Rebooting for T272255 [10:59:47] (03CR) 10David Caro: wmcs.enc: added a small cli to be able to use the enc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657780 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [10:59:47] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1136.eqiad.wmnet with reason: Rebooting for T272255 [10:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:52] !log kormat@cumin1001 dbctl commit (dc=all): 'db1136 depooling: Rebooting for T272255', diff saved to 
https://phabricator.wikimedia.org/P13903 and previous config saved to /var/cache/conftool/dbconfig/20210122-105952-kormat.json [10:59:54] (03PS2) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 [10:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:21] (03CR) 10jerkins-bot: [V: 04-1] role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman) [11:00:25] (03CR) 10David Caro: [C: 03+2] wmcs.enc: added a small cli to be able to use the enc [puppet] - 10https://gerrit.wikimedia.org/r/657780 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [11:01:12] (03PS3) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 [11:01:32] !log kormat@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13904 and previous config saved to /var/cache/conftool/dbconfig/20210122-110132-kormat.json [11:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:43] (03CR) 10jerkins-bot: [V: 04-1] role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman) [11:01:55] (03PS4) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 [11:02:23] (03CR) 10jerkins-bot: [V: 04-1] role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman) [11:02:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1141.eqiad.wmnet with reason: Rebooting for T272255 [11:02:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1141.eqiad.wmnet with reason: Rebooting for T272255 [11:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:29] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:30] !log kormat@cumin1001 dbctl commit (dc=all): 'db1141 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13905 and previous config saved to /var/cache/conftool/dbconfig/20210122-110229-kormat.json [11:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:55] !log deploy cairo updates to jessie [11:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:03] !log kormat@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13906 and previous config saved to /var/cache/conftool/dbconfig/20210122-110603-kormat.json [11:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:36] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27615/" [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff) [11:09:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader2001.wikimedia.org [11:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2001.wikimedia.org [11:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:47] (03PS5) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 [11:12:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [11:13:07] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27616/console" [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) 
[11:13:24] sigh... that should be silent :) [11:15:01] !log kormat@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13908 and previous config saved to /var/cache/conftool/dbconfig/20210122-111500-kormat.json [11:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader1001.wikimedia.org [11:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:36] !log kormat@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13909 and previous config saved to /var/cache/conftool/dbconfig/20210122-111635-kormat.json [11:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1001.wikimedia.org [11:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:13] (03PS8) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [11:19:25] (03CR) 10Jbond: sre: convert the generic reboot functions to the cookbook class API (0313 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [11:19:59] (03CR) 10Elukey: [C: 03+2] Add roles to the Hadoop Backup cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/657769 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [11:20:23] (03PS9) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [11:20:36] (03PS11) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [11:21:07] !log kormat@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 50%: Reboot T272255', 
diff saved to https://phabricator.wikimedia.org/P13910 and previous config saved to /var/cache/conftool/dbconfig/20210122-112106-kormat.json [11:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:18] (03PS1) 10Muehlenhoff: Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/657786 [11:23:43] (03PS12) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [11:24:15] !log joining restbase2009-a to cluster [11:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:24] !log kormat@cumin1001 dbctl commit (dc=all): 'es1023 depooling: enable report_host T271106', diff saved to https://phabricator.wikimedia.org/P13911 and previous config saved to /var/cache/conftool/dbconfig/20210122-112424-kormat.json [11:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:28] T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106 [11:24:43] RECOVERY - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-a valid until 2022-06-15 10:34:42 +0000 (expires in 508 days) https://phabricator.wikimedia.org/T120662 [11:24:49] RECOVERY - cassandra-a service on restbase2009 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:26:29] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mwdebug1003.eqiad.wmnet [11:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:04] !log kormat@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13912 and previous config saved to /var/cache/conftool/dbconfig/20210122-113004-kormat.json [11:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:47] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 
8.117 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [11:31:07] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:39] !log kormat@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13913 and previous config saved to /var/cache/conftool/dbconfig/20210122-113139-kormat.json [11:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:10] !log kormat@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13914 and previous config saved to /var/cache/conftool/dbconfig/20210122-113610-kormat.json [11:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:19] 10SRE, 10Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (10LSobanski) p:05Triage→03Medium Sounds like a good idea. Is this to address a specific concern that came up? One thing that comes to mind is the amount, re... [11:36:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on es1023.eqiad.wmnet with reason: Reboot for T272121 [11:36:35] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1023.eqiad.wmnet with reason: Reboot for T272121 [11:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:58] ACKNOWLEDGEMENT - MariaDB Replica IO: es5 #page on es1023 is CRITICAL: CRITICAL slave_io_state could not connect Kormat Forgot to DT host first... 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:36:59] ACKNOWLEDGEMENT - MariaDB Replica Lag: es5 #page on es1023 is CRITICAL: CRITICAL slave_sql_lag could not connect Kormat Forgot to DT host first... https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:37:00] ACKNOWLEDGEMENT - MariaDB Replica SQL: es5 #page on es1023 is CRITICAL: CRITICAL slave_sql_state could not connect Kormat Forgot to DT host first... https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:37:00] ACKNOWLEDGEMENT - mysqld processes #page on es1023 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Kormat Forgot to DT host first... https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:37:41] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:49] (PS1) Hnowlan: similar-users: release new version [deployment-charts] - https://gerrit.wikimedia.org/r/657789 [11:40:00] SRE, Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (jcrespo) > Is this to address a specific concern that came up? 
^@mark [11:40:08] (CR) Vgutierrez: [C: -1] "considering that you need to expose dbproxy1018 and dbproxy1019 I'd stick with two services, wikireplicas-web and wikireplicas-analytics" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm) [11:40:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1003.eqiad.wmnet [11:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:43] (PS1) Elukey: role::analytics_backup_cluster::hadoop::master: remove analytics keytab [puppet] - https://gerrit.wikimedia.org/r/657790 [11:42:24] can someone give me this week's train ticket link? [11:43:08] (CR) Elukey: [C: +2] role::analytics_backup_cluster::hadoop::master: remove analytics keytab [puppet] - https://gerrit.wikimedia.org/r/657790 (owner: Elukey) [11:43:10] SRE, Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (mark) It's purely an idea I've had for a long time, to make it immediately obvious to anyone logging in what is backed up, and what isn't. That should help to... [11:44:51] nvm. 
[11:45:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13915 and previous config saved to /var/cache/conftool/dbconfig/20210122-114507-kormat.json [11:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:43] !log kormat@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13916 and previous config saved to /var/cache/conftool/dbconfig/20210122-114642-kormat.json [11:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:03] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01032 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:49:05] SRE, Data-Persistence-Backup: print a list of backed up directories in the MOTD of production servers - https://phabricator.wikimedia.org/T272686 (jcrespo) 2 notices: * This could only cover direct bacula backups (things that are indirectly backed up, like puppet or gerrit repos) or database and media ba... [11:50:09] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on es1023.eqiad.wmnet with reason: Extended reboot for T272121 [11:50:10] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on es1023.eqiad.wmnet with reason: Extended reboot for T272121 [11:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:31] PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:37] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:05] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:07] PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:14] !log kormat@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13917 and previous config saved to /var/cache/conftool/dbconfig/20210122-115113-kormat.json [11:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:27] (CR) Hnowlan: [C: +2] similar-users: release new version [deployment-charts] - https://gerrit.wikimedia.org/r/657789 (owner: Hnowlan) [11:53:01] (Merged) jenkins-bot: similar-users: release new version [deployment-charts] - https://gerrit.wikimedia.org/r/657789 (owner: Hnowlan) [11:54:31] PROBLEM - Hadoop DataNode on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:54:31] PROBLEM - Hadoop NodeManager on an-worker1136 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:54:53] RECOVERY - Host releases2002 is UP: PING OK - Packet loss = 0%, RTA = 32.31 ms [11:54:55] !log hnowlan@deploy1001 
helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . [11:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:37] RECOVERY - Hadoop DataNode on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:56:49] PROBLEM - Check systemd state on an-worker1127 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:49] PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:51] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:56:51] PROBLEM - Hadoop DataNode on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [11:58:25] sorrryyy this is me [11:58:28] new cluster, just downtimed [11:59:03] (PS1) Elukey: profile::hadoop::worker: add spark2 back [puppet] - https://gerrit.wikimedia.org/r/657791 [11:59:18] SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (akosiaris) I think the following explains it: ` root@releases2002:~# ip addr ls 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00... 
[12:00:11] !log kormat@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13918 and previous config saved to /var/cache/conftool/dbconfig/20210122-120011-kormat.json [12:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:40] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27617/console" [puppet] - https://gerrit.wikimedia.org/r/657791 (owner: Elukey) [12:00:54] SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (akosiaris) Open→Resolved a:akosiaris Anyway, s/ens5/ens6/ in /etc/network/interfaces and the issue has been fixed. I was wondering whether it makes sense to invest time to "fix" this but having me... [12:02:10] (Abandoned) Elukey: profile::hadoop::worker: add spark2 back [puppet] - https://gerrit.wikimedia.org/r/657791 (owner: Elukey) [12:02:58] SRE, LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (Aklapper) >>! In T272489#6766201, @jcrespo wrote: > on the Engineering's handbook To avoid misunderstandings and to allow me to get a better overview of onboarding docs, which "Engineering's handbook... [12:03:56] (PS1) Muehlenhoff: debmonitor: Also allow localhost [puppet] - https://gerrit.wikimedia.org/r/657793 [12:05:17] SRE, LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (jcrespo) @Aklapper I pinned down the source of poorly documented request to https://office.wikimedia.org/w/index.php?title=Guide_for_new_engineering_staff&diff=prev&oldid=284690 and tried to fix it t... 
[12:06:40] (CR) Lucas Werkmeister (WMDE): [C: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man) [12:08:48] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1135,1137].eqiad.wmnet [12:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:29] (PS10) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 [12:10:39] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1135,1137].eqiad.wmnet [12:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:06] SRE, vm-requests: eqiad/codfw: 1 VMs requested for puppetboard - https://phabricator.wikimedia.org/T272683 (jbond) Looks good to me [12:11:47] (PS13) Jbond: cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139 [12:12:29] (CR) jerkins-bot: [V: -1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond) [12:12:51] RECOVERY - Hadoop DataNode on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [12:14:18] (CR) jerkins-bot: [V: -1] cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139 (owner: Jbond) [12:16:33] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:19:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:26:42] (PS1) Muehlenhoff: debmonitor: Don't include debmonitor_static for the internal listener [puppet] - https://gerrit.wikimedia.org/r/657795 [12:30:44] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:33:03] SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (jcrespo) Resolved→Open Let me reopen for reenabling backup monitoring (even if main issue has been fixed). Thanks @akosiaris for the help here. [12:33:24] SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (jcrespo) a:akosiaris→jcrespo [12:33:39] !log volker-e@deploy1001 Started deploy [design/style-guide@9a811b8]: Deploy design/style-guide: 9a811b8 Add Language selectors to component overview Sketch document (#424) [12:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:46] !log volker-e@deploy1001 Finished deploy [design/style-guide@9a811b8]: Deploy design/style-guide: 9a811b8 Add Language selectors to component overview Sketch document (#424) (duration: 00m 07s) [12:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:56] (PS11) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 [12:35:18] !log kormat@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 25%: Reboot T272121', diff saved to https://phabricator.wikimedia.org/P13919 and previous config saved to 
/var/cache/conftool/dbconfig/20210122-123518-kormat.json [12:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:56] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:19] (PS14) Jbond: cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139 [12:38:31] (PS2) Jcrespo: Revert "bacula: Ignore releases2002 backup errors until vm issues are fixed" [puppet] - https://gerrit.wikimedia.org/r/657659 [12:38:33] !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db1127 from api group T272255', diff saved to https://phabricator.wikimedia.org/P13920 and previous config saved to /var/cache/conftool/dbconfig/20210122-123832-kormat.json [12:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:47] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host puppetboard1002.eqiad.wmnet [12:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:50] (PS1) Elukey: profile::hadoop::backup::namenode: add dep to analytics-admin [puppet] - https://gerrit.wikimedia.org/r/657797 [12:39:42] (PS15) Jbond: cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139 [12:40:17] (CR) Jcrespo: [C: +2] Revert "bacula: Ignore releases2002 backup errors until vm issues are fixed" [puppet] - https://gerrit.wikimedia.org/r/657659 (owner: Jcrespo) [12:41:16] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27619/console" [puppet] - https://gerrit.wikimedia.org/r/657791 (owner: Elukey) [12:41:21] (CR) jerkins-bot: [V: -1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond) [12:42:04] 
(CR) jerkins-bot: [V: -1] cookbook sre.apt.reboot: [cookbooks] - https://gerrit.wikimedia.org/r/656139 (owner: Jbond) [12:43:11] !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db1110 from api group T272255', diff saved to https://phabricator.wikimedia.org/P13921 and previous config saved to /var/cache/conftool/dbconfig/20210122-124310-kormat.json [12:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:03] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 4 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27620/console" [puppet] - https://gerrit.wikimedia.org/r/657797 (owner: Elukey) [12:46:44] (PS12) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 [12:46:51] (CR) Elukey: [V: +1 C: +2] profile::hadoop::backup::namenode: add dep to analytics-admin [puppet] - https://gerrit.wikimedia.org/r/657797 (owner: Elukey) [12:47:09] SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (jcrespo) Open→Resolved a:jcrespo→akosiaris Backing up 0 bytes was unsurprisingly fast :-) Thanks again to both of you. 
[12:47:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1149.eqiad.wmnet with reason: Rebooting for T272255 [12:47:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1149.eqiad.wmnet with reason: Rebooting for T272255 [12:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:49] !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 depooling: Rebooting for T272255', diff saved to https://phabricator.wikimedia.org/P13922 and previous config saved to /var/cache/conftool/dbconfig/20210122-124748-kormat.json [12:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:51] (CR) jerkins-bot: [V: -1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond) [12:50:06] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.429 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [12:50:22] !log kormat@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 50%: Reboot T272121', diff saved to https://phabricator.wikimedia.org/P13923 and previous config saved to /var/cache/conftool/dbconfig/20210122-125021-kormat.json [12:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:00] SRE, LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (Aklapper) Ah, thanks! 
<3 [12:52:24] (CR) Muehlenhoff: role::ml-serve: Add ml-serve machine role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657785 (owner: Klausman) [12:53:05] (PS1) Elukey: Use profile::standard::admin_groups in Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/657800 [12:53:06] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host puppetboard1002.eqiad.wmnet [12:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:54] (CR) Elukey: [C: +2] Use profile::standard::admin_groups in Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/657800 (owner: Elukey) [12:54:29] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host puppetboard2002.codfw.wmnet [12:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:33] (PS1) Jcrespo: mariadb-backups: Add docs for logical backups grants throughout production dbs [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T146149) [12:56:13] (PS1) Jbond: sre.misc-clusters.scb: create batch action cook book for scb [cookbooks] - https://gerrit.wikimedia.org/r/657802 [12:56:46] (CR) Jcrespo: "🙈" [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T146149) (owner: Jcrespo) [12:59:29] (CR) jerkins-bot: [V: -1] sre.misc-clusters.scb: create batch action cook book for scb [cookbooks] - https://gerrit.wikimedia.org/r/657802 (owner: Jbond) [12:59:33] !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 25%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13924 and previous config saved to /var/cache/conftool/dbconfig/20210122-125932-kormat.json [12:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:25] !log kormat@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 75%: Reboot T272121', diff saved to 
https://phabricator.wikimedia.org/P13925 and previous config saved to /var/cache/conftool/dbconfig/20210122-130525-kormat.json [13:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:28] (PS1) Elukey: hadoop: make Yarn Spark Shuffle optional [puppet] - https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) [13:06:47] (Abandoned) Jbond: sre.misc-clusters.scb: create batch action cook book for scb [cookbooks] - https://gerrit.wikimedia.org/r/657802 (owner: Jbond) [13:07:30] (Restored) Jbond: sre.misc-clusters.scb: create batch action cook book for scb [cookbooks] - https://gerrit.wikimedia.org/r/657802 (owner: Jbond) [13:07:37] (PS2) Elukey: hadoop: make Yarn Spark Shuffle optional [puppet] - https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) [13:08:08] (CR) Jbond: sre.misc-clusters.scb: create batch action cook book for scb (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/657802 (owner: Jbond) [13:09:04] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27621/console" [puppet] - https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) (owner: Elukey) [13:10:06] (PS2) Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T146149) [13:11:34] (PS13) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 [13:11:37] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27622/console" [puppet] - https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) (owner: Elukey) [13:11:39] (CR) Marostegui: [C: +2] mariadb: grant user 'phstats' 
additional select on phabricator_policy db [puppet] - https://gerrit.wikimedia.org/r/657692 (https://phabricator.wikimedia.org/T272654) (owner: Aklapper) [13:12:02] SRE, DBA, Phabricator, Patch-For-Review: Grant phstats user SELECT rights for phabricator_policy database - https://phabricator.wikimedia.org/T272654 (Marostegui) Just merged it! :) [13:13:37] (CR) jerkins-bot: [V: -1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond) [13:14:36] (PS3) Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929) [13:14:36] !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 50%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13926 and previous config saved to /var/cache/conftool/dbconfig/20210122-131436-kormat.json [13:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:19] (CR) Elukey: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: Razzi) [13:19:53] (Abandoned) Jcrespo: mariadb-backups: Setup x2 production backups [puppet] - https://gerrit.wikimedia.org/r/649820 (https://phabricator.wikimedia.org/T269324) (owner: Jcrespo) [13:20:29] !log kormat@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 100%: Reboot T272121', diff saved to https://phabricator.wikimedia.org/P13927 and previous config saved to /var/cache/conftool/dbconfig/20210122-132028-kormat.json [13:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:16] (PS3) Elukey: hadoop: make Yarn Spark Shuffle optional [puppet] - https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) [13:21:32] 
!log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host puppetboard2002.codfw.wmnet [13:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:43] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27623/console" [puppet] - https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) (owner: Elukey) [13:24:54] (PS4) Elukey: hadoop: make Yarn Spark Shuffle optional [puppet] - https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) [13:25:06] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:24] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27624/console" [puppet] - https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) (owner: Elukey) [13:28:17] (CR) Elukey: [V: +1 C: +2] hadoop: make Yarn Spark Shuffle optional [puppet] - https://gerrit.wikimedia.org/r/657805 (https://phabricator.wikimedia.org/T260411) (owner: Elukey) [13:29:41] !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 75%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13929 and previous config saved to /var/cache/conftool/dbconfig/20210122-132939-kormat.json [13:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:02] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:30:04] RECOVERY - Hadoop NodeManager on an-worker1136 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P13930 and previous config saved to /var/cache/conftool/dbconfig/20210122-133044-marostegui.json [13:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:00] PROBLEM - Check systemd state on ms-be1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:06] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:06] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:15] !log Stop replication on db1121 [13:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:20] RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:30] RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:30] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:02] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004848 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:33:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1121', diff saved to 
https://phabricator.wikimedia.org/P13931 and previous config saved to /var/cache/conftool/dbconfig/20210122-133341-marostegui.json [13:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:14] (PS1) Elukey: profile::prometheus::analytics: add metrics for the Backup cluster [puppet] - https://gerrit.wikimedia.org/r/657810 (https://phabricator.wikimedia.org/T260411) [13:35:40] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.925 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [13:36:30] (PS1) Muehlenhoff: Add puppetboard[12]002 [puppet] - https://gerrit.wikimedia.org/r/657812 (https://phabricator.wikimedia.org/T264276) [13:37:38] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:22] (CR) Elukey: [C: +2] profile::prometheus::analytics: add metrics for the Backup cluster [puppet] - https://gerrit.wikimedia.org/r/657810 (https://phabricator.wikimedia.org/T260411) (owner: Elukey) [13:40:35] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6736006, @Gilles wrote: > Awesome, glad to see that the bisecting paid off! Still 361 commits between thos... 
[13:44:45] !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 100%: Reboot T272255', diff saved to https://phabricator.wikimedia.org/P13932 and previous config saved to /var/cache/conftool/dbconfig/20210122-134444-kormat.json [13:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:17] (PS6) Klausman: role::ml-serve: Add ml-serve machine role [puppet] - https://gerrit.wikimedia.org/r/657785 [13:45:54] (CR) Klausman: role::ml-serve: Add ml-serve machine role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657785 (owner: Klausman) [13:53:19] (CR) Elukey: "Is role::ml_serve.pp already in puppet? Also left some comments :)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/657785 (owner: Klausman) [13:57:16] RECOVERY - Check systemd state on ms-be1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:35] (CR) Muehlenhoff: [C: +2] Add puppetboard[12]002 [puppet] - https://gerrit.wikimedia.org/r/657812 (https://phabricator.wikimedia.org/T264276) (owner: Muehlenhoff) [14:04:07] (PS1) Elukey: role::analytics_backup_cluster::hadoop::master: use kerberos [puppet] - https://gerrit.wikimedia.org/r/657816 [14:05:34] (CR) Elukey: [C: +2] role::analytics_backup_cluster::hadoop::master: use kerberos [puppet] - https://gerrit.wikimedia.org/r/657816 (owner: Elukey) [14:09:12] (PS1) Ottomata: Revert "Render kafka cluster connection info in helmfile-defaults/general-*.yaml" [puppet] - https://gerrit.wikimedia.org/r/657827 [14:10:49] (CR) jerkins-bot: [V: -1] Revert "Render kafka cluster connection info in helmfile-defaults/general-*.yaml" [puppet] - https://gerrit.wikimedia.org/r/657827 (owner: Ottomata) [14:15:22] (PS2) Ottomata: Revert "Render kafka cluster connection info in helmfile-defaults/general-*.yaml" [puppet] - https://gerrit.wikimedia.org/r/657827 [14:16:00] 
(03PS3) 10Ottomata: Revert "Render kafka cluster connection info in helmfile-defaults..." [puppet] - 10https://gerrit.wikimedia.org/r/657827 [14:17:09] (03CR) 10JMeybohm: [C: 03+1] Revert "Render kafka cluster connection info in helmfile-defaults..." [puppet] - 10https://gerrit.wikimedia.org/r/657827 (owner: 10Ottomata) [14:17:35] (03CR) 10jerkins-bot: [V: 04-1] Revert "Render kafka cluster connection info in helmfile-defaults..." [puppet] - 10https://gerrit.wikimedia.org/r/657827 (owner: 10Ottomata) [14:18:45] (03PS1) 10Jcrespo: mariadb: Remove mariadb module mysqld_safe [puppet] - 10https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559) [14:19:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:20:20] (03PS7) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 [14:20:48] (03CR) 10jerkins-bot: [V: 04-1] role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman) [14:20:56] (03CR) 10Klausman: "> Patch Set 6:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman) [14:21:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:24:19] (03PS1) 10Jcrespo: mariadb: Remove mariadb module mylvmbackup [puppet] - 10https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559) [14:26:30] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jcrespo) [14:27:51] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10jcrespo) [14:29:01] (03PS4) 10Ottomata: Revert "Render kafka cluster connection info in helmfile-defaults..." [puppet] - 10https://gerrit.wikimedia.org/r/657827 [14:29:29] (03CR) 10Jbond: "updated thanks" (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond) [14:30:16] (03PS8) 10Klausman: role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 [14:30:25] (03PS2) 10Jbond: icinga: add wait_for_optimal function [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 [14:30:35] (03CR) 10Alexandros Kosiaris: [C: 03+1] Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/657786 (owner: 10Muehlenhoff) [14:30:38] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:56] 10SRE, 10Traffic, 10serviceops: ChartMuseum responses are cached in the CDN with default (24h) ttl - https://phabricator.wikimedia.org/T272633 (10CDanis) >>! In T272633#6768020, @JMeybohm wrote: > Unfortunately, upstream was not very responsive on my question about adding `Cache-Control` (https://github.com/... 
[14:37:07] (03CR) 10jerkins-bot: [V: 04-1] icinga: add wait_for_optimal function [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond) [14:37:22] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:28] (03PS4) 10Jbond: dns: update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 [14:43:35] (03CR) 10Jbond: "updated thanks" (034 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond) [14:45:33] (03CR) 10jerkins-bot: [V: 04-1] dns: update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond) [14:46:45] (03CR) 10Ottomata: [C: 03+2] Revert "Render kafka cluster connection info in helmfile-defaults..." [puppet] - 10https://gerrit.wikimedia.org/r/657827 (owner: 10Ottomata) [14:58:30] (03PS4) 10Andrew Bogott: Nova vendordata: move a bunch of file writes from boot script to cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657721 (https://phabricator.wikimedia.org/T271273) [14:59:22] (03PS1) 10Gehel: query_service: fix failing WDQS SPARQL icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) [15:03:22] (03CR) 10Elukey: [C: 03+1] "As initial set up looks good to me. 
I had a chat with Tobias about having a different namespace for the cluster, like 'machine_learning' e" [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman) [15:07:06] (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: ensure HDFS directories [puppet] - 10https://gerrit.wikimedia.org/r/657850 [15:09:56] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::coordinator: ensure HDFS directories [puppet] - 10https://gerrit.wikimedia.org/r/657850 (owner: 10Elukey) [15:11:04] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.033 second response time on 10.192.48.54 port 9042 https://phabricator.wikimedia.org/T93886 [15:11:46] RECOVERY - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-b valid until 2022-06-15 10:34:44 +0000 (expires in 508 days) https://phabricator.wikimedia.org/T120662 [15:11:48] (03PS1) 10Ottomata: eventstreams - move client_ip_connection_limit setting to helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/657852 (https://phabricator.wikimedia.org/T269160) [15:11:52] RECOVERY - cassandra-b service on restbase2009 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:12:19] (03CR) 10Klausman: [C: 03+2] role::ml-serve: Add ml-serve machine role [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman) [15:13:14] (03CR) 10Elukey: "didn't add it in time but https://puppet-compiler.wmflabs.org/compiler1003/27626/ looks good as well :)" [puppet] - 10https://gerrit.wikimedia.org/r/657785 (owner: 10Klausman) [15:16:35] (03CR) 10Gehel: query_service: fix failing WDQS SPARQL icinga check. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) (owner: 10Gehel) [15:18:08] (03CR) 10ZPapierski: [C: 03+1] query_service: fix failing WDQS SPARQL icinga check. 
[puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) (owner: 10Gehel) [15:23:29] (03CR) 10Ottomata: [C: 03+2] eventstreams - move client_ip_connection_limit setting to helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/657852 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [15:23:41] (03CR) 10DCausse: [C: 03+1] query_service: fix failing WDQS SPARQL icinga check. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) (owner: 10Gehel) [15:24:21] !log installing puppetboard2002 [15:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:46] (03PS1) 10Alexandros Kosiaris: Add a linkrecommendation-external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) [15:25:00] (03CR) 10Gehel: query_service: fix failing WDQS SPARQL icinga check. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) (owner: 10Gehel) [15:25:53] (03PS2) 10Gehel: query_service: fix failing WDQS SPARQL icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) [15:26:01] (03PS3) 10Gehel: query_service: fix failing WDQS SPARQL icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) [15:29:00] (03PS1) 10Elukey: role::ml_serve: fix hiera filename [puppet] - 10https://gerrit.wikimedia.org/r/657857 [15:29:38] (03CR) 10Elukey: [C: 03+2] role::ml_serve: fix hiera filename [puppet] - 10https://gerrit.wikimedia.org/r/657857 (owner: 10Elukey) [15:29:48] (03CR) 10Gehel: [C: 03+2] query_service: fix failing WDQS SPARQL icinga check. 
[puppet] - 10https://gerrit.wikimedia.org/r/657848 (https://phabricator.wikimedia.org/T272713) (owner: 10Gehel) [15:30:23] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:02] (03PS2) 10Bstorm: wikireplicas: remove query killer from dedicated replica server [puppet] - 10https://gerrit.wikimedia.org/r/655093 (https://phabricator.wikimedia.org/T269211) [15:31:40] (03CR) 10Jbond: "recheck" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond) [15:33:07] (03CR) 10Alexandros Kosiaris: "Adding the team for input on the approach:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [15:34:14] (03PS1) 10Elukey: cumin: add aliases for the ml-serve cluster [puppet] - 10https://gerrit.wikimedia.org/r/657858 [15:34:17] (03CR) 10jerkins-bot: [V: 04-1] dns: update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond) [15:36:13] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:50] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:14] PROBLEM - WDQS SPARQL on wdqs1003 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:14] PROBLEM - WDQS SPARQL on wdqs1008 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:14] PROBLEM - WDQS SPARQL on wdqs1011 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:15] PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:16] PROBLEM - WDQS SPARQL on wdqs2005 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:17] PROBLEM - WDQS SPARQL on wdqs2006 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:18] PROBLEM - WDQS SPARQL on wdqs2008 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:19] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:20] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:21] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: The command defined for service 
WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:22] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:23] PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:23] 10SRE, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) Adding https://metallb.universe.tf/ as a potential solution as well. [15:37:24] PROBLEM - WDQS SPARQL on wdqs2001 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:25] PROBLEM - WDQS SPARQL on wdqs2002 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:26] PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: The command defined for service WDQS SPARQL does not exist https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:32] WDQS SPARQL errors are me, checking for a typo [15:38:18] (03CR) 10Bstorm: [C: 03+2] "This probably won't *remove* anything from the existing setup, but when things are rebuild multi-instance, it won't get installed on it." [puppet] - 10https://gerrit.wikimedia.org/r/655093 (https://phabricator.wikimedia.org/T269211) (owner: 10Bstorm) [15:38:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, picking a canary won't hurt, though?" 
[puppet] - 10https://gerrit.wikimedia.org/r/657858 (owner: 10Elukey) [15:38:51] 10SRE, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10Dwisehaupt) This has been pushed to the frack puppet instance and will be rolling across hosts in the next few minutes. [15:39:47] (03CR) 10CRusnov: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/656187 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:40:08] (03CR) 10CRusnov: [C: 03+2] interface_automation: Clean up old interfaces on run [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov) [15:40:36] (03CR) 10Elukey: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/657858 (owner: 10Elukey) [15:40:52] !log installing puppetboard1002 [15:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:15] (03PS1) 10Andrew Bogott: nova vendordata: fix issues with the cloud-init apt config [puppet] - 10https://gerrit.wikimedia.org/r/657859 (https://phabricator.wikimedia.org/T271273) [15:41:17] (03CR) 10Ema: [C: 04-1] varnish: Set debug=1 in X-Analytics header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: 10Effie Mouzeli) [15:41:25] (03PS4) 10CRusnov: interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207 [15:41:30] (03PS2) 10Elukey: cumin: add aliases for the ml-serve cluster [puppet] - 10https://gerrit.wikimedia.org/r/657858 [15:41:36] (03CR) 10Andrew Bogott: [C: 03+2] Nova vendordata: move a bunch of file writes from boot script to cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/657721 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [15:42:19] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: fix issues with the cloud-init apt config [puppet] - 
10https://gerrit.wikimedia.org/r/657859 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [15:42:26] (03PS2) 10Andrew Bogott: nova vendordata: fix issues with the cloud-init apt config [puppet] - 10https://gerrit.wikimedia.org/r/657859 (https://phabricator.wikimedia.org/T271273) [15:43:15] (03PS1) 10Ottomata: eventstreams-internal - only 1 replica needed in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/657860 (https://phabricator.wikimedia.org/T269160) [15:44:21] (03CR) 10Elukey: [C: 03+2] cumin: add aliases for the ml-serve cluster [puppet] - 10https://gerrit.wikimedia.org/r/657858 (owner: 10Elukey) [15:45:59] (03CR) 10Ottomata: [C: 03+2] eventstreams-internal - only 1 replica needed in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/657860 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [15:51:15] (03PS1) 10Gehel: query_service: use a SPARQL query that is agnostic of the updater. [puppet] - 10https://gerrit.wikimedia.org/r/657861 (https://phabricator.wikimedia.org/T272713) [15:54:17] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [15:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:03] (03CR) 10DCausse: [C: 03+1] query_service: use a SPARQL query that is agnostic of the updater. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657861 (https://phabricator.wikimedia.org/T272713) (owner: 10Gehel) [15:56:13] (03PS2) 10Gehel: query_service: use a SPARQL query that is agnostic of the updater. [puppet] - 10https://gerrit.wikimedia.org/r/657861 (https://phabricator.wikimedia.org/T272713) [15:58:19] (03CR) 10Gehel: [C: 03+2] query_service: use a SPARQL query that is agnostic of the updater. 
[puppet] - 10https://gerrit.wikimedia.org/r/657861 (https://phabricator.wikimedia.org/T272713) (owner: 10Gehel) [16:03:02] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Bstorm) [16:04:25] (03CR) 10Ottomata: [C: 03+1] eventlogging: Remove multiple unused modules [puppet] - 10https://gerrit.wikimedia.org/r/657538 (https://phabricator.wikimedia.org/T272559) (owner: 10Ladsgroup) [16:09:53] (03CR) 10CRusnov: [C: 03+2] interface_automation: Fix `interface` reference in IP address assignment [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657207 (owner: 10CRusnov) [16:11:23] (03PS2) 10Jbond: sre.misc-clusters.scb: create batch action cook book for scb [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 [16:13:05] (03PS3) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 [16:13:17] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Bstorm) [16:13:36] (03PS1) 10Klausman: admin: Let the ml-team admins become root on ml-serve* [puppet] - 10https://gerrit.wikimedia.org/r/657864 [16:16:53] (03CR) 10jerkins-bot: [V: 04-1] sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [16:19:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:19:34] !log restart of backup source hosts on codfw T271913 [16:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:52] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.062 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:12] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.069 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:12] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.117 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:12] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:13] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.227 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:14] RECOVERY - WDQS SPARQL on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:15] RECOVERY - WDQS SPARQL on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:16] RECOVERY - WDQS SPARQL on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.193 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:17] RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:44] (03PS1) 10Andrew Bogott: Update eqiad1 designate to version 'train' [puppet] - 10https://gerrit.wikimedia.org/r/657866 (https://phabricator.wikimedia.org/T261135) [16:26:31] (03CR) 10Andrew Bogott: [C: 03+2] Update eqiad1 designate to version 'train' [puppet] - 10https://gerrit.wikimedia.org/r/657866 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [16:29:49] 10SRE, 10vm-requests: eqiad: 1 of VMs 
requested for Cumin - https://phabricator.wikimedia.org/T272349 (10MoritzMuehlenhoff) 05Open→03Resolved This has been created [16:30:03] 10SRE, 10vm-requests: eqiad/codfw: 1 VMs requested for puppetboard - https://phabricator.wikimedia.org/T272683 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff These have been created. [16:30:23] (03CR) 10Razzi: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [16:30:47] (03PS4) 10Razzi: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) [16:30:58] (03PS1) 10David Caro: wmcs.wmcs-enc-cli: Format data before sending [puppet] - 10https://gerrit.wikimedia.org/r/657869 [16:31:07] (03PS1) 10Andrew Bogott: Move eqiad1/horizon to openstack train [puppet] - 10https://gerrit.wikimedia.org/r/657870 (https://phabricator.wikimedia.org/T261134) [16:31:32] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.wmcs-enc-cli: Format data before sending [puppet] - 10https://gerrit.wikimedia.org/r/657869 (owner: 10David Caro) [16:31:47] (03CR) 10Andrew Bogott: [C: 03+2] Move eqiad1/horizon to openstack train [puppet] - 10https://gerrit.wikimedia.org/r/657870 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [16:33:26] 10SRE, 10Phatality, 10observability, 10Developer Productivity: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes - https://phabricator.wikimedia.org/T237706 (10colewhite) 05Open→03Invalid We have upgraded to Kibana 7 which renders this task invalid. There is still... [16:33:44] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:54] 10SRE, 10ops-eqiad, 10DC-Ops: frdev1001 ILO inaccessible - https://phabricator.wikimedia.org/T267969 (10Cmjohnson) 05Open→03Resolved It turns out all the ILO network settings were reset, fixed on the console and ILO is accessible. [16:40:52] !log replacing optics/fiber pfw3a-eqiad:xe-0/0/17 and fasw-c1a-eqiad:xe-0/2/0 T271295 [16:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:58] T271295: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 [16:42:22] (03CR) 10Elukey: "This seems good to me, but I think that it would be better if we track this in a task (tagged with SRE-Access-Request) so people are aware" [puppet] - 10https://gerrit.wikimedia.org/r/657864 (owner: 10Klausman) [16:42:59] (03CR) 10Bstorm: "> Patch Set 6: Code-Review-1" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [16:44:08] !log ppchelko@deploy1001 Started deploy [restbase/deploy@e54225d]: T270411 T270415 T270281 T270277 [16:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:20] T270277: Add diqwiktionary to RESTBase - https://phabricator.wikimedia.org/T270277 [16:44:20] T270415: Add niawiki to RESTBase - https://phabricator.wikimedia.org/T270415 [16:44:20] T270411: Add niawiktionary to RESTBase - https://phabricator.wikimedia.org/T270411 [16:44:21] T270281: Add bclwiktionary to RESTBase - https://phabricator.wikimedia.org/T270281 [16:44:34] (03CR) 10Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [16:47:10] (03CR) 10David Caro: [C: 03+2] wmcs.wmcs-enc-cli: Format data before sending [puppet] - 10https://gerrit.wikimedia.org/r/657869 (owner: 10David Caro) [16:48:48] (03PS2) 
10Klausman: admin: Let the ml-team admins become root on ml-serve* [puppet] - 10https://gerrit.wikimedia.org/r/657864 (https://phabricator.wikimedia.org/T272687) [16:49:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:49:18] (03CR) 10jerkins-bot: [V: 04-1] admin: Let the ml-team admins become root on ml-serve* [puppet] - 10https://gerrit.wikimedia.org/r/657864 (https://phabricator.wikimedia.org/T272687) (owner: 10Klausman) [16:49:40] RECOVERY - Check systemd state on ms-be1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:49] (03CR) 10Klausman: "> This seems good to me, but I think that it would be better if we track this in a task (tagged with SRE-Access-Request) so people are awa" [puppet] - 10https://gerrit.wikimedia.org/r/657864 (https://phabricator.wikimedia.org/T272687) (owner: 10Klausman) [16:50:23] (03PS3) 10Klausman: admin: Let the ml-team admins become root on ml-serve* [puppet] - 10https://gerrit.wikimedia.org/r/657864 (https://phabricator.wikimedia.org/T272687) [16:51:00] (03CR) 10Klausman: [C: 03+2] admin: Let the ml-team admins become root on ml-serve* [puppet] - 10https://gerrit.wikimedia.org/r/657864 (https://phabricator.wikimedia.org/T272687) (owner: 10Klausman) [16:51:52] 10SRE, 10ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (10Cmjohnson) 05Open→03Resolved Everything is replaced, kept the same cable number if that matters. 
[16:57:41] (03PS7) 10Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) [17:01:43] (03PS1) 10Mforns: analytics:refinery: Bump up Refine/DruidLoad jar versions to 0.0.145 [puppet] - 10https://gerrit.wikimedia.org/r/657875 [17:02:40] (03CR) 10Bstorm: "> Patch Set 6:" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [17:05:59] (03CR) 10Ottomata: [C: 03+2] analytics:refinery: Bump up Refine/DruidLoad jar versions to 0.0.145 [puppet] - 10https://gerrit.wikimedia.org/r/657875 (owner: 10Mforns) [17:12:28] (03PS1) 10Jbond: sre.apt.audit: add script to audit manually installed package [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 [17:13:04] !log mforns@deploy1001 Started deploy [analytics/refinery@eea071d]: Extra bug-fix train [analytics/refinery@eea071def90a8a856b1e04dda23b77a850134253] [17:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:31] (03CR) 10Vgutierrez: "> Patch Set 7:" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [17:19:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:32] 10SRE, 10SRE-Access-Requests: Replace SSH key for cluster access for VolkerE - https://phabricator.wikimedia.org/T272628 (10Dzahn) @jcrespo Thank you :) [17:19:50] 10SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (10Dzahn) @akosiaris Thank you!:) [17:19:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:21:51] (03PS2) 10Jbond: sre.apt.audit: add script to audit manually installed package [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 [17:22:36] (03CR) 10Dzahn: "thanks for the fix .. and the revert" [puppet] - 10https://gerrit.wikimedia.org/r/657783 (https://phabricator.wikimedia.org/T272555) (owner: 10Jcrespo) [17:23:07] !log mforns@deploy1001 Finished deploy [analytics/refinery@eea071d]: Extra bug-fix train [analytics/refinery@eea071def90a8a856b1e04dda23b77a850134253] (duration: 10m 03s) [17:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:27] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:24:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:25:20] (03PS3) 10Jbond: sre.apt.audit: add script to audit manually installed package [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 [17:25:45] RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:28:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:40] !log mforns@deploy1001 Started deploy [analytics/refinery@eea071d] (thin): Extra bug-fix train THIN [analytics/refinery@eea071def90a8a856b1e04dda23b77a850134253] [17:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:48] !log mforns@deploy1001 Finished deploy [analytics/refinery@eea071d] (thin): Extra bug-fix train THIN [analytics/refinery@eea071def90a8a856b1e04dda23b77a850134253] (duration: 00m 07s) [17:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:31] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:30:46] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:31:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[17:31:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:32:52] (03CR) 10Bstorm: "> Patch Set 7:" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [17:35:19] (03CR) 10Vgutierrez: wikireplicas: set up LVS for multiinstance wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [17:38:47] (03CR) 10Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm) [17:49:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2358.codfw.wmnet with reason: REIMAGE [17:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:45] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@e54225d]: T270411 T270415 T270281 T270277 (duration: 65m 37s) [17:49:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2360.codfw.wmnet with reason: REIMAGE [17:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:50] T270277: Add diqwiktionary to RESTBase - https://phabricator.wikimedia.org/T270277 [17:49:51] T270415: Add niawiki to RESTBase - https://phabricator.wikimedia.org/T270415 [17:49:51] T270411: Add niawiktionary to RESTBase - https://phabricator.wikimedia.org/T270411 [17:49:51] T270281: Add bclwiktionary to RESTBase - https://phabricator.wikimedia.org/T270281 [17:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 
2:00:00 on mw2362.codfw.wmnet with reason: REIMAGE [17:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2364.codfw.wmnet with reason: REIMAGE [17:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:42] (03Abandoned) 10Ottomata: eventgate-{main,analytics,logging-external} - bump to 2020-12-02-151648-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/644922 (https://phabricator.wikimedia.org/T266573) (owner: 10Ottomata) [17:52:07] !log releases2001 - create new partition table with fdisk, make ext4 filesystem on /dev/vdb1 [17:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2362.codfw.wmnet with reason: REIMAGE [17:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:23] (03PS1) 10Ottomata: eventgate-main - bump to 2021-01-22-173634-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/657885 (https://phabricator.wikimedia.org/T262226) [17:52:38] (03CR) 10Ottomata: "To deploy on Monday." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/657885 (https://phabricator.wikimedia.org/T262226) (owner: 10Ottomata) [17:53:59] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2358.codfw.wmnet with reason: REIMAGE [17:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:47] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2360.codfw.wmnet with reason: REIMAGE [17:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:35] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2364.codfw.wmnet with reason: REIMAGE [17:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:10] (03PS4) 10Jbond: (WIP) sre.apt.audit: produce a report of manually packages [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 [17:57:43] !log releases1002 (releases.wm.org active backend) - rebooting - hopefully it does not run into T272555 but if it does now it's known how to fix [17:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:47] T272555: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 [17:58:51] PROBLEM - PHP7 rendering on mw2360 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:59:03] (03CR) 10jerkins-bot: [V: 04-1] (WIP) sre.apt.audit: produce a report of manually packages [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 (owner: 10Jbond) [17:59:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mw2360.codfw.wmnet with reason: new install on buster [17:59:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mw2360.codfw.wmnet with reason: new install on buster [17:59:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] PROBLEM - Host releases1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on releases1002.eqiad.wmnet with reason: fixing networking - added disk [18:01:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on releases1002.eqiad.wmnet with reason: fixing networking - added disk [18:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:37] mutante: same issue as with releases2002? [18:02:19] RECOVERY - Host releases1002 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [18:02:39] marxarelli: yes, same issue. but thanks to alex I know the fix and applied it [18:03:06] ack [18:03:07] !log releases1002 - replaced ens5 with ens6 in /etc/network/interfaces and rebooted again [18:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:19] marxarelli: releases2002 already has the new disk mounted at /srv/docker [18:05:09] and ... same now on releases1002. created ext4 fs and mounted [18:05:11] got it. that works. i'll add the right `profile::docker::settings` then [18:05:26] ty! [18:05:55] I think when we replace these VMs, we will just make larger disks, so we don't need to worry more about it [18:06:05] PROBLEM - mediawiki-installation DSH group on mw2364 is CRITICAL: Host mw2364 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:06:06] except adding to fstab so they don't disappear after reboot.. 
doing that [18:06:44] 10SRE, 10Machine Learning Platform, 10SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (10calbon) kbazira needs access too [18:07:40] it looks like 147 / 140 free after i told Ganeti i want 150G, it's 1000 vs 1024 [18:07:53] RECOVERY - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is OK: TCP OK - 0.034 second response time on 10.192.48.55 port 9042 https://phabricator.wikimedia.org/T93886 [18:08:17] PROBLEM - Apache HTTP on mw2364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [18:09:12] mutante: does that want downtiming if it's you doing buster ^ [18:10:21] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2364 is CRITICAL: Host mw2364 is not in mediawiki-installation dsh group daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:10:21] ACKNOWLEDGEMENT - Apache HTTP on mw2364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Application_servers [18:10:21] ACKNOWLEDGEMENT - Host mw2364 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reimaging [18:10:45] RECOVERY - cassandra-c service on restbase2009 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:10:52] RhinosF1: yea, the thing is that maybe 1 out of 10 cases setting the downtime fails. fixed! [18:10:59] RECOVERY - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is OK: SSL OK - Certificate restbase2009-c valid until 2022-06-15 10:34:47 +0000 (expires in 508 days) https://phabricator.wikimedia.org/T120662 [18:11:06] mutante: np! 
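Note on the 18:07:40 remark ("it's 1000 vs 1024"): one possible reading is that the 150G requested from Ganeti was counted in decimal gigabytes (10^9 bytes) while the tool reporting free space counts in binary gibibytes (2^30 bytes). The sketch below only works through that arithmetic. The helper name `gb_to_gib` is hypothetical, whether Ganeti really interprets "150G" as decimal is an assumption that the log does not confirm, and filesystem overhead (which could also explain part of the gap) is ignored.

```python
# Hypothetical helper, not part of any Wikimedia tooling:
# convert decimal gigabytes (10**9 bytes) to binary gibibytes (2**30 bytes).
def gb_to_gib(gigabytes: float) -> float:
    return gigabytes * 10**9 / 2**30


requested_gb = 150  # the size quoted in the chat
# Under the decimal-vs-binary assumption, this is the size a GiB-based tool would report.
print(f"{requested_gb} GB is about {gb_to_gib(requested_gb):.1f} GiB")
```

Under that assumption the result is roughly 139.7 GiB, which is in line with the "140 free" figure mentioned in the channel.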
[18:11:29] RECOVERY - PHP7 rendering on mw2360 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [18:11:31] RECOVERY - Apache HTTP on mw2364 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:11:42] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2358.codfw.wmnet'] ` an... [18:11:58] talking about icinga alerts.. because I am looking at the web UI.. what about all the wdqs alerts about SPARQL [18:12:00] mutante: 140G is great :) thank you [18:12:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2360.codfw.wmnet'] ` an... [18:12:20] marxarelli: you're welcome, i'll give the ticket back in a moment [18:12:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2362.codfw.wmnet'] ` an... [18:12:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2364.codfw.wmnet'] ` an... 
[18:17:45] !log releases2002 - rebooting to confirm works now and also new disk gets auto-mounted [18:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:02] (03PS5) 10Dduvall: releases: Provide docker to PipelineLib based jobs [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) [18:20:30] (03PS1) 10Jbond: enable-puppet: allow fall back to enable puppet disabled by root [puppet] - 10https://gerrit.wikimedia.org/r/657889 (https://phabricator.wikimedia.org/T272539) [18:20:38] marxarelli: confirmed releases2002 survives reboot and the new disk gets mounted automatically. done [18:20:47] \o/ [18:21:04] i've tweaked the puppet patch to set `data-dir: /srv/docker` [18:25:42] great! *nod* [18:28:03] 10SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (10Dzahn) releases1002 had the exact same issue.. so confirmed it was caused by adding the new disk. The same fix (ens5->ens6) also resolved it again. [18:28:59] (03PS1) 10Bstorm: data-services: apply user variances to future creations [puppet] - 10https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) [18:29:41] 10Puppet, 10SRE, 10Patch-For-Review: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10jbond) i sent a quick patch for this. in an ideal world it would be good to make it so only cumin* can disable puppet as root and when it dose so it also appends the user string (simi... [18:33:04] (03CR) 10Dzahn: [C: 03+2] releases: Provide docker to PipelineLib based jobs [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [18:34:02] (03CR) 10Dzahn: [C: 03+2] "I created /srv/docker." 
[puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [18:35:40] marxarelli: oops, Function lookup() did not find a value for the name 'profile::docker::engine::declare_service [18:36:05] mutante: oh shoot. ok [18:36:06] * marxarelli looks [18:37:24] looks like i touched that last, heh [18:37:34] but just hiera->lookup replacement [18:38:10] let's see what it is for contint1001... [18:38:19] marxarelli: i guess it should be false [18:38:21] ok [18:38:26] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:38:36] "# We want this to be on if we want to use a different docker systemd service (with flannel support, for eg.)" [18:38:38] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:38:53] that seems right [18:39:02] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Wikifeeds [18:39:14] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:39:36] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:39:56] (03PS1) 10Jgreen: replace check_swap with check_memory globally in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/657893 [18:39:59] marxarelli: somehow this is not set on contint and i just see it on kubernetes [18:40:20] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:40:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:40:44] i see it declared in `hieradata/role/common/builder.yaml` too [18:40:58] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:41:03] but i don't think that applies to contints [18:41:11] contint is not using the docker::engine class [18:41:17] apparently 
[18:41:23] ok [18:41:32] i'll just set it to false for releases [18:41:38] that should be fine [18:41:39] yea, let's do that [18:41:45] are you making a patch? [18:42:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:42:33] yep! [18:43:03] cool, will merge, be back in a minute [18:43:29] (03PS1) 10Andrew Bogott: Designate monitoring: make regexp for designate-api detection more lenient [puppet] - 10https://gerrit.wikimedia.org/r/657894 [18:44:09] (03CR) 10Jgreen: [C: 03+2] replace check_swap with check_memory globally in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/657893 (owner: 10Jgreen) [18:44:18] (03CR) 10Andrew Bogott: [C: 03+2] Designate monitoring: make regexp for designate-api detection more lenient [puppet] - 10https://gerrit.wikimedia.org/r/657894 (owner: 10Andrew Bogott) [18:44:39] (03PS1) 10Dduvall: releases: Set declare_service: false for docker [puppet] - 10https://gerrit.wikimedia.org/r/657895 (https://phabricator.wikimedia.org/T271477) [18:45:14] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:45:28] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [18:45:28] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed 
out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:45:54] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:45:55] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2358.codfw.wmnet [18:45:55] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2360.codfw.wmnet [18:45:55] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2364.codfw.wmnet [18:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:16] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2362.codfw.wmnet [18:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:36] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [18:46:46] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2358.codfw.wmnet [18:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:02] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2360.codfw.wmnet [18:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:26] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:47:26] RECOVERY 
- wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [18:47:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2362.codfw.wmnet [18:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:43] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2364.codfw.wmnet [18:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:49:09] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:50:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:51:13] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[18:51:57] (03CR) 10Dzahn: [C: 03+2] releases: Set declare_service: false for docker [puppet] - 10https://gerrit.wikimedia.org/r/657895 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [18:52:36] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:53:10] Docker::Configuration/File[/etc/docker]/ensure: created [18:53:46] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:55:27] mutante: great! 
and looks like the jenkins agent has access to the socket so we're good to go [18:57:10] marxarelli: can confirm there is now dockerd and docker-containerd running on both releases* [18:57:54] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [18:58:03] we are checking AQS --^ [18:58:20] elukey: thank you [18:58:25] marxarelli: claimed T208529 is resolved [18:58:26] T208529: Install docker on releases-jenkins - https://phabricator.wikimedia.org/T208529 [18:58:36] 10SRE, 10SRE-tools: Use static PHIDs instead of fragile Phab project names in in modules/icinga/files/raid_handler.py - https://phabricator.wikimedia.org/T272233 (10jcrespo) p:05Triage→03Low Marking it as low, hopefully to be done at some point, but I couldn't find the time this week, and it is unlikely to... [18:59:01] mutante: ack. ty! [19:00:54] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:42] oh.. 
let's see ^ [19:02:20] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:02:30] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:03:36] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:03:37] marxarelli: grbml... ifup@ens6.service loaded failed failed ifup for ens6 [19:03:40] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:03:49] that is related to the fix with the renamed interface [19:03:50] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [19:04:00] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 12.85 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [19:04:03] and shows up now after docker was added.. 
hrmm [19:05:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:05:40] PROBLEM - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is CRITICAL: connect to address 10.64.48.148 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [19:06:32] PROBLEM - cassandra-a service on aqs1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:06:48] RECOVERY - mediawiki-installation DSH group on mw2364 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:07:04] PROBLEM - Check systemd state on aqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:08] mutante: that's odd. and releases2002 seems fine [19:07:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:07:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2356.codfw.wmnet with reason: REIMAGE [19:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2354.codfw.wmnet with reason: REIMAGE [19:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:43] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10KFrancis) @jcrespo I am confirming the NDA is fully executed. Please proceed with the access request. Thanks! 
[19:09:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2352.codfw.wmnet with reason: REIMAGE [19:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2356.codfw.wmnet with reason: REIMAGE [19:09:54] !log releases1002 systemctl reset-failed [19:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2350.codfw.wmnet with reason: REIMAGE [19:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2354.codfw.wmnet with reason: REIMAGE [19:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:48] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 10.86 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [19:13:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2352.codfw.wmnet with reason: REIMAGE [19:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:16] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2350.codfw.wmnet with reason: REIMAGE [19:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:19:06] PROBLEM - Memcached on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [19:19:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:22:50] ACKNOWLEDGEMENT - Memcached on mw2350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Memcached [19:29:36] RECOVERY - cassandra-a service on aqs1006 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:30:22] RECOVERY - Check systemd state on aqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:26] RECOVERY - Memcached on mw2350 is OK: TCP OK - 0.034 second response time on 10.192.32.200 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [19:30:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2354.codfw.wmnet'] ` an... [19:31:15] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2356.codfw.wmnet'] ` an... 
[19:31:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2352.codfw.wmnet'] ` an... [19:32:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2350.codfw.wmnet'] ` an... [19:34:43] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2356.codfw.wmnet [19:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:06] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2354.codfw.wmnet [19:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2350.codfw.wmnet [19:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:46] RECOVERY - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is OK: TCP OK - 0.000 second response time on 10.64.48.148 port 9042 https://phabricator.wikimedia.org/T93886 [19:35:49] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2352.codfw.wmnet [19:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2350.codfw.wmnet [19:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:06] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:30] 10Puppet, 10SRE, 10Patch-For-Review: run-puppet-agent --enable flag 
is broken - https://phabricator.wikimedia.org/T272539 (10jbond) >>! In T272539#6769671, @jbond wrote: > i sent a quick patch for this. in an ideal world it would be good to make it so only cumin* can disable puppet as root and when it dose... [19:38:17] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2352.codfw.wmnet [19:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:27] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2354.codfw.wmnet [19:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2356.codfw.wmnet [19:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:42] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:41:00] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:41:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[19:41:27] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:41:32] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:43:35] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:45:04] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:49:30] 10SRE, 10observability: Grafana error: "parse error at char 1: unexpected character: '\\ufeff'" when copy-pasting metric names - https://phabricator.wikimedia.org/T263624 (10colewhite) I was able to reproduce this issue with select+middle click. Upstream has an issue on file: https://github.com/grafana/grafan... 
[19:53:10] (03PS2) 10Legoktm: docker_registry_ha: Add timestamp to build-homepage output [puppet] - 10https://gerrit.wikimedia.org/r/657678 (https://phabricator.wikimedia.org/T179696) [19:55:22] (03CR) 10Legoktm: [C: 03+2] docker_registry_ha: Add timestamp to build-homepage output [puppet] - 10https://gerrit.wikimedia.org/r/657678 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [19:55:36] (03CR) 10Ayounsi: [C: 03+2] Add Lumen transit BGP in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/657777 (https://phabricator.wikimedia.org/T269808) (owner: 10Ayounsi) [19:56:27] (03Merged) 10jenkins-bot: Add Lumen transit BGP in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/657777 (https://phabricator.wikimedia.org/T269808) (owner: 10Ayounsi) [19:57:50] 10SRE: releases2002 ganeti VM not getting IP after reboot - https://phabricator.wikimedia.org/T272555 (10Dzahn) >>! In T272555#6768676, @akosiaris wrote: > let's document this. added this https://wikitech.wikimedia.org/w/index.php?title=Ganeti&type=revision&diff=1894909&oldid=1893790 [19:58:26] (03PS2) 10Legoktm: libraryupgrader: Update celery systemd units [puppet] - 10https://gerrit.wikimedia.org/r/657459 [19:59:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2334.codfw.wmnet with reason: REIMAGE [19:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:00] (03CR) 10Legoktm: [C: 03+2] libraryupgrader: Update celery systemd units [puppet] - 10https://gerrit.wikimedia.org/r/657459 (owner: 10Legoktm) [20:00:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2332.codfw.wmnet with reason: REIMAGE [20:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2330.codfw.wmnet with reason: REIMAGE [20:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:29] !log 
dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2328.codfw.wmnet with reason: REIMAGE [20:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1413.eqiad.wmnet with reason: REIMAGE [20:01:00] (03CR) 10ArielGlenn: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: 10Hoo man) [20:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2334.codfw.wmnet with reason: REIMAGE [20:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:51] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1413.eqiad.wmnet with reason: REIMAGE [20:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2328.codfw.wmnet with reason: REIMAGE [20:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:01] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2332.codfw.wmnet with reason: REIMAGE [20:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:11] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2330.codfw.wmnet with reason: REIMAGE [20:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1268.eqiad.wmnet with reason: REIMAGE [20:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:24] PROBLEM - mediawiki-installation DSH group on mw2330 is 
CRITICAL: Host mw2330 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:06:30] PROBLEM - PHP7 rendering on mw2332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:06:34] PROBLEM - Memcached on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [20:07:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1268.eqiad.wmnet with reason: REIMAGE [20:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:57] PROBLEM - Apache HTTP on mw2330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:12:30] ACKNOWLEDGEMENT - Memcached on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Memcached [20:12:30] ACKNOWLEDGEMENT - PHP7 rendering on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:12:30] ACKNOWLEDGEMENT - Apache HTTP on mw2330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Application_servers [20:12:30] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2330 is CRITICAL: Host mw2330 is not in mediawiki-installation dsh group daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:12:30] ACKNOWLEDGEMENT - PHP7 rendering on mw2332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn reimaging https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:13:17] (03PS2) 10Legoktm: scap: Switch require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/655795 
(https://phabricator.wikimedia.org/T266479) [20:20:05] PROBLEM - Host mw2332 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:11] RECOVERY - Host mw2332 is UP: PING OK - Packet loss = 0%, RTA = 31.87 ms [20:21:31] PROBLEM - Host mw1413 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:57] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2334.codfw.wmnet'] ` an... [20:22:27] RECOVERY - Host mw1413 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:22:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2330.codfw.wmnet'] ` an... [20:22:36] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2328.codfw.wmnet'] ` an... [20:23:07] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2332.codfw.wmnet'] ` an... [20:23:51] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1413.eqiad.wmnet'] ` an... 
[20:25:33] RECOVERY - Memcached on mw1413 is OK: TCP OK - 0.000 second response time on 10.64.32.132 port 11210 https://wikitech.wikimedia.org/wiki/Memcached [20:25:53] (03PS1) 10Legoktm: threedtopng: Switch require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/657903 (https://phabricator.wikimedia.org/T266479) [20:26:55] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27630/console" [puppet] - 10https://gerrit.wikimedia.org/r/657903 (https://phabricator.wikimedia.org/T266479) (owner: 10Legoktm) [20:28:13] (03CR) 10Legoktm: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27629/" [puppet] - 10https://gerrit.wikimedia.org/r/655795 (https://phabricator.wikimedia.org/T266479) (owner: 10Legoktm) [20:32:15] RECOVERY - PHP7 rendering on mw2332 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:33:41] (03PS1) 10Legoktm: udp2log: Switch require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/657904 (https://phabricator.wikimedia.org/T266479) [20:34:29] RECOVERY - Apache HTTP on mw2330 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.150 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:35:16] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27631/console" [puppet] - 10https://gerrit.wikimedia.org/r/657904 (https://phabricator.wikimedia.org/T266479) (owner: 10Legoktm) [20:37:23] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:38:03] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.033 second response time on 10.192.48.56 port 9042 https://phabricator.wikimedia.org/T93886 [20:49:31] (03PS1) 10Ottomata: [WIP] eventgate - Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237) [20:50:51] (03PS2) 10Ottomata: [WIP] eventgate - Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237) [20:55:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1268.eqiad.wmnet'] ` an... 
[20:59:17] 10SRE, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10Dwisehaupt) 05Open→03Resolved [21:05:44] (03PS1) 10Ryan Kemper: wdqs: use envoy listener for wdqs-internal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657910 [21:07:02] (03CR) 10jerkins-bot: [V: 04-1] wdqs: use envoy listener for wdqs-internal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657910 (owner: 10Ryan Kemper) [21:12:45] (03CR) 10Urbanecm: Create Contact page for Ombuds commission at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) (owner: 10Luke081515) [21:13:20] (03CR) 10Urbanecm: [C: 03+1] "sounds good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) (owner: 10Luke081515) [21:25:14] (03Abandoned) 10Ryan Kemper: wdqs: use envoy listener for wdqs-internal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657910 (owner: 10Ryan Kemper) [21:46:05] (03PS1) 10Ryan Kemper: wdqs: use envoy for wdqs-internal [puppet] - 10https://gerrit.wikimedia.org/r/657913 [21:56:36] (03PS1) 10Urbanecm: Revert "Revert "[enwiki] Update celebration logo to "option A""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657914 (https://phabricator.wikimedia.org/T272747) [21:58:14] (03PS2) 10Urbanecm: Revert "Revert "[enwiki] Update celebration logo to "option A""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657914 (https://phabricator.wikimedia.org/T272526) [22:05:39] (03PS1) 10Legoktm: snapshot: Switch require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/657916 (https://phabricator.wikimedia.org/T266479) [22:08:07] PROBLEM - mediawiki-installation DSH group on mw2334 is CRITICAL: Host mw2334 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:08:23] (03CR) 10Legoktm: [V: 03+1] "PCC 
SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27634/console" [puppet] - 10https://gerrit.wikimedia.org/r/657916 (https://phabricator.wikimedia.org/T266479) (owner: 10Legoktm) [22:13:49] (03PS1) 10Legoktm: superset: Switch require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/657917 (https://phabricator.wikimedia.org/T266479) [22:14:40] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27635/console" [puppet] - 10https://gerrit.wikimedia.org/r/657917 (https://phabricator.wikimedia.org/T266479) (owner: 10Legoktm) [22:21:01] 10SRE, 10Beta-Cluster-Infrastructure, 10Wikidata, 10serviceops, and 3 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (10Addshore) [22:26:41] PROBLEM - mediawiki-installation DSH group on mw1268 is CRITICAL: Host mw1268 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:27:21] PROBLEM - mediawiki-installation DSH group on mw1413 is CRITICAL: Host mw1413 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:30:41] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:19] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:38:39] PROBLEM - mediawiki-installation DSH group on mw2332 is CRITICAL: Host mw2332 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:40:20] 10SRE, 10MW-on-K8s, 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (10thcipriani) 05Open→03Resolved [22:40:24] 10SRE, 10MW-on-K8s, 10Shellbox, 10serviceops, and 3 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10thcipriani) [22:41:39] !log reedy@deploy1001 Synchronized invalid.json: (no justification provided) (duration: 00m 58s) [22:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:47] * Reedy looks at legoktm [22:41:56] uhoh [22:42:08] using sync-file? [22:42:11] yeah [22:42:28] I'm not gonna run scap all of the things [22:42:36] right [22:42:42] but that's a bad regression [22:42:53] one more for luck [22:43:22] legoktm: definitely works for PHP [22:43:31] CalledProcessError: Command 'find -O2 '/srv/mediawiki-staging/invalid.php' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P30 -exec php -l >/dev/null 2>&1' returned non-zero exit status 124 [22:43:31] 22:43:16 sync-file failed: Command 'find -O2 '/srv/mediawiki-staging/invalid.php' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P30 -exec php -l >/dev/null 2>&1' returned non-zero exit status 124 [22:43:38] crap error message, sure, but bails [22:44:03] PROBLEM - mediawiki-installation DSH group on mw2328 is CRITICAL: Host mw2328 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:47:52] >>> list(os.walk('invalid.json')) [22:47:52] [] [22:51:12] que? 
[22:51:17] patch incoming [22:54:14] bugsbugsbugs [23:02:46] https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/657921/ [23:31:21] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:14] (03Abandoned) 10Urbanecm: Revert "Revert "[enwiki] Update celebration logo to "option A""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657914 (https://phabricator.wikimedia.org/T272526) (owner: 10Urbanecm) [23:38:11] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:47] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:05] 10SRE, 10Wikidata, 10wdwb-tech-focus: entity/Q64 was indexed in Google, should it have been? - https://phabricator.wikimedia.org/T227246 (10Addshore) [23:54:45] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
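Note on the 22:47:52 REPL check above: `os.walk()` called on a path that is a regular file yields nothing, because the default `onerror=None` silently ignores the underlying "not a directory" error. Any check that discovers files only by walking would therefore skip a single-file target such as `invalid.json`. The following is a minimal sketch of that behaviour and one possible guard. The helper name `iter_files` is hypothetical, and this is not scap's actual code nor necessarily what the linked patch does.

```python
import os
import tempfile


def iter_files(path):
    """Yield file paths under `path`.

    Hypothetical helper: treats a plain file as a one-item tree, since
    os.walk() yields nothing when handed a file rather than a directory.
    """
    if os.path.isfile(path):
        yield path
        return
    for root, _dirs, files in os.walk(path):
        for name in files:
            yield os.path.join(root, name)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "invalid.json")
    with open(target, "w") as fh:
        fh.write("{ not valid json")

    # Walking the file path directly finds nothing, as in the log above.
    walked = list(os.walk(target))
    # The guarded helper still reports the single file.
    found = list(iter_files(target))
    # Walking the containing directory works as usual.
    found_in_dir = list(iter_files(tmp))

print(walked)
print([os.path.basename(p) for p in found])
```

With a guard of this kind, a linter that iterates over `iter_files(target)` would see the file whether the sync target is a directory or a single file.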