[00:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201118T0000).
[00:00:04] hmonroy and tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:16] o/
[00:01:06] o/
[00:01:11] tgr_: want to deploy?
[00:01:35] I can deploy, sure
[00:01:55] It's yours then :)
[00:02:48] (PS3) Gergő Tisza: Enable watchlist expiry feature on Wikidata & Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/641250 (https://phabricator.wikimedia.org/T266874) (owner: HMonroy)
[00:04:38] (CR) Dzahn: "otrs1001: systemctl list-timers" [puppet] - https://gerrit.wikimedia.org/r/637038 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[00:06:02] Operations, Anti-Harassment, Trust-and-Safety, User-DannyS712: Grant checkuser rights to DannyS712 on testwiki - https://phabricator.wikimedia.org/T268090 (DannyS712)
[00:07:45] (CR) Gergő Tisza: [C: +2] Enable watchlist expiry feature on Wikidata & Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/641250 (https://phabricator.wikimedia.org/T266874) (owner: HMonroy)
[00:08:36] (Merged) jenkins-bot: Enable watchlist expiry feature on Wikidata & Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/641250 (https://phabricator.wikimedia.org/T266874) (owner: HMonroy)
[00:10:14] (PS2) Gergő Tisza: GrowthExperiments: Enable help panel top-posting on ruwiki, svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/641282
[00:11:17] Operations, Anti-Harassment, Trust-and-Safety, User-DannyS712: Grant checkuser rights to DannyS712 on testwiki - https://phabricator.wikimedia.org/T268090 (DannyS712) @Trust and safety - if this is approved, can you please let me know (either here or via email) the specific guidelines that would...
[00:11:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:12:40] (CR) Gergő Tisza: [C: +2] GrowthExperiments: Enable help panel top-posting on ruwiki, svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/641282 (owner: Gergő Tisza)
[00:13:24] (Merged) jenkins-bot: GrowthExperiments: Enable help panel top-posting on ruwiki, svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/641282 (owner: Gergő Tisza)
[00:14:08] tgr_: is T266874 live on one of the wmdebug servers yet?
[00:14:09] T266874: Watchlist Expiry: enable the feature on Wikidata & Commons [NOV 17] - https://phabricator.wikimedia.org/T266874
[00:14:43] musikanimal: it is now, on 1001
[00:18:59] (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/641528 (owner: Herron)
[00:21:25] RECOVERY - MariaDB Replica Lag: pc1 on pc2010 is OK: OK slave_sql_lag Replication lag: 19.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:22:54] tgr_: thanks, looks good!
[00:27:10] (CR) Gergő Tisza: [C: +2] Suggested edits: Guard against empty topic data [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - https://gerrit.wikimedia.org/r/641295 (https://phabricator.wikimedia.org/T268015) (owner: Kosta Harlan)
[00:27:18] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:641250|Enable watchlist expiry feature on Wikidata & Commons (T266874)]] (duration: 01m 03s)
[00:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:27] T266874: Watchlist Expiry: enable the feature on Wikidata & Commons [NOV 17] - https://phabricator.wikimedia.org/T266874
[00:27:37] musikanimal: great! it's live
[00:27:50] ty!
[00:28:49] tgr_: Thank you!
[00:37:55] (Merged) jenkins-bot: Suggested edits: Guard against empty topic data [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - https://gerrit.wikimedia.org/r/641295 (https://phabricator.wikimedia.org/T268015) (owner: Kosta Harlan)
[00:52:28] !log tgr@deploy1001 Synchronized php-1.36.0-wmf.18/extensions/GrowthExperiments/includes/HomepageModules/SuggestedEdits.php: Backport: [[gerrit:641295|Suggested edits: Guard against empty topic data (T268015)]] (duration: 01m 07s)
[00:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:36] T268015: PHP Notice: Undefined index: society - https://phabricator.wikimedia.org/T268015
[00:53:02] !log also deployed [[gerrit:641294|Suggested Edits: Guard against task type not existing (T268012)]]
[00:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:08] T268012: Call to a member function getDifficulty() on null - https://phabricator.wikimedia.org/T268012
[00:58:06] (PS1) Dzahn: planet: let systemd timer for each language run at random minute [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138)
[00:58:39] (CR) jerkins-bot: [V: -1] planet: let systemd timer for each language run at random minute [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[01:02:28] (PS2) Dzahn: planet: let systemd timer for each language run at random minute [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138)
[01:02:58] (CR) jerkins-bot: [V: -1] planet: let systemd timer for each language run at random minute [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[01:07:08] (PS3) Dzahn: planet: let systemd timer for each language run at random minute [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138)
[01:09:18] (PS4) Dzahn: planet: let systemd timer for each language run at random minute [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138)
[01:21:45] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:21:51] Operations, LDAP-Access-Requests, Patch-For-Review: Add STran to `wmf` LDAP group - https://phabricator.wikimedia.org/T267968 (STran) @herron I would like access so I can +2 and merge PRs as part of my responsibilities as a software engineer on the anti-harassment tools team. Thanks!
[01:21:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:23:42] (PS5) Dzahn: planet: let systemd timer for each language run at random minute [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138)
[01:24:11] (CR) jerkins-bot: [V: -1] planet: let systemd timer for each language run at random minute [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[01:25:40] (PS6) Dzahn: planet: let systemd timer for each language run at random minute [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138)
[01:35:11] (PS7) Dzahn: planet: let systemd timer for each language run at random minute [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138)
[01:39:39] Operations, Wikimedia-Mailing-lists: Wikimania-com mailing list: who’s the admin ? - https://phabricator.wikimedia.org/T268031 (Dzahn) Open→Resolved a:Dzahn Cool, thanks for confirming. I'm calling it resolved then.
[01:40:16] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/26497/planet1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[01:44:05] (CR) Dzahn: "NEXT LEFT LAST PASSED UNIT" [puppet] - https://gerrit.wikimedia.org/r/641579 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[02:00:25] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:21] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:59:07] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:53:25] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (seaborgium, ...), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:45:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1018 (re)pooling @ 10%: Slowly pool es1018 after cloning es1032 T261717', diff saved to https://phabricator.wikimedia.org/P13306 and previous config saved to /var/cache/conftool/dbconfig/20201118-054542-root.json
[05:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:51] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[05:47:14] !log Run check table on enwiki on db1124:3311 T267090
[05:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:22] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090
[05:50:33] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:51:12] (PS1) Marostegui: instances.yaml: Add es1032 [puppet] - https://gerrit.wikimedia.org/r/641616 (https://phabricator.wikimedia.org/T261717)
[05:51:35] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:52:21] (CR) Marostegui: [C: +2] instances.yaml: Add es1032 [puppet] - https://gerrit.wikimedia.org/r/641616 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[05:54:32] (PS1) Marostegui: es1032: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/641617 (https://phabricator.wikimedia.org/T261717)
[06:00:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1018 (re)pooling @ 25%: Slowly pool es1018 after cloning es1032 T261717', diff saved to https://phabricator.wikimedia.org/P13307 and previous config saved to /var/cache/conftool/dbconfig/20201118-060045-root.json
[06:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:53] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:01:46] (CR) Marostegui: [C: +2] es1032: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/641617 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:06:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es1032 with minimum weight on es1 T261717', diff saved to https://phabricator.wikimedia.org/P13308 and previous config saved to /var/cache/conftool/dbconfig/20201118-060641-marostegui.json
[06:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:49] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:09:12] (PS1) Marostegui: instances.yaml: Remove es1011 from dbctl [puppet] - https://gerrit.wikimedia.org/r/641624
[06:10:20] (CR) Marostegui: [C: +2] instances.yaml: Remove es1011 from dbctl [puppet] - https://gerrit.wikimedia.org/r/641624 (owner: Marostegui)
[06:11:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es1011 from dbctl', diff saved to https://phabricator.wikimedia.org/P13309 and previous config saved to /var/cache/conftool/dbconfig/20201118-061112-marostegui.json
[06:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1027 as new es1 master', diff saved to https://phabricator.wikimedia.org/P13310 and previous config saved to /var/cache/conftool/dbconfig/20201118-061218-marostegui.json
[06:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1014 before decommissioning it', diff saved to https://phabricator.wikimedia.org/P13311 and previous config saved to /var/cache/conftool/dbconfig/20201118-061340-marostegui.json
[06:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:49] (PS1) Marostegui: instances.yaml: Remove es1014 from dbctl [puppet] - https://gerrit.wikimedia.org/r/641627
[06:15:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1018 (re)pooling @ 50%: Slowly pool es1018 after cloning es1032 T261717', diff saved to https://phabricator.wikimedia.org/P13312 and previous config saved to /var/cache/conftool/dbconfig/20201118-061549-root.json
[06:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:57] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:24:07] (CR) Marostegui: [C: +2] instances.yaml: Remove es1014 from dbctl [puppet] - https://gerrit.wikimedia.org/r/641627 (owner: Marostegui)
[06:25:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es1014 from dbctl', diff saved to https://phabricator.wikimedia.org/P13313 and previous config saved to /var/cache/conftool/dbconfig/20201118-062547-marostegui.json
[06:25:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: Slowly pool es1032 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13314 and previous config saved to /var/cache/conftool/dbconfig/20201118-062551-root.json
[06:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:00] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:29:18] (PS12) Ryan Kemper: Bring 3 new eqiad wdqs nodes into service [puppet] - https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345)
[06:30:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1018 (re)pooling @ 75%: Slowly pool es1018 after cloning es1032 T261717', diff saved to https://phabricator.wikimedia.org/P13315 and previous config saved to /var/cache/conftool/dbconfig/20201118-063052-root.json
[06:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:40] (CR) Ryan Kemper: Bring 3 new eqiad wdqs nodes into service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345) (owner: Ryan Kemper)
[06:32:42] (CR) Ryan Kemper: [C: +2] Bring 3 new eqiad wdqs nodes into service [puppet] - https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345) (owner: Ryan Kemper)
[06:33:21] (PS13) Ryan Kemper: Bring 3 new eqiad wdqs nodes into service [puppet] - https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345)
[06:37:49] !log restart kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002 - consumer msg rate low since kafka-main2003 went down for codfw c7 failure
[06:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 20%: Slowly pool es1032 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13316 and previous config saved to /var/cache/conftool/dbconfig/20201118-064054-root.json
[06:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:02] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:45:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1018 (re)pooling @ 100%: Slowly pool es1018 after cloning es1032 T261717', diff saved to https://phabricator.wikimedia.org/P13317 and previous config saved to /var/cache/conftool/dbconfig/20201118-064556-root.json
[06:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:03] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:53:53] !log restart also mirror maker on kafka-main1001/1003 (seems not related but just to clear old errors and a possible weird state)
[06:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:45] (PS1) Marostegui: db-eqiad.php: Depool pc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/641629 (https://phabricator.wikimedia.org/T266483)
[06:55:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: Slowly pool es1032 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13318 and previous config saved to /var/cache/conftool/dbconfig/20201118-065558-root.json
[06:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:05] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:08:49] PROBLEM - Query Service HTTP Port on wdqs1011 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:10:09] RECOVERY - Query Service HTTP Port on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 449 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:11:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 30%: Slowly pool es1032 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13319 and previous config saved to /var/cache/conftool/dbconfig/20201118-071101-root.json
[07:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:09] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:14:30] (CR) ArielGlenn: [C: +1] "Looking good!" [puppet] - https://gerrit.wikimedia.org/r/641451 (https://phabricator.wikimedia.org/T267575) (owner: Milimetric)
[07:16:02] !log Run check table on s6 on db1125:3316 T267090
[07:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:09] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090
[07:26:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: Slowly pool es1032 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13320 and previous config saved to /var/cache/conftool/dbconfig/20201118-072605-root.json
[07:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:12] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:28:33] !log Start of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=nlwiki; T246539)
[07:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:39] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539
[07:40:46] (CR) Urbanecm: [C: +1] Regenerate Bengali Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/640816 (https://phabricator.wikimedia.org/T265553) (owner: Zoranzoki21)
[07:41:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 60%: Slowly pool es1032 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13321 and previous config saved to /var/cache/conftool/dbconfig/20201118-074108-root.json
[07:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:15] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:42:16] (CR) Urbanecm: [C: -1] "This doesn't actually add the import sources, perhaps you wish to amend the previous commit?" [mediawiki-config] - https://gerrit.wikimedia.org/r/640290 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[07:43:13] (CR) Urbanecm: [C: -1] Add wgImportSources for zhwikinews (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[07:43:17] (CR) Urbanecm: [C: -1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[07:44:22] (CR) jerkins-bot: [V: -1] Add wgImportSources for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[07:45:19] !log Deploy schema change on db1098:3316 T267335 T267399
[07:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:26] T267399: Drop default of ip_changes.ipc_rev_timestamp - https://phabricator.wikimedia.org/T267399
[07:45:27] T267335: Drop default of protected_titles.pt_expiry - https://phabricator.wikimedia.org/T267335
[07:45:55] Operations, Desktop Improvements, Product-Infrastructure-Team-Backlog, Proton, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (Urbanecm)
[07:47:16] (PS1) Marostegui: control-mariadb-*: Update version to 10.4.17 [software] - https://gerrit.wikimedia.org/r/641630
[07:48:03] (PS2) Marostegui: control-mariadb-*: Update version to 10.4.17 [software] - https://gerrit.wikimedia.org/r/641630
[07:49:01] (CR) Marostegui: [C: +2] control-mariadb-*: Update version to 10.4.17 [software] - https://gerrit.wikimedia.org/r/641630 (owner: Marostegui)
[07:49:36] (Merged) jenkins-bot: control-mariadb-*: Update version to 10.4.17 [software] - https://gerrit.wikimedia.org/r/641630 (owner: Marostegui)
[07:56:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: Slowly pool es1032 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13322 and previous config saved to /var/cache/conftool/dbconfig/20201118-075612-root.json
[07:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:19] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[08:04:00] (CR) Muehlenhoff: [C: +1] admin: create ldap_only entry for ijethrobt [puppet] - https://gerrit.wikimedia.org/r/641465 (https://phabricator.wikimedia.org/T267962) (owner: Herron)
[08:04:44] (CR) Muehlenhoff: [C: +1] admin: create ldap_only account for stran [puppet] - https://gerrit.wikimedia.org/r/641463 (https://phabricator.wikimedia.org/T267968) (owner: Herron)
[08:06:29] (CR) Muehlenhoff: [C: -1] "The user is not in the list of WMDE employees in the tracking sheet, needs to sign the NDA with Legal first" [puppet] - https://gerrit.wikimedia.org/r/641509 (https://phabricator.wikimedia.org/T267771) (owner: Herron)
[08:07:13] (CR) Muehlenhoff: [C: +1] admin: add ldap_only_entry for tobias-schumann-wmde-ext [puppet] - https://gerrit.wikimedia.org/r/641510 (https://phabricator.wikimedia.org/T267917) (owner: Herron)
[08:09:07] (CR) Muehlenhoff: [C: -1] "Needs to sign an NDA with Legal, not in the tracking sheet." [puppet] - https://gerrit.wikimedia.org/r/641508 (https://phabricator.wikimedia.org/T267744) (owner: Herron)
[08:11:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 80%: Slowly pool es1032 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13323 and previous config saved to /var/cache/conftool/dbconfig/20201118-081115-root.json
[08:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:24] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[08:14:28] I will reset-failed the regular_snapshot timer, it is an expected failure due to T261405
[08:14:28] T261405: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405
[08:14:40] (CR) Muehlenhoff: [C: +1] "The" [puppet] - https://gerrit.wikimedia.org/r/641471 (https://phabricator.wikimedia.org/T267314) (owner: Herron)
[08:14:55] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:00] on deneb, that it is failing is package_builder_Clean_up_build_directory.service
[08:16:10] I will let that for someone else
[08:19:03] jynus: I'll have a look in a bit
[08:19:21] yeah, no rush
[08:19:51] the other thing I am aware is LDAP servers backups failing in the last 2 days
[08:20:01] I will research more
[08:24:07] "Puppet CA: palladium.eqiad.wmnet, subject = /CN=serpens.wikimedia.org, ERR=10:certificate has expired"
[08:26:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: Slowly pool es1032 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13324 and previous config saved to /var/cache/conftool/dbconfig/20201118-082618-root.json
[08:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:27] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[08:26:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1012 before decommissioning it', diff saved to https://phabricator.wikimedia.org/P13325 and previous config saved to /var/cache/conftool/dbconfig/20201118-082636-marostegui.json
[08:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:40] (PS1) Marostegui: instances.yaml: Remove es1012 from dbctl [puppet] - https://gerrit.wikimedia.org/r/641687 (https://phabricator.wikimedia.org/T268101)
[08:28:35] (CR) Marostegui: [C: +2] instances.yaml: Remove es1012 from dbctl [puppet] - https://gerrit.wikimedia.org/r/641687 (https://phabricator.wikimedia.org/T268101) (owner: Marostegui)
[08:29:06] Operations, ops-eqiad: Interface errors on cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T267672 (ayounsi) Resolved→Open They're back...
[08:29:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es1012 from dbctl T268101', diff saved to https://phabricator.wikimedia.org/P13326 and previous config saved to /var/cache/conftool/dbconfig/20201118-082942-marostegui.json
[08:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:49] T268101: decommission es1012.eqiad.wmnet - https://phabricator.wikimedia.org/T268101
[08:31:19] (PS1) Marostegui: mariadb: Disable notifications on es1011,es1012,es1014. [puppet] - https://gerrit.wikimedia.org/r/641688 (https://phabricator.wikimedia.org/T268100)
[08:31:52] (CR) Ayounsi: [C: +2] Cable report, log VC links with no ID as warning only [software/netbox-extras] - https://gerrit.wikimedia.org/r/641458 (owner: Ayounsi)
[08:32:22] (CR) Marostegui: [C: +2] mariadb: Disable notifications on es1011,es1012,es1014. [puppet] - https://gerrit.wikimedia.org/r/641688 (https://phabricator.wikimedia.org/T268100) (owner: Marostegui)
[08:34:47] !log Stop MySQL on es1011, es1012, es1014 T268100 T268101 T268102
[08:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:56] T268101: decommission es1012.eqiad.wmnet - https://phabricator.wikimedia.org/T268101
[08:34:56] T268100: decommission es1011.eqiad.wmnet - https://phabricator.wikimedia.org/T268100
[08:34:56] T268102: decommission es1014.eqiad.wmnet - https://phabricator.wikimedia.org/T268102
[08:36:44] Operations, Data-Persistence-Backup, LDAP: Backup failures on seaborgium, serpens (LDAP servers) - https://phabricator.wikimedia.org/T268104 (jcrespo)
[08:38:37] Operations, Data-Persistence-Backup, LDAP: Backup failures on seaborgium, serpens (LDAP servers) - https://phabricator.wikimedia.org/T268104 (jcrespo) Adding @jbond @MoritzMuehlenhoff as it could be a simple puppet certificate issue regarding puppetmasters changes, and they may know more about that...
[08:42:10] Critical Alert for device asw-c-codfw.mgmt.codfw.wmnet - Juniper alarm active
[08:42:32] Operations, ops-eqiad, DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) I will put the server back up temporarily for some hours so it catches up and we can generate a full backup before the maintenance.
[08:48:54] Operations, ops-codfw: asw-c7-codfw: fan alarm - https://phabricator.wikimedia.org/T268105 (ayounsi) p:Triage→High
[08:48:59] (PS1) Jcrespo: Revert "mariadb: Reduce memory consumption of mariadb@s6 while hw degraded" [puppet] - https://gerrit.wikimedia.org/r/641498
[08:49:41] (CR) Jcrespo: [C: -2] "No deploy until T261405 is overcome." [puppet] - https://gerrit.wikimedia.org/r/641498 (owner: Jcrespo)
[08:56:30] Operations, Data-Persistence-Backup, LDAP: Backup failures on seaborgium, serpens (LDAP servers) - https://phabricator.wikimedia.org/T268104 (jcrespo) I also noted the following warning on seaborgium, serpens **AND sretest1002**. Related? WARNING: Failed to apply catalog, zero resources tracked by Pu...
[09:06:00] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:11:36] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:13:48] !log renew puppet certificate of seaborgium
[09:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:14] (CR) Vgutierrez: "this patch makes sense but it isn't enough, we need to add pywikibot.org to the list of ncredir domains handled to acme-chief https://gerr" [dns] - https://gerrit.wikimedia.org/r/634928 (https://phabricator.wikimedia.org/T257536) (owner: Ladsgroup)
[09:16:53] (CR) David Caro: [C: +2] nagios.cgi: add nskaggs to info auth groups [puppet] - https://gerrit.wikimedia.org/r/640388 (https://phabricator.wikimedia.org/T266068) (owner: David Caro)
[09:22:16] (CR) Elukey: [C: +2] kerberos: add dns_canonicalize_hostname = false to clients [puppet] - https://gerrit.wikimedia.org/r/640100 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[09:22:35] !log set dns_canonicalize_hostname = false to all kerberos clients
[09:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:53] (CR) Filippo Giunchedi: [C: +1] aptrepo: add elastic710 component [puppet] - https://gerrit.wikimedia.org/r/641528 (owner: Herron)
[09:26:59] (PS1) Jbond: Revert "home/klausman: Add vim-go" [puppet] - https://gerrit.wikimedia.org/r/641499
[09:27:29] (CR) jerkins-bot: [V: -1] Revert "home/klausman: Add vim-go" [puppet] - https://gerrit.wikimedia.org/r/641499 (owner: Jbond)
[09:29:08] (PS2) Jbond: Revert "home/klausman: Add vim-go" [puppet] - https://gerrit.wikimedia.org/r/641499
[09:36:07] (PS1) Filippo Giunchedi: wikimedia.org: add alertmanager A records for API [dns] - https://gerrit.wikimedia.org/r/641697 (https://phabricator.wikimedia.org/T266017)
[09:36:42] Operations, ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T268071 (fgiunchedi)
[09:36:50] (CR) jerkins-bot: [V: -1] wikimedia.org: add alertmanager A records for API [dns] - https://gerrit.wikimedia.org/r/641697 (https://phabricator.wikimedia.org/T266017) (owner: Filippo Giunchedi)
[09:37:49] Operations, ops-eqiad, SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (fgiunchedi)
[09:37:49] Operations, ops-eqiad, DC-Ops, Patch-For-Review: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (elukey) @wiki_willy Hi! Do we have a timeline on how much time it will take to move the hosts to free space for the new hadoop worker nodes? I am asking...
[09:37:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] ores: Drop all precaching puppet roles for labs [puppet] - 10https://gerrit.wikimedia.org/r/641533 (owner: 10Ladsgroup) [09:40:37] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:40:53] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:44:47] (03PS2) 10Filippo Giunchedi: wikimedia.org: add alertmanager A records for API [dns] - 10https://gerrit.wikimedia.org/r/641697 (https://phabricator.wikimedia.org/T266017) [09:45:24] (03CR) 10jerkins-bot: [V: 04-1] wikimedia.org: add alertmanager A records for API [dns] - 10https://gerrit.wikimedia.org/r/641697 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [09:49:31] (03PS1) 10Alexandros Kosiaris: network: Add kubesvc IPv6 service ranges [puppet] - 10https://gerrit.wikimedia.org/r/641698 [09:51:48] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [09:54:20] (03CR) 10Klausman: [C: 03+2] Revert "home/klausman: Add vim-go" [puppet] - 10https://gerrit.wikimedia.org/r/641499 (owner: 10Jbond) [09:55:17] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) Running one last check over all section instances to confirm the change has been made everywhere. 
[09:56:34] !log uploaded libexif 0.6.21-2+deb8u4+wmf1 to jessie-wikimedia [09:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:44] !log eqiad row D: Standardize interfaces descriptions [10:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:43] (03PS3) 10Filippo Giunchedi: wikimedia.org: add alertmanager records for API [dns] - 10https://gerrit.wikimedia.org/r/641697 (https://phabricator.wikimedia.org/T266017) [10:04:31] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] network: Add kubesvc IPv6 service ranges [puppet] - 10https://gerrit.wikimedia.org/r/641698 (owner: 10Alexandros Kosiaris) [10:06:27] whatttt [10:07:05] replicate_krb_database[8397]: /usr/sbin/kprop: Key table entry not found while getting initial credentials [10:07:29] PROBLEM - Check the last execution of replicate-krb-database on krb1001 is CRITICAL: CRITICAL: Status of the systemd unit replicate-krb-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:08:12] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10akosiaris) p:05Low→03High More and more duplicates are being merged into this one and stats from tests... [10:08:15] sigh [10:08:26] elukey: that's the spirit! 
[10:11:53] (03PS1) 10Alexandros Kosiaris: Add k8s-staging-codfw cluster [homer/public] - 10https://gerrit.wikimedia.org/r/641704 [10:13:48] (03PS1) 10Filippo Giunchedi: alertmanager: add aliases for API vhost [puppet] - 10https://gerrit.wikimedia.org/r/641705 (https://phabricator.wikimedia.org/T266017) [10:16:39] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:09] (03PS1) 10Jbond: sretest: drop P::docker::firewall as it requires buster [puppet] - 10https://gerrit.wikimedia.org/r/641706 (https://phabricator.wikimedia.org/T268104) [10:17:31] 10Operations, 10Data-Persistence-Backup, 10LDAP, 10Patch-For-Review: Backup failures on seaborgium, serpens (LDAP servers) - https://phabricator.wikimedia.org/T268104 (10jbond) >>! In T268104#6629682, @jcrespo wrote: > I also noted the following warning on seaborgium, serpens **AND sretest1002**. Related?... [10:17:47] PROBLEM - Check systemd state on kubernetes1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:17] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26499/console" [puppet] - 10https://gerrit.wikimedia.org/r/641705 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [10:18:30] 10Operations, 10Data-Persistence-Backup, 10LDAP, 10Patch-For-Review: Backup failures on seaborgium, serpens (LDAP servers) - https://phabricator.wikimedia.org/T268104 (10jcrespo) Let me do a test backup and I will close this as resolved, if you are ok with it. 
[10:21:59] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:23:42] 10Operations, 10Data-Persistence-Backup, 10LDAP, 10Patch-For-Review: Backup failures on seaborgium, serpens (LDAP servers) - https://phabricator.wikimedia.org/T268104 (10jcrespo) Backup still fails, for the same reasons: ` 18-Nov 10:19 backup1001.eqiad.wmnet JobId 282470: Start Backup JobId 282470, Job=se... [10:23:47] (03CR) 10Muehlenhoff: "Please keep sretest1002 on stretch, so that we can do tests with kernels/microcode etc. on a stretch system as well" [puppet] - 10https://gerrit.wikimedia.org/r/641706 (https://phabricator.wikimedia.org/T268104) (owner: 10Jbond) [10:25:50] !log ms-be1022 - disable failed sdb [10:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:31] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Proton [10:29:17] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [10:30:39] PROBLEM - MD RAID on ms-be1022 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:30:40] ACKNOWLEDGEMENT - MD RAID on ms-be1022 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T268123 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:30:43] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T268123 (10ops-monitoring-bot) [10:31:14] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T268123 (10fgiunchedi) This was me manually failing sdb [10:31:26] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T268123 (10fgiunchedi) [10:31:28] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10fgiunchedi) [10:33:15] (03PS2) 10Filippo Giunchedi: alertmanager: add aliases for API vhost [puppet] - 10https://gerrit.wikimedia.org/r/641705 (https://phabricator.wikimedia.org/T266017) [10:33:41] 10Operations, 10Data-Persistence-Backup, 10LDAP, 10Patch-For-Review: Backup failures on seaborgium, serpens (LDAP servers) - https://phabricator.wikimedia.org/T268104 (10jcrespo) 05Open→03Resolved a:03jbond It needed a bacula-fd client daemon restart. ` 282473 Incr 2 28.74 M OK... 
[10:35:00] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26500/console" [puppet] - 10https://gerrit.wikimedia.org/r/641705 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [10:35:28] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] alertmanager: add aliases for API vhost [puppet] - 10https://gerrit.wikimedia.org/r/641705 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [10:35:37] (03PS3) 10Filippo Giunchedi: alertmanager: add aliases for API vhost [puppet] - 10https://gerrit.wikimedia.org/r/641705 (https://phabricator.wikimedia.org/T266017) [10:38:21] (03PS1) 10Hashar: Fix NewcomerTasksCacheRefreshJob [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641501 (https://phabricator.wikimedia.org/T268008) [10:39:44] (03CR) 10Hashar: [C: 03+1] "That is a straightforward one. I had a similar issue in a previous train." [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641501 (https://phabricator.wikimedia.org/T268008) (owner: 10Hashar) [10:44:39] RECOVERY - Check systemd state on kubernetes1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:32] (03PS1) 10Alexandros Kosiaris: Remove auth DNS servers from network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/641708 [10:46:39] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:51] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:48:52] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[10:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:17] RECOVERY - Check the last execution of replicate-krb-database on krb1001 is OK: OK: Status of the systemd unit replicate-krb-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:51:06] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:51:06] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:37] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:52:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] OpenStack: add client packages for Stein [puppet] - 10https://gerrit.wikimedia.org/r/641233 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [10:53:27] 10Operations, 10ops-eqiad: Invalid port info on asw2-d-eqiad - https://phabricator.wikimedia.org/T268125 (10ayounsi) [10:54:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I assume this is just a copy&paste from the previous openstack version. 
There might be a bunch of configuration we don't need, or require " [puppet] - 10https://gerrit.wikimedia.org/r/641231 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [10:56:09] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:57:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] OpenStack Designate: updates for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641232 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [10:58:54] !log renew sretest1002 ssl cert to test cookbook [10:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:26] !log jbond@cumin1001 START - Cookbook sre.puppet.renew-cert [10:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:34] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) [10:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:05] (03PS1) 10Jbond: sre.puppet.renew-cert: fix arguments for puppet_master commands [cookbooks] - 10https://gerrit.wikimedia.org/r/641710 [11:19:14] (03PS1) 10JMeybohm: kubernetes: Add profile to install addon-manager on masters [puppet] - 10https://gerrit.wikimedia.org/r/641711 (https://phabricator.wikimedia.org/T267653) [11:19:53] (03CR) 10JMeybohm: "See I6a352e0c2648feaa4990ad43b118ee501d7f7d21 for the package" [puppet] - 10https://gerrit.wikimedia.org/r/641711 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [11:22:46] (03PS2) 10JMeybohm: kubernetes: Add profile to install addon-manager on masters [puppet] - 10https://gerrit.wikimedia.org/r/641711 (https://phabricator.wikimedia.org/T267653) [11:24:42] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) 05Open→03Resolved Check completed successfully, we're done \o/ 
[11:24:45] (03PS3) 10JMeybohm: kubernetes: Add profile to install addon-manager on masters [puppet] - 10https://gerrit.wikimedia.org/r/641711 (https://phabricator.wikimedia.org/T267653) [11:24:46] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat) [11:25:11] (03PS4) 10JMeybohm: kubernetes: Add profile to install addon-manager on masters [puppet] - 10https://gerrit.wikimedia.org/r/641711 (https://phabricator.wikimedia.org/T267653) [11:25:46] (03PS5) 10JMeybohm: kubernetes: Add profile to install addon-manager on masters [puppet] - 10https://gerrit.wikimedia.org/r/641711 (https://phabricator.wikimedia.org/T267653) [11:28:23] PROBLEM - Device not healthy -SMART- on ms-be1022 is CRITICAL: cluster=swift device=None instance=ms-be1022 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1022&var-datasource=eqiad+prometheus/ops [11:40:30] !log eqiad row D: remove un-needed "enable" keywords [11:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:45] D: [11:41:25] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10BBlack) I haven't been able to repro this on a public endpoint from my own home connection, even using the... 
[11:41:30] (03Abandoned) 10Urbanecm: Add wgImportSources for zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640290 (https://phabricator.wikimedia.org/T266388) (owner: 10Hamish) [11:45:57] (03CR) 10Jbond: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [11:50:51] ACKNOWLEDGEMENT - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms ayounsi https://phabricator.wikimedia.org/T268105 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:51:52] (03PS1) 10Alexandros Kosiaris: calico: Remove statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/641712 [11:51:54] (03PS1) 10Alexandros Kosiaris: recommendation-api: Open up access to 3306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/641713 (https://phabricator.wikimedia.org/T241230) [11:52:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Please do review, but keep in mind this isn't ready to be merged yet, the nodes aren't yet setup." 
[homer/public] - 10https://gerrit.wikimedia.org/r/641704 (owner: 10Alexandros Kosiaris) [11:52:51] (03CR) 10Kormat: [C: 03+1] db-eqiad.php: Depool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641629 (https://phabricator.wikimedia.org/T266483) (owner: 10Marostegui) [11:53:01] jouncebot: next [11:53:01] In 0 hour(s) and 6 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201118T1200) [11:53:19] no changes, nice [11:53:36] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641629 (https://phabricator.wikimedia.org/T266483) (owner: 10Marostegui) [11:54:36] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641629 (https://phabricator.wikimedia.org/T266483) (owner: 10Marostegui) [11:55:59] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool pc1009" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641502 [11:56:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1009 and place pc1010 instead of it T266483 (duration: 01m 18s) [11:56:11] !log End of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=nlwiki; T246539) [11:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:17] !log Restart mysql on pc1009 T266483 [11:56:17] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [11:56:21] (03PS1) 10Jbond: puppet ca: cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) [11:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:23] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [11:56:24] !log Start of 
mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=frwiki; T246539) [11:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:16] 10Operations, 10ops-eqiad: Invalid port info on asw2-d-eqiad - https://phabricator.wikimedia.org/T268125 (10ayounsi) And ge-6/0/6 which points to https://netbox.wikimedia.org/dcim/devices/2140/ [11:57:50] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool pc1009" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641502 (owner: 10Marostegui) [11:57:52] (03CR) 10jerkins-bot: [V: 04-1] puppet ca: cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) (owner: 10Jbond) [11:58:41] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1009" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641502 (owner: 10Marostegui) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201118T1200). [12:00:04] No GERRIT patches in the queue for this window AFAICS. 
[12:00:14] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=blubberoid,name=eqiad [12:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1009 in pc3 after restarting mysql T266483 (duration: 01m 06s) [12:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:31] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad [12:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:31] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:42] hashar: should we deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/641501 ? [12:04:57] Lucas_WMDE: yes, should be good :) [12:05:02] ok \o/ [12:05:15] I don't know how to verify it though beside watching the related log spam [12:05:24] and I am going to have lunch with the kids [12:05:52] maybe kostajh can assist for the GrowthExperiments hot fix [12:06:14] I’ll see if I can reproduce the logspam on mwdebug [12:06:36] !log akosiaris@cumin1001 conftool action : set/ttl=300; selector: dnsdisc=wikifeeds [12:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:43] PROBLEM - Check the last execution of replicate-krb-database on krb1001 is CRITICAL: CRITICAL: Status of the systemd unit replicate-krb-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:07:11] ^ krb1001 is known, not critical and investigated ATM, will ack [12:07:17] hm, if this only happens when the job is instantiated by the executor, and not when it’s being scheduled, then it might not be testable on mwdebug [12:07:47] but the change looks safe enough that I’m 
willing to sync it and see if logspam goes down on the real systems [12:07:57] Lucas_WMDE: I think that's the best bet [12:08:08] ok, adding to calendar [12:08:20] the only thing we can do is to verify Special:Homepage loads correctly [12:08:55] (03CR) 10Urbanecm: [C: 03+2] Regenerate Bengali Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640816 (https://phabricator.wikimedia.org/T265553) (owner: 10Zoranzoki21) [12:09:13] I'm going to sneak in a small config patch, given CI will take ~20 mins for the GE patch [12:09:21] ok [12:09:21] (03CR) 10Urbanecm: [C: 03+2] Fix NewcomerTasksCacheRefreshJob [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641501 (https://phabricator.wikimedia.org/T268008) (owner: 10Hashar) [12:09:26] (though it was more like 5 minutes yesterday iirc) [12:09:42] (03Merged) 10jenkins-bot: Regenerate Bengali Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/640816 (https://phabricator.wikimedia.org/T265553) (owner: 10Zoranzoki21) [12:09:50] interesting [12:11:49] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: 70aabf7ec8e1b549e78978e48967fb70d21316de: Regenerate Bengali Wikipedia logo (T265553) (duration: 01m 06s) [12:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:57] T265553: Regenerate Bengali Wikipedia logo (2) - https://phabricator.wikimedia.org/T265553 [12:13:18] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=releases [12:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:30] !log Purge https://en.wikipedia.org/static/images/project-logos/{bnwiki,bnwiki-1.5x,bnwiki-2x}.png (T265553) [12:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:52] ACKNOWLEDGEMENT - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
Muehlenhoff Only affects kprop replication, under investigation https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:52] ACKNOWLEDGEMENT - Check the last execution of replicate-krb-database on krb1001 is CRITICAL: CRITICAL: Status of the systemd unit replicate-krb-database Muehlenhoff Only affects kprop replication, under investigation https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:20:05] (03Merged) 10jenkins-bot: Fix NewcomerTasksCacheRefreshJob [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641501 (https://phabricator.wikimedia.org/T268008) (owner: 10Hashar) [12:20:16] (03PS1) 10Alexandros Kosiaris: disc_desired_state: Add api-gateway [puppet] - 10https://gerrit.wikimedia.org/r/641715 [12:22:48] hi Lucas_WMDE ! [12:23:02] o/ I think I’m back (stupid vodafone) [12:23:04] * Lucas_WMDE reads log [12:23:22] no log, apart from the fact that the patch merged minutes ago [12:23:25] looks like I didn’t miss too much [12:23:27] ok [12:23:28] I'm fetching it to deployment host [12:24:42] Lucas_WMDE: syncing, please help watching logstash if you may :) [12:24:49] will do :) [12:25:27] RECOVERY - Long running screen/tmux on maps1004 is OK: OK: No SCREEN or tmux processes detected. 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [12:25:31] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.18/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/NewcomerTasksCacheRefreshJob.php: 45d71a37f381e81e5382c8e10ac4063c9665beb8: Fix NewcomerTasksCacheRefreshJob (T268008) (duration: 01m 05s) [12:25:36] looks like the logspam spikes happen every five minutes or so, so we should be able to see the effect soon [12:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:42] T268008: Argument 2 passed to GrowthExperiments\NewcomerTasks\TaskSuggester\CacheDecorator::suggest() must be of the type array, null given, called in /srv/mediawiki/php-1.36.0-wmf.16/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/NewcomerTasksCacheRefreshJob.php on line 35 - https://phabricator.wikimedia.org/T268008 [12:25:46] if I backported both patches... [12:26:01] why both? [12:26:02] it seems to be in wmf.18 too [12:26:06] *wmf.16 [12:26:11] oh, right [12:26:13] hm [12:26:14] at least per title of T268008 [12:26:30] (03PS1) 10Urbanecm: Fix NewcomerTasksCacheRefreshJob [extensions/GrowthExperiments] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/641503 (https://phabricator.wikimedia.org/T268008) [12:26:33] yeah, you’re right, the top normalized_message has wmf.16 in the path [12:26:37] so attempt #2 [12:26:40] (03CR) 10Urbanecm: [C: 03+2] Fix NewcomerTasksCacheRefreshJob [extensions/GrowthExperiments] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/641503 (https://phabricator.wikimedia.org/T268008) (owner: 10Urbanecm) [12:26:50] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove auth DNS servers from network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/641708 (owner: 10Alexandros Kosiaris) [12:27:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: Remove statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/641712 (owner: 10Alexandros Kosiaris) [12:27:09] 
Lucas_WMDE: let me take the waiting time to thank you for creating https://speedpatrolling.toolforge.org/. Seems like a cool tool! [12:27:29] thanks :) [12:27:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] disc_desired_state: Add api-gateway [puppet] - 10https://gerrit.wikimedia.org/r/641715 (owner: 10Alexandros Kosiaris) [12:29:54] (03Merged) 10jenkins-bot: Remove auth DNS servers from network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/641708 (owner: 10Alexandros Kosiaris) [12:29:56] (03Merged) 10jenkins-bot: calico: Remove statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/641712 (owner: 10Alexandros Kosiaris) [12:30:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] recommendation-api: Open up access to 3306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/641713 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [12:32:52] (03PS4) 10Kormat: orchestrator: Require ssl connections to db servers [puppet] - 10https://gerrit.wikimedia.org/r/639765 (https://phabricator.wikimedia.org/T267401) [12:33:21] (03Merged) 10jenkins-bot: recommendation-api: Open up access to 3306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/641713 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [12:33:49] (03CR) 10Kormat: "With the hostname handling changes, this now works correctly in pontoon." [puppet] - 10https://gerrit.wikimedia.org/r/639765 (https://phabricator.wikimedia.org/T267401) (owner: 10Kormat) [12:34:20] (03CR) 10Jcrespo: "yay on tls!" 
[puppet] - 10https://gerrit.wikimedia.org/r/639765 (https://phabricator.wikimedia.org/T267401) (owner: 10Kormat) [12:37:12] (03Merged) 10jenkins-bot: Fix NewcomerTasksCacheRefreshJob [extensions/GrowthExperiments] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/641503 (https://phabricator.wikimedia.org/T268008) (owner: 10Urbanecm) [12:37:34] Lucas_WMDE: second attempt [12:37:52] ack [12:38:02] (03CR) 10Kormat: [C: 03+2] orchestrator: Require ssl connections to db servers [puppet] - 10https://gerrit.wikimedia.org/r/639765 (https://phabricator.wikimedia.org/T267401) (owner: 10Kormat) [12:38:31] (03PS1) 10Alexandros Kosiaris: Bump all _helpers.tpl link to 0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/641716 [12:38:33] (03PS1) 10Alexandros Kosiaris: Bump all default-network-policy-conf.yaml to 0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/641717 [12:39:05] syncing [12:40:08] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.16/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/NewcomerTasksCacheRefreshJob.php: 5488f56c7458fa8fb9be5f41f131e00b26a84cc0: Fix NewcomerTasksCacheRefreshJob (T268008) (duration: 01m 05s) [12:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:15] T268008: Argument 2 passed to GrowthExperiments\NewcomerTasks\TaskSuggester\CacheDecorator::suggest() must be of the type array, null given, called in /srv/mediawiki/php-1.36.0-wmf.16/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/NewcomerTasksCacheRefreshJob.php on line 35 - https://phabricator.wikimedia.org/T268008 [12:42:31] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [12:42:35] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . 
[12:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:39] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [12:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:01] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . [12:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:24] !log sync staging cluster's helmfile.d/admin state. Aside from calico, the rest is a noop [12:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:15] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 16865408 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:45:17] Lucas_WMDE: filtered to the GE exception only, i'd say it worked? https://usercontent.irccloud-cdn.com/file/NzbMCK8h/image.png [12:45:51] yeah, looks good to me \o/ [12:45:57] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1671384 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:46:16] thanks for deploying! [12:46:38] no prob :) [12:48:36] Urbanecm: Lucas_WMDE: thank you for the deployments! [12:48:55] no problem! 
[12:49:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] Bump all _helpers.tpl link to 0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/641716 (owner: 10Alexandros Kosiaris) [12:49:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] Bump all default-network-policy-conf.yaml to 0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/641717 (owner: 10Alexandros Kosiaris) [12:49:46] (03PS1) 10Jbond: puppet: suppress deprecation warnings [software/spicerack] - 10https://gerrit.wikimedia.org/r/641718 [12:51:39] (03Merged) 10jenkins-bot: Bump all _helpers.tpl link to 0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/641716 (owner: 10Alexandros Kosiaris) [12:51:43] (03Merged) 10jenkins-bot: Bump all default-network-policy-conf.yaml to 0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/641717 (owner: 10Alexandros Kosiaris) [12:57:41] 10Operations, 10DBA, 10Orchestrator, 10Patch-For-Review: orchestrator: Use ssl for talking to db servers - https://phabricator.wikimedia.org/T267401 (10Kormat) 05Open→03Resolved a:03Kormat Fixed by https://gerrit.wikimedia.org/r/639765. From the commit description: > The orchestrator docs are a bit... [13:07:32] Urbanecm: I think we forgot to !log the end of the window, btw? [13:07:49] true [13:08:00] !log EU B&C done (~15 minutes ago) [13:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:18] thanks ^^ [13:08:39] logstash still looks great btw [13:08:48] good! [13:13:54] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 4 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10BBlack) I'm not exactly sure as to why the pattern above emerged, but now I don't think it's relevant at al... 
[13:30:06] Phabricator-related request https://www.irccloud.com/pastebin/9CbN5AzG/ [13:30:21] !log installing openldap security updates on corp replicas [13:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:47] (03PS1) 10BBlack: cache_text: nuke_limit and large_objects_cutoff [puppet] - 10https://gerrit.wikimedia.org/r/641724 (https://phabricator.wikimedia.org/T266040) [13:33:04] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/641725 [13:34:09] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [13:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:27] !log cache_text: Executing "varnishadm -n frontend param.set nuke_limit 1000" - T266373 [13:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:34] T266373: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 [13:36:18] (03CR) 10BBlack: [C: 03+2] cache_text: nuke_limit and large_objects_cutoff [puppet] - 10https://gerrit.wikimedia.org/r/641724 (https://phabricator.wikimedia.org/T266040) (owner: 10BBlack) [13:38:34] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 5 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10CDanis) Adding to what Brandon says, we do have evidence that it happens on edge DCs other than just eqiad... 
[13:45:58] (03PS2) 10Jbond: puppet ca: cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) [13:47:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26501/console" [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) (owner: 10Jbond) [13:47:31] (03CR) 10jerkins-bot: [V: 04-1] puppet ca: cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) (owner: 10Jbond) [13:48:28] 10Operations, 10Desktop Improvements, 10Product-Infrastructure-Team-Backlog, 10Proton, and 5 others: Connection closed while downloading PDF of articles - https://phabricator.wikimedia.org/T266373 (10BBlack) The proposed changes are live now. It may take a few hours to confirm that via NEL at our curren... [13:49:56] (03PS1) 10Alexandros Kosiaris: recommendation-api: Bump cpu resource request [deployment-charts] - 10https://gerrit.wikimedia.org/r/641727 [13:51:59] (03PS3) 10Jbond: puppet ca: cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) [13:53:28] (03CR) 10jerkins-bot: [V: 04-1] puppet ca: cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) (owner: 10Jbond) [13:54:17] (03PS1) 10Filippo Giunchedi: thanos: add query-frontend alerts [puppet] - 10https://gerrit.wikimedia.org/r/641729 (https://phabricator.wikimedia.org/T261281) [13:56:36] (03CR) 10Kormat: [C: 03+2] Update redirection of 2030.wikimedia.org with new URI [puppet] - 10https://gerrit.wikimedia.org/r/632552 (https://phabricator.wikimedia.org/T264797) (owner: 10Samuel (WMF)) [13:57:20] abbad: hi. 
i've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/632552 for you [13:58:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] recommendation-api: Bump cpu resource request [deployment-charts] - 10https://gerrit.wikimedia.org/r/641727 (owner: 10Alexandros Kosiaris) [14:00:04] dancy and hashar: (Dis)respected human, time to deploy Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201118T1400). Please do the needful. [14:00:27] (03PS1) 10Ppchelko: Move MW /w/rest.php traffic to api-appserver. [puppet] - 10https://gerrit.wikimedia.org/r/641730 (https://phabricator.wikimedia.org/T268043) [14:01:33] (03CR) 10Ppchelko: "Heads-up: I have absolutely no idea what I'm doing 😊" [puppet] - 10https://gerrit.wikimedia.org/r/641730 (https://phabricator.wikimedia.org/T268043) (owner: 10Ppchelko) [14:01:43] (03Merged) 10jenkins-bot: recommendation-api: Bump cpu resource request [deployment-charts] - 10https://gerrit.wikimedia.org/r/641727 (owner: 10Alexandros Kosiaris) [14:02:00] !log restart krb5-kpropd.service on krb2001 to force the pick up of new client configs [14:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:08] (03PS2) 10Ppchelko: Move MW /w/rest.php traffic to api-appserver. [puppet] - 10https://gerrit.wikimedia.org/r/641730 (https://phabricator.wikimedia.org/T268043) [14:03:54] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [14:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:26] !log installing openldap security updates on ro replicas [14:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:54] kormat Thank you! Should I be doing anything else at this point? 
Sorry to be so clueless [14:08:35] abbad: he is in a meeting, it may take about 20 min for him to get back to you [14:09:03] (03PS1) 10Alexandros Kosiaris: recommendation-api: Also fix request.cpu in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/641731 [14:09:22] !log copied /etc/krb5.keytab from krb1001 to krb2001 (the last one contained only one principal for 2001, the first one both for 1001 and 2001) [14:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:48] (03PS4) 10Jbond: puppet ca: cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) [14:11:09] jynus: That's fine. Thanks [14:11:18] (03CR) 10jerkins-bot: [V: 04-1] puppet ca: cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) (owner: 10Jbond) [14:11:37] 10Operations, 10Wikimedia-Apache-configuration: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (10Kormat) Hi. I have merged the gerrit change, but - i'm not sure how long it's going to take to take effect - the current redirect is a 301 (a 'permanent' redirect), so br... 
[14:12:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26502/console" [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) (owner: 10Jbond) [14:12:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] recommendation-api: Also fix request.cpu in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/641731 (owner: 10Alexandros Kosiaris) [14:12:46] (03PS5) 10Jbond: puppet ca: cert monitoring [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) [14:13:30] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/641714 (https://phabricator.wikimedia.org/T268104) (owner: 10Jbond) [14:13:57] !log Purge https://2030.wikimedia.org/ via purgeList.php (T264797) [14:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:03] T264797: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 [14:15:54] (03Merged) 10jenkins-bot: recommendation-api: Also fix request.cpu in helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/641731 (owner: 10Alexandros Kosiaris) [14:17:00] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [14:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:24] (03CR) 10Herron: [C: 03+2] aptrepo: add elastic710 component [puppet] - 10https://gerrit.wikimedia.org/r/641528 (owner: 10Herron) [14:21:32] (03CR) 10Filippo Giunchedi: "post-switchback I think we can merge this? what do you think?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/626403 (owner: 10Filippo Giunchedi) [14:21:58] (03PS1) 10Alexandros Kosiaris: recommendation-api: Remove some redundant values [deployment-charts] - 10https://gerrit.wikimedia.org/r/641734 [14:25:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] recommendation-api: Remove some redundant values [deployment-charts] - 10https://gerrit.wikimedia.org/r/641734 (owner: 10Alexandros Kosiaris) [14:26:39] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access for Jan Jaquemot - https://phabricator.wikimedia.org/T267771 (10herron) Hi @KFrancis, could you please confirm or coordinate an NDA for @JanJaquemot? Thanks in advance! [14:28:31] (03Merged) 10jenkins-bot: recommendation-api: Remove some redundant values [deployment-charts] - 10https://gerrit.wikimedia.org/r/641734 (owner: 10Alexandros Kosiaris) [14:28:56] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access for Till Mletzko - https://phabricator.wikimedia.org/T267744 (10herron) Hi @KFrancis, could you please confirm or coordinate an NDA for @tmletzko? Thanks in advance! 
[14:29:13] (03CR) 10Herron: [C: 03+2] admin: add ldap_only_entry for tobias-schumann-wmde-ext [puppet] - 10https://gerrit.wikimedia.org/r/641510 (https://phabricator.wikimedia.org/T267917) (owner: 10Herron) [14:29:35] (03CR) 10Herron: admin: add ldap_only_entry for tobias-schumann-wmde-ext [puppet] - 10https://gerrit.wikimedia.org/r/641510 (https://phabricator.wikimedia.org/T267917) (owner: 10Herron) [14:29:44] (03CR) 10Herron: [C: 03+2] admin: add ldap_only_entry for tobias-schumann-wmde-ext [puppet] - 10https://gerrit.wikimedia.org/r/641510 (https://phabricator.wikimedia.org/T267917) (owner: 10Herron) [14:29:49] (03CR) 10Herron: admin: add ldap_only_entry for tobias-schumann-wmde-ext [puppet] - 10https://gerrit.wikimedia.org/r/641510 (https://phabricator.wikimedia.org/T267917) (owner: 10Herron) [14:30:05] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [14:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:51] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [14:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:06] (03PS1) 10Jbond: bacula::client: notify service after installing certs [puppet] - 10https://gerrit.wikimedia.org/r/641736 (https://phabricator.wikimedia.org/T256454) [14:32:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26503/console" [puppet] - 10https://gerrit.wikimedia.org/r/641736 (https://phabricator.wikimedia.org/T256454) (owner: 10Jbond) [14:34:59] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[14:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:52] (03CR) 10Jbond: [V: 03+1] "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/641736 (https://phabricator.wikimedia.org/T256454) (owner: 10Jbond) [14:41:05] (03PS1) 10Elukey: kerberos: set dns_canonicalize_hostname = true for krb1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/641737 (https://phabricator.wikimedia.org/T257412) [14:42:21] (03PS2) 10Herron: admin: add ldap_only_entry for tobias-schumann-wmde-ext [puppet] - 10https://gerrit.wikimedia.org/r/641510 (https://phabricator.wikimedia.org/T267917) [14:44:59] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26505/console" [puppet] - 10https://gerrit.wikimedia.org/r/641737 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [14:45:09] (03CR) 10Herron: [C: 03+2] admin: add ldap_only_entry for tobias-schumann-wmde-ext [puppet] - 10https://gerrit.wikimedia.org/r/641510 (https://phabricator.wikimedia.org/T267917) (owner: 10Herron) [14:48:54] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP 'nda' access for Tobias Schumann - https://phabricator.wikimedia.org/T267917 (10herron) [14:49:27] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP 'nda' access for Tobias Schumann - https://phabricator.wikimedia.org/T267917 (10herron) 05Open→03Resolved a:03herron Hi @Tobias_Schumann_WMDE-ext, the requested access has been granted. I'll transition this to closed now, but please re-ope... 
[14:52:43] (03PS1) 10Reedy: logging.php: Monolog\Logger::setTimezone is no longer static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641740 (https://phabricator.wikimedia.org/T268141) [14:52:59] (03CR) 10Muehlenhoff: [C: 03+1] "Let's do that, one comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641737 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [14:53:38] (03PS2) 10Reedy: logging.php: Monolog\Logger::setTimezone is no longer static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641740 (https://phabricator.wikimedia.org/T268141) [14:53:49] (03CR) 10Jcrespo: "Oh, thank you so much! Do you think this fixes T256454 entirely? I guess we have to try it." [puppet] - 10https://gerrit.wikimedia.org/r/641736 (https://phabricator.wikimedia.org/T256454) (owner: 10Jbond) [14:54:05] (03PS2) 10Ppchelko: Switch ParserCache to JSON for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635607 (https://phabricator.wikimedia.org/T263579) [14:54:56] (03CR) 10Jcrespo: [C: 03+2] "Reloading client daemons, unlike reloading other bacula daemons is very safe- worst case scenario, it just cancels the backups." 
[puppet] - 10https://gerrit.wikimedia.org/r/641736 (https://phabricator.wikimedia.org/T256454) (owner: 10Jbond) [14:55:24] (03PS2) 10Jcrespo: bacula::client: notify service after installing certs [puppet] - 10https://gerrit.wikimedia.org/r/641736 (https://phabricator.wikimedia.org/T256454) (owner: 10Jbond) [14:57:42] (03PS1) 10Alexandros Kosiaris: recommendation-api: Allow overriding mysql_tables [deployment-charts] - 10https://gerrit.wikimedia.org/r/641742 (https://phabricator.wikimedia.org/T241230) [14:57:44] (03CR) 10Herron: [C: 03+2] admin: create swagoel account, add to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/641471 (https://phabricator.wikimedia.org/T267314) (owner: 10Herron) [14:57:51] (03PS2) 10Herron: admin: create swagoel account, add to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/641471 (https://phabricator.wikimedia.org/T267314) [14:58:40] (03PS1) 10Jbond: cfssl::cert: fix (re)sign command [puppet] - 10https://gerrit.wikimedia.org/r/641744 [14:59:20] (03CR) 10jerkins-bot: [V: 04-1] cfssl::cert: fix (re)sign command [puppet] - 10https://gerrit.wikimedia.org/r/641744 (owner: 10Jbond) [15:00:26] (03CR) 10Elukey: [V: 03+1] kerberos: set dns_canonicalize_hostname = true for krb1001/2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641737 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [15:00:58] (03PS2) 10Jbond: cfssl::cert: fix (re)sign command [puppet] - 10https://gerrit.wikimedia.org/r/641744 [15:01:38] (03PS2) 10Elukey: kerberos: set dns_canonicalize_hostname = true for krb1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/641737 (https://phabricator.wikimedia.org/T257412) [15:01:40] (03CR) 10Jbond: [C: 03+2] cfssl::cert: fix (re)sign command [puppet] - 10https://gerrit.wikimedia.org/r/641744 (owner: 10Jbond) [15:02:45] (03PS1) 10Ottomata: eventgate-analyitcs-external - bump to 2020-11-18-143227-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/641745 (https://phabricator.wikimedia.org/T266573) [15:03:03] (03PS4) 10Filippo Giunchedi: wikimedia.org: add alertmanager records for API [dns] - 10https://gerrit.wikimedia.org/r/641697 (https://phabricator.wikimedia.org/T266017) [15:03:25] !log Purge https://2030.wikimedia.org/ via purgeList.php (T264797) [15:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:33] T264797: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 [15:04:55] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analyitcs-external - bump to 2020-11-18-143227-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/641745 (https://phabricator.wikimedia.org/T266573) (owner: 10Ottomata) [15:05:03] 10Operations, 10Wikimedia-Apache-configuration: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (10Urbanecm) 05Open→03Resolved a:03Kormat Did a second HTCP purge after Puppet propagated it everywhere, and it seems to work: ` urbanecm@titanium ~ $ curl --silent... [15:05:29] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10herron) 05Open→03Resolved a:03herron Hi @Swagoel, the requested access has been granted and will be fully active wit... [15:05:35] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[15:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] recommendation-api: Allow overriding mysql_tables [deployment-charts] - 10https://gerrit.wikimedia.org/r/641742 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [15:07:06] (03PS1) 10Jbond: cfssl::cert: fix (re)sign command [puppet] - 10https://gerrit.wikimedia.org/r/641746 [15:07:45] (03CR) 10Jbond: [C: 03+2] cfssl::cert: fix (re)sign command [puppet] - 10https://gerrit.wikimedia.org/r/641746 (owner: 10Jbond) [15:09:25] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [15:09:25] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [15:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:33] Urbanecm: thanks! :) [15:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:44] no problem kormat ! [15:09:46] (03PS2) 10Herron: admin: create ldap_only account for stran [puppet] - 10https://gerrit.wikimedia.org/r/641463 (https://phabricator.wikimedia.org/T267968) [15:09:51] (03CR) 10Hnowlan: Move MW /w/rest.php traffic to api-appserver. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641730 (https://phabricator.wikimedia.org/T268043) (owner: 10Ppchelko) [15:10:36] (03Merged) 10jenkins-bot: recommendation-api: Allow overriding mysql_tables [deployment-charts] - 10https://gerrit.wikimedia.org/r/641742 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [15:11:45] (03CR) 10Herron: [C: 03+2] admin: create ldap_only account for stran [puppet] - 10https://gerrit.wikimedia.org/r/641463 (https://phabricator.wikimedia.org/T267968) (owner: 10Herron) [15:11:59] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [15:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:31] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [15:12:31] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [15:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:50] (03CR) 10Ppchelko: Move MW /w/rest.php traffic to api-appserver. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641730 (https://phabricator.wikimedia.org/T268043) (owner: 10Ppchelko) [15:12:51] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [15:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:02] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[15:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:24] 10Operations, 10Wikimedia-Apache-configuration: Update 2030.wikimedia.org redirect to new URI - https://phabricator.wikimedia.org/T264797 (10sguebo_WMF) >>! In T264797#6630512, @Kormat wrote: > Hi. I have merged the gerrit change, but > - i'm not sure how long it's going to take to take effect > - the current... [15:14:51] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add STran to `wmf` LDAP group - https://phabricator.wikimedia.org/T267968 (10herron) 05Open→03Resolved a:03herron Hi @STran, you have been added to the `wmf` LDAP group. I'll transition this to closed now, but please reopen if any follow-up is... [15:15:51] service-checker-swagger 10.64.65.2 http://recommendation-api:9632 [15:15:52] All endpoints are healthy [15:15:55] finally... [15:16:31] !log mwscript deleteEqualMessages.php --wiki=cswiki --delete [15:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:54] (03PS1) 10BBlack: authdns: copy geoip data rather than symlink [puppet] - 10https://gerrit.wikimedia.org/r/641747 (https://phabricator.wikimedia.org/T252577) [15:17:56] (03PS1) 10BBlack: fetch maxmind geoip daily instead of weekly [puppet] - 10https://gerrit.wikimedia.org/r/641748 (https://phabricator.wikimedia.org/T252577) [15:18:33] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [15:18:34] (03PS2) 10Herron: admin: create ldap_only entry for ijethrobt [puppet] - 10https://gerrit.wikimedia.org/r/641465 (https://phabricator.wikimedia.org/T267962) [15:20:14] (03CR) 10Herron: [C: 03+2] admin: create ldap_only entry for ijethrobt [puppet] - 10https://gerrit.wikimedia.org/r/641465 (https://phabricator.wikimedia.org/T267962) (owner: 10Herron) [15:20:59] (03CR) 10BBlack: [C: 03+2] "I've tested this against a live machine with "puppet apply", and it does indeed replace the current symlink correctly, and does nothing wh" [puppet] - 10https://gerrit.wikimedia.org/r/641747 (https://phabricator.wikimedia.org/T252577) (owner: 10BBlack) [15:21:35] bblack: please feel free to multiple mine [15:21:39] herron: ok [15:21:54] ty ty [15:22:19] [done merging] [15:22:41] (03PS1) 10Alexandros Kosiaris: conftool: Add recommendation-api to kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/641749 (https://phabricator.wikimedia.org/T241230) [15:22:53] 10Operations, 10DBA, 10Orchestrator, 10User-Kormat: Explore orchestrator hooks to integrate them with dbctl, !log, irc alerts and emails - https://phabricator.wikimedia.org/T266452 (10Kormat) One thing that's not currently clear how to handle is starting/stopping pt-heartbeat on masters. [15:23:45] 10Operations, 10Puppet, 10Data-Persistence-Backup, 10Patch-For-Review: Missing dependency on bacula-fd Puppet setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) Answering the question I made on the previous patch, my suspicion is that it should avoid future issues like T268104 after deploy, but... [15:29:13] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Request for LDAP Access in order to access Superset for IJethroBT-WMF - https://phabricator.wikimedia.org/T267962 (10herron) 05Open→03Resolved a:03herron Hi @IJethroBT-WMF, the requested access has been granted. 
I'll transition this to closed n... [15:29:34] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10RLazarus) s2, s6, and s7 have also finished. The s3 worker has completed wikis up through ruwikibooks (in alphabetical order). [15:30:01] (03PS2) 10Alexandros Kosiaris: conftool: Add recommendation-api to kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/641749 (https://phabricator.wikimedia.org/T241230) [15:30:03] (03PS1) 10Alexandros Kosiaris: lvs: Add new TLS enabled recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/641750 (https://phabricator.wikimedia.org/T241230) [15:33:54] (03PS1) 10Ayounsi: Fix bug preventing devices with no interfaces to be setup [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641752 [15:34:25] (03CR) 10jerkins-bot: [V: 04-1] Fix bug preventing devices with no interfaces to be setup [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641752 (owner: 10Ayounsi) [15:35:37] (03PS2) 10Ayounsi: Fix bug preventing devices with no interfaces to be setup [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641752 [15:36:14] (03CR) 10Ayounsi: [C: 03+2] Fix bug preventing devices with no interfaces to be setup [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641752 (owner: 10Ayounsi) [15:36:45] (03PS1) 10Alexandros Kosiaris: recommendation-api: Switch to using envoy based discovery [deployment-charts] - 10https://gerrit.wikimedia.org/r/641753 (https://phabricator.wikimedia.org/T241230) [15:37:39] (03CR) 10BBlack: [C: 03+2] fetch maxmind geoip daily instead of weekly [puppet] - 10https://gerrit.wikimedia.org/r/641748 (https://phabricator.wikimedia.org/T252577) (owner: 10BBlack) [15:39:27] (03CR) 10CDanis: [C: 03+1] Remove the legacy assert_headers regex format, which is unused. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/641567 (owner: 10RLazarus) [15:41:03] 10Operations, 10Traffic, 10Patch-For-Review: Maxmind data update issues for DNS (and others?) - https://phabricator.wikimedia.org/T252577 (10BBlack) 05Open→03Resolved a:03BBlack This should be fixed now! [15:42:19] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:59] (03CR) 10BryanDavis: [C: 03+1] logging.php: Monolog\Logger::setTimezone is no longer static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641740 (https://phabricator.wikimedia.org/T268141) (owner: 10Reedy) [15:46:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] recommendation-api: Switch to using envoy based discovery [deployment-charts] - 10https://gerrit.wikimedia.org/r/641753 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [15:46:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] conftool: Add recommendation-api to kubernetes nodes [puppet] - 10https://gerrit.wikimedia.org/r/641749 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [15:47:39] (03PS1) 10Mforns: analytics::refinery::job::refine.pp: Add transform function to netflow [puppet] - 10https://gerrit.wikimedia.org/r/641754 (https://phabricator.wikimedia.org/T254332) [15:48:23] 10Operations, 10ops-codfw, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10Papaul) Connected the device on scs-a1 on port 47 still no connection to serial [15:48:42] (03Merged) 10jenkins-bot: recommendation-api: Switch to using envoy based discovery [deployment-charts] - 10https://gerrit.wikimedia.org/r/641753 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [15:51:34] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[15:51:38] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10Trizek-WMF) So far so good! :) [15:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:27] (03CR) 10RLazarus: [C: 03+2] Remove the legacy assert_headers regex format, which is unused. [software/httpbb] - 10https://gerrit.wikimedia.org/r/641567 (owner: 10RLazarus) [15:54:56] (03Merged) 10jenkins-bot: Remove the legacy assert_headers regex format, which is unused. [software/httpbb] - 10https://gerrit.wikimedia.org/r/641567 (owner: 10RLazarus) [15:56:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [15:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:15] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . 
[15:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:46] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (RobH)
[16:00:48] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (RobH) a:RobH
[16:00:54] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (RobH)
[16:02:52] (CR) BBlack: [C: +2] TLS unified public cert: switch non-us to dc-2020 [puppet] - https://gerrit.wikimedia.org/r/641424 (https://phabricator.wikimedia.org/T261419) (owner: BBlack)
[16:03:25] RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[16:05:41] Operations, ops-codfw: asw-c7-codfw: fan alarm - https://phabricator.wikimedia.org/T268105 (Papaul) Open→Resolved unseat/re-seat the fan ` papaul@asw-c-codfw> show chassis alarms No alarms currently active
[16:07:10] ̶C̶r̶i̶t̶i̶c̶a̶l Device asw-c-codfw.mgmt.codfw.wmnet recovered from Juniper alarm active
[16:07:34] (CR) Elukey: [C: +2] kerberos: set dns_canonicalize_hostname = true for krb1001/2001 [puppet] - https://gerrit.wikimedia.org/r/641737 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[16:10:19] Operations, Puppet, Data-Persistence-Backup, Patch-For-Review: Missing dependency on bacula-fd Puppet setup - https://phabricator.wikimedia.org/T256454 (jbond) >>! In T256454#6630672, @jcrespo wrote: > Answering the question I made on the previous patch, my suspicion is that it should avoid futur...
[16:10:58] Hi. I would love to have access to Wikimedia logstash as I'm not available here: https://wikitech.wikimedia.org/wiki/LDAP/Groups. My account is: https://meta.wikimedia.org/wiki/User:DAlangi_(WMF)
[16:11:20] Not sure who to contact for this so Jdlrobson recommended I drop a message here. Thank you very much
[16:14:58] xSavitar: hi! we handle those requests on Phab, could you please open a ticket here, requesting to be added to the "wmf" group https://phabricator.wikimedia.org/project/profile/1564/
[16:15:28] rzl: Okay, thanks so much! Will file a task now
[16:15:34] 👍
[16:15:49] just leave it unassigned, and the SRE on clinic duty this week will grab it soon
[16:16:10] (PS1) Dzahn: site: add canary appserver role on mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/641756 (https://phabricator.wikimedia.org/T267248)
[16:17:31] (PS1) Dzahn: add mwdebug1003 to conftool-data [puppet] - https://gerrit.wikimedia.org/r/641757 (https://phabricator.wikimedia.org/T267248)
[16:17:50] (PS1) Hnowlan: api-gateway: use Envoy 1.16 everywhere. [deployment-charts] - https://gerrit.wikimedia.org/r/641758
[16:18:40] (PS1) Dzahn: trafficserver: add mwdebug1003 to x-wikimedia-debug-routing map [puppet] - https://gerrit.wikimedia.org/r/641759 (https://phabricator.wikimedia.org/T267248)
[16:20:33] (PS1) Dzahn: DHCP: add mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/641760 (https://phabricator.wikimedia.org/T267248)
[16:23:38] (CR) Ppchelko: [C: +1] "wooowhoooh" [deployment-charts] - https://gerrit.wikimedia.org/r/641758 (owner: Hnowlan)
[16:24:02] (CR) Jforrester: [C: +1] logging.php: Monolog\Logger::setTimezone is no longer static [mediawiki-config] - https://gerrit.wikimedia.org/r/641740 (https://phabricator.wikimedia.org/T268141) (owner: Reedy)
[16:25:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:25:39] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[16:26:27] jouncebot: now
[16:26:27] No deployments scheduled for the next 2 hour(s) and 33 minute(s)
[16:26:29] jouncebot: next
[16:26:29] In 2 hour(s) and 33 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201118T1900)
[16:26:29] In 2 hour(s) and 33 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201118T1900)
[16:27:32] !log robh@cumin1001 START - Cookbook sre.dns.netbox
[16:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:48] (CR) Reedy: [C: +2] "Let's give this a go" [mediawiki-config] - https://gerrit.wikimedia.org/r/641740 (https://phabricator.wikimedia.org/T268141) (owner: Reedy)
[16:28:37] (Merged) jenkins-bot: logging.php: Monolog\Logger::setTimezone is no longer static [mediawiki-config] - https://gerrit.wikimedia.org/r/641740 (https://phabricator.wikimedia.org/T268141) (owner: Reedy)
[16:28:50] Guess I should test that quickly ;P
[16:29:46] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Alangi Derick (DAlangi_WMF) - https://phabricator.wikimedia.org/T268150 (DAlangi_WMF)
[16:31:12] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Alangi Derick (DAlangi_WMF) - https://phabricator.wikimedia.org/T268150 (DAlangi_WMF)
[16:31:15] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[16:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:19] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:00] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Alangi Derick (DAlangi_WMF) - https://phabricator.wikimedia.org/T268150 (DAlangi_WMF)
[16:35:43] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[16:35:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[16:36:46] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:05] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:31] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:38:09] !log reedy@deploy1001 Synchronized wmf-config/logging.php: T268141 (duration: 01m 06s)
[16:38:11] (PS1) Elukey: Revert "kerberos: set dns_canonicalize_hostname = true for krb1001/2001" [puppet] - https://gerrit.wikimedia.org/r/641505
[16:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:16] T268141: Scap beta failing on mweval - https://phabricator.wikimedia.org/T268141
[16:38:40] (CR) Hnowlan: [C: +2] api-gateway: use Envoy 1.16 everywhere. [deployment-charts] - https://gerrit.wikimedia.org/r/641758 (owner: Hnowlan)
[16:40:17] (CR) Elukey: [C: +2] Revert "kerberos: set dns_canonicalize_hostname = true for krb1001/2001" [puppet] - https://gerrit.wikimedia.org/r/641505 (owner: Elukey)
[16:41:29] (Merged) jenkins-bot: api-gateway: use Envoy 1.16 everywhere. [deployment-charts] - https://gerrit.wikimedia.org/r/641758 (owner: Hnowlan)
[16:42:08] (PS1) RobH: an-tool1010 setup [puppet] - https://gerrit.wikimedia.org/r/641764 (https://phabricator.wikimedia.org/T268146)
[16:42:27] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[16:43:44] (CR) RobH: [C: +2] an-tool1010 setup [puppet] - https://gerrit.wikimedia.org/r/641764 (https://phabricator.wikimedia.org/T268146) (owner: RobH)
[16:43:58] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[16:43:58] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[16:43:59] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Alangi Derick (DAlangi_WMF) - https://phabricator.wikimedia.org/T268150 (RLazarus) @DAlangi_WMF Drive-by comment: If you only need Logstash, you don't need shell access. (Getting shell access involves a lot more steps for you the requeste...
[16:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:10] Operations, ops-eqiad: Invalid port info on asw2-d-eqiad - https://phabricator.wikimedia.org/T268125 (Cmjohnson) Open→Resolved ge-3/0/22 is in eth1 on and ge-3/0/23 is in eth 3 on restbase1018. (this is reflected on the switch) xe-4/0/0 is a new ms-be host that I am setting up now ge-6/0/6 is...
[16:44:38] (PS1) Elukey: Revert "Revert "kerberos: set dns_canonicalize_hostname = true for krb1001/2001"" [puppet] - https://gerrit.wikimedia.org/r/641766
[16:46:26] yes yes I know
[16:46:41] I hope that kormat doesn't see the double revert
[16:47:02] :D
[16:47:11] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Alangi Derick (DAlangi_WMF) - https://phabricator.wikimedia.org/T268150 (DAlangi_WMF)
[16:47:11] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:15] (CR) Elukey: [C: +2] Revert "Revert "kerberos: set dns_canonicalize_hostname = true for krb1001/2001"" [puppet] - https://gerrit.wikimedia.org/r/641766 (owner: Elukey)
[16:47:44] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Alangi Derick (DAlangi_WMF) - https://phabricator.wikimedia.org/T268150 (DAlangi_WMF) Thanks for the suggestion. Actually, I need access *only* to Logstash for now.
[16:47:48] (PS3) Jbond: wmflib::os_version: drop the os_version and requires_os functions [puppet] - https://gerrit.wikimedia.org/r/640418 (https://phabricator.wikimedia.org/T267396)
[16:48:09] (CR) Dzahn: "ACK, I will wait with doing more "require_package" replacements. It was just that I saw your large change after I had merged some smaller " [puppet] - https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: Jbond)
[16:48:49] (CR) Dzahn: [C: +2] "Thanks reviewers! Merging." [puppet] - https://gerrit.wikimedia.org/r/475453 (https://phabricator.wikimedia.org/T204993) (owner: Alex Monk)
[16:49:21] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[16:49:21] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[16:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:31] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:50:52] !log update /etc/krb5.keytab on krb1001/krb2001 to match the most up to date key version for host/krb2001.codfw.wmnet
[16:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:03] mutante: fyi replacing require_packages is definetly something that i appreciate, just saying to ignore my massive change that tries to do it all in one go as i need to rework that into at least a smaller set of changes
[16:51:28] but anything you do in the mean time is definetly appreciated
[16:51:47] (CR) Jbond: [C: +2] wmflib::os_version: drop the os_version and requires_os functions [puppet] - https://gerrit.wikimedia.org/r/640418 (https://phabricator.wikimedia.org/T267396) (owner: Jbond)
[16:52:24] !lof drop os_version/requiers_os functions from wmflib
[16:52:28] !log drop os_version/requiers_os functions from wmflib
[16:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:46] jbond42: heh! ok, I was going back and forth between "I don't really want to make a huge patch doing it all at once" and "don't go too far in making 100 changes and breaking his existing one" in this specific case :)
[16:54:11] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[16:54:40] Operations, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (RobH)
[16:54:49] RECOVERY - Check the last execution of replicate-krb-database on krb1001 is OK: OK: Status of the systemd unit replicate-krb-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:54:50] Operations, ops-eqiad, DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) Server is back down and ready for maintenance after the backup.
[16:55:03] mutante: yes im not sure i want to make a highe PS my self so please continue to make small changes as and when you see them :) thanks <3