[00:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T0000). [00:00:05] ebernhardson: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:17] Just me? i can deploy it [00:00:33] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@1c97543]: Bump mjolnir to master: Revert stream gzip decompression [00:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:38] (that deploy is unrelated) [00:01:15] (03PS4) 10Bstorm: toolforge: remove special configuration for kubernetes on proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/576469 (https://phabricator.wikimedia.org/T214513) [00:03:26] (03PS2) 10EBernhardson: [cirrus] configure wgCirrusSearchMaxShardsPerNode per cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575020 (owner: 10DCausse) [00:03:28] (03CR) 10Bstorm: "This version should work https://puppet-compiler.wmflabs.org/compiler1001/21245/tools-proxy-05.tools.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/576469 (https://phabricator.wikimedia.org/T214513) (owner: 10Bstorm) [00:03:35] (03CR) 10EBernhardson: [C: 03+2] [cirrus] configure wgCirrusSearchMaxShardsPerNode per cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575020 (owner: 10DCausse) [00:04:49] (03Merged) 10jenkins-bot: [cirrus] configure wgCirrusSearchMaxShardsPerNode per cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575020 (owner: 10DCausse) [00:05:57] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@1c97543]: Bump mjolnir to master: Revert stream gzip decompression (duration: 05m 25s) [00:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[00:06:42] !log post-deployment restart mjolnir-kafka-bulk-daemon across eqiad and codfw [00:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:59] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] configure wgCirrusSearchMaxShardsPerNode per cluster (duration: 01m 05s) [00:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:27] (03CR) 10EBernhardson: [C: 03+2] [cirrus] move similarity settings to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558576 (owner: 10DCausse) [00:09:30] (03PS7) 10EBernhardson: [cirrus] move similarity settings to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558576 (owner: 10DCausse) [00:09:32] (03CR) 10EBernhardson: [C: 03+2] [cirrus] move similarity settings to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558576 (owner: 10DCausse) [00:10:51] (03Merged) 10jenkins-bot: [cirrus] move similarity settings to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558576 (owner: 10DCausse) [00:13:23] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] move similarity settings to IS.php (duration: 01m 04s) [00:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:14] (03PS1) 10Dzahn: installserver: add parameter for DHCP interface [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) [00:14:40] (03CR) 10jerkins-bot: [V: 04-1] installserver: add parameter for DHCP interface [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [00:15:01] !log ebernhardson@deploy1001 Synchronized wmf-config/SearchSettingsForWikibase.php: [cirrus] move similarity settings to IS.php (duration: 01m 05s) [00:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:32] (03PS2) 10Dzahn: installserver: add 
parameter for DHCP interface [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) [00:16:14] (03PS3) 10Dzahn: installserver: add parameter for DHCP interface [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) [00:16:31] hmm, multiple undeployed patches in wmf.22 [00:17:15] wmf.21 is clean at least :) [00:17:18] (03CR) 10jerkins-bot: [V: 04-1] installserver: add parameter for DHCP interface [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [00:20:33] or actually...nm. It's the difference between two dots and three in HEAD..origin/wmf/1.35.0-wmf.22, should be the double dotted version [00:22:22] 10Operations, 10serviceops-radar, 10vm-requests: vm requests for APT repo / webserver - https://phabricator.wikimedia.org/T244626 (10Dzahn) 05Open→03Resolved a:03Dzahn The VMs have been created meanwhile. [00:22:26] 10Operations, 10Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10Dzahn) [00:23:35] !log ebernhardson@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: [cirrus] Match fallback config key with the one used in cirrus (duration: 01m 04s) [00:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:08] (03PS4) 10Dzahn: installserver: add parameter for DHCP interface [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) [00:24:24] 10Operations, 10Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10Dzahn) I guess the plan has been sorted out. We are in the middle of implementing it and the details are on T224576. Instead of adding some changes here and some in both places we... 
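(Editor's note on the 00:20:33 remark about two dots versus three in a git revision range. The sketch below is only a toy model, not the deployer's actual workflow: the commit graph and the commit names are hypothetical, and the two functions imitate the set semantics of `git log A..B` and `git log A...B` rather than calling git. It illustrates why a three-dot range can appear to list extra "undeployed" patches when the local checkout carries commits of its own.)

```python
# Toy commit graph: each commit maps to its list of parents.
# All commit names are hypothetical and exist only for this illustration.
PARENTS = {
    "base": [],
    "local1": ["base"],      # commit present only on the local checkout (HEAD)
    "remote1": ["base"],
    "remote2": ["remote1"],  # tip of the remote branch
}


def reachable(tip, parents=PARENTS):
    """Return the set of commits reachable from tip, including tip itself."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def two_dot(a, b):
    """Model of `git log a..b`: commits reachable from b but not from a."""
    return reachable(b) - reachable(a)


def three_dot(a, b):
    """Model of `git log a...b`: commits reachable from exactly one of a and b."""
    return reachable(a) ^ reachable(b)


# Two dots lists only what the remote has that HEAD lacks.
print(sorted(two_dot("local1", "remote2")))
# Three dots also includes commits that exist only locally.
print(sorted(three_dot("local1", "remote2")))
```

(In this model the two-dot range contains only the remote-only commits, while the three-dot range additionally contains the local-only commit, which matches the conclusion in the log that the double-dotted form is the one to use when checking for undeployed patches.)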
[00:24:54] (03CR) 10RLazarus: cumin: Replace apache-fast-test with httpbb in reimage scripts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576464 (owner: 10RLazarus) [00:25:24] !log ebernhardson@deploy1001 Synchronized php-1.35.0-wmf.21/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: [cirrus] Match fallback config key with the one used in cirrus (duration: 01m 03s) [00:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:49] (03CR) 10jerkins-bot: [V: 04-1] installserver: add parameter for DHCP interface [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [00:27:11] with that, swat is complete [00:36:39] 10Operations, 10Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10Dzahn) >>! In T242602#5859841, @ayounsi wrote: >> One use case I have of the install1002 server is: >> .. >> Fetch it over https with for example: `file copy "https://install1002.wi... [00:43:37] (03PS1) 10RLazarus: httpbb: Replace apache-fast-test with httpbb in deploy_apache_change. [puppet] - 10https://gerrit.wikimedia.org/r/576485 [00:43:51] (03PS5) 10Dzahn: installserver: add parameter for DHCP interface [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) [00:47:23] (03PS2) 10RLazarus: httpbb: Replace apache-fast-test with httpbb in deploy_apache_change. 
[puppet] - 10https://gerrit.wikimedia.org/r/576485 [00:48:21] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.293e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [00:50:28] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Jclark-ctr) [00:50:43] (03CR) 10RLazarus: "PCC looks correct: https://puppet-compiler.wmflabs.org/compiler1001/21249/" [puppet] - 10https://gerrit.wikimedia.org/r/576485 (owner: 10RLazarus) [00:52:21] (03PS3) 10RLazarus: httpbb: Replace apache-fast-test with httpbb in deploy_apache_change. [puppet] - 10https://gerrit.wikimedia.org/r/576485 [00:55:03] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [00:58:53] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Dwisehaupt) [01:06:24] (03PS1) 10Krinkle: tests: Reduce 'family' assertion to just 'wiki-suffix disambig' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576489 (https://phabricator.wikimedia.org/T169821) [01:06:26] (03PS1) 10Krinkle: [WIP] MWConfigCacheGenerator: Stop reading most wiki-family dblist files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576490 (https://phabricator.wikimedia.org/T169821) [01:09:55] (03PS7) 10CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - 
10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) [01:09:59] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10Jclark-ctr) [01:11:41] (03CR) 10Krinkle: "The diff is suspicious, hence WIP for now. I must've missed something." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576490 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [01:14:29] (03PS1) 10Dzahn: add hiera keys for parsoid-php on deployment-parsoid11 [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) [01:14:50] (03CR) 10CRusnov: "As we have discussed and agreed, here are the changes." (035 comments) [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [01:21:33] PROBLEM - SSH ganeti2003.mgmt on ganeti2003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:25:02] 10Operations, 10netbox, 10Patch-For-Review: Netbox report check for no position set in rack - https://phabricator.wikimedia.org/T239244 (10crusnov) In testing it seems as though the 0U height successfully prevents false negatives, but is there something additional you're trying to test for by specifying that... [01:30:15] !log ganeti2003 - mgmt interface stopped responding on SSH, resetting DRAC via bmc-device from the host [01:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:14] !log ganeti2003 - DRAC reset failed with "ipmi_cmd_cold_reset: BMC busy" [01:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:36] 10Operations, 10ops-codfw: ganeti2003.mgmt - please check connectivity - https://phabricator.wikimedia.org/T246857 (10Dzahn) [01:34:26] 10Operations, 10ops-codfw: ganeti2003.mgmt - stopped responding on SSH - please reset DRAC/BMC? 
- https://phabricator.wikimedia.org/T246857 (10Dzahn) [01:35:02] 10Operations, 10ops-codfw: ganeti2003.mgmt - stopped responding on SSH - please reset DRAC/BMC? - https://phabricator.wikimedia.org/T246857 (10Dzahn) p:05Triage→03Medium [01:35:55] ACKNOWLEDGEMENT - SSH ganeti2003.mgmt on ganeti2003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T246857 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:37:02] 10Operations, 10ops-codfw: ganeti2003.mgmt - stopped responding on SSH - please reset DRAC/BMC? - https://phabricator.wikimedia.org/T246857 (10Dzahn) There are actually 2 related alerts that both point at DRAC/BMC issue. SSH on mgmt (CRIT) https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=g... [01:47:14] (03PS1) 10CRusnov: reports/coherence.py: Add check for Juniper inventory item descriptions [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576499 (https://phabricator.wikimedia.org/T241289) [01:47:45] RECOVERY - PHP7 rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 73036 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:47:51] (03CR) 10Krinkle: [C: 04-1] "Eh, the diff is due to no longer triggering a really bad bug. See T246858 for details." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/576490 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [01:48:19] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [01:48:32] !log mw1315 - restarted php-fpm and apache (was alerting in Icinga with 503 for 12 hours), log showed failed coredumps, restarts recovered it [01:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:07] RECOVERY - Nginx local proxy to apache on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [01:55:44] !log mw2290 - systemctl reset-failed to clear (CRITICAL: Status of the systemd unit php7.2-fpm_check_restart) [01:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:56:45] !log mw2178 - systemctl reset-failed to clear (CRITICAL: Status of the systemd unit php7.2-fpm_check_restart) [01:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:31] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw2290 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:57:31] RECOVERY - Check the last execution of php7.2-fpm_check_restart on mw2178 is OK: OK: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:57:31] RECOVERY - Check systemd state on mw2178 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:57:31] RECOVERY - Check systemd state on mw2290 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:59:45] off [02:16:01] (03CR) 
10Jforrester: [C: 03+2] tests: Reduce 'family' assertion to just 'wiki-suffix disambig' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576489 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [02:17:16] (03Merged) 10jenkins-bot: tests: Reduce 'family' assertion to just 'wiki-suffix disambig' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576489 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [02:47:20] (03CR) 10Jforrester: "Manually applied to new deployment-parsoid11 in Beta Cluster; doesn't work yet, needs to fix the inheritance of groups as currently they'r" [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [03:13:14] (03PS1) 10CDanis: finish removing og & pd from icinga configs [puppet] - 10https://gerrit.wikimedia.org/r/576513 [03:16:22] (03CR) 10CDanis: [C: 03+2] finish removing og & pd from icinga configs [puppet] - 10https://gerrit.wikimedia.org/r/576513 (owner: 10CDanis) [03:18:34] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [03:21:47] (03PS1) 10Jforrester: [WiP] Provide infrastructure to create InitialiseSettings.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576514 [03:39:14] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10CDanis) [03:39:32] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10CDanis) p:05Triage→03High [03:44:25] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10CDanis) Target definitions as expanded by puppet are identical on both servers too. Very weird. 
[05:04:48] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 4 (install1003, ...), Fresh: 93 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [05:51:48] (03PS2) 10Jforrester: [WiP] Provide infrastructure to create InitialiseSettings.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576514 [06:05:34] (03PS1) 10Marostegui: mariadb: Reimage db1098 [puppet] - 10https://gerrit.wikimedia.org/r/576539 (https://phabricator.wikimedia.org/T246604) [06:06:00] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [06:09:36] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db1098 [puppet] - 10https://gerrit.wikimedia.org/r/576539 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [06:10:16] !log Stop MySQL on db1098:3316, db1098:3317 for upgrade - T246604 [06:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:22] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [06:14:28] (03CR) 10Marostegui: "> the non-instanced class on the default port may need the same" [puppet] - 10https://gerrit.wikimedia.org/r/576398 (https://phabricator.wikimedia.org/T242702) (owner: 10Jcrespo) [06:21:56] !log ✔️ cdanis@prometheus2004.codfw.wmnet ~ 🕝☕ sudo systemctl reload prometheus@ops [06:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:40] !log ✔️ cdanis@prometheus2004.codfw.wmnet ~ 🕝☕ sudo systemctl restart prometheus@ops [06:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:16] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10CDanis) 05Open→03Resolved a:03CDanis netstat confirmed that prom2004 wasn't even trying {P10604} A restart fixed it. I have no idea. 
[06:27:44] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10Vgutierrez) tcpdump & netstat confirmed that prometheus2004 wasn't even trying to connect to lvs2007|8:9100, a restart has fixed it. [06:28:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [06:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:32] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10CDanis) Oh, also NB that a simple `reload` was **not** sufficient to fix (tried that first). [06:40:11] (03PS1) 10Marostegui: Revert "install_server: Allow manual reimage db109[6-9]" [puppet] - 10https://gerrit.wikimedia.org/r/576542 [06:40:29] (03PS2) 10Marostegui: Revert "install_server: Allow manual reimage db109[6-9]" [puppet] - 10https://gerrit.wikimedia.org/r/576542 [06:42:15] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Allow manual reimage db109[6-9]" [puppet] - 10https://gerrit.wikimedia.org/r/576542 (owner: 10Marostegui) [06:42:51] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.301e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [06:45:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1098:3316 and db1098:3317 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10605 and previous config saved to /var/cache/conftool/dbconfig/20200304-064520-marostegui.json [06:45:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:25] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [06:46:06] (03PS1) 10Marostegui: db1098: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/576543 (https://phabricator.wikimedia.org/T246604) [06:47:56] (03CR) 10Marostegui: [C: 03+2] db1098: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/576543 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [06:52:59] (03CR) 10Vgutierrez: [C: 03+1] profile::tcp_fast_open: create tiny profile [puppet] - 10https://gerrit.wikimedia.org/r/576279 (owner: 10Giuseppe Lavagetto) [06:53:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::tcp_fast_open: create tiny profile [puppet] - 10https://gerrit.wikimedia.org/r/576279 (owner: 10Giuseppe Lavagetto) [06:55:10] (03Abandoned) 10Vgutierrez: Rename lvs[2001-2006] interface dependent hostnames [dns] - 10https://gerrit.wikimedia.org/r/428888 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [06:56:54] (03PS1) 10Vgutierrez: Edit Project Config [debs/trafficserver] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/576545 [06:57:18] (03Abandoned) 10Vgutierrez: Edit Project Config [debs/trafficserver] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/576545 (owner: 10Vgutierrez) [06:58:09] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [06:59:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::service_proxy: absent everywhere [puppet] - 10https://gerrit.wikimedia.org/r/576280 (owner: 10Giuseppe Lavagetto) [07:00:34] (03PS1) 10Marostegui: install_server: Simplify non-srv format recipe [puppet] - 
10https://gerrit.wikimedia.org/r/576548 [07:00:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1098:3316 and db1098:3317 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10606 and previous config saved to /var/cache/conftool/dbconfig/20200304-070048-marostegui.json [07:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:54] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [07:07:46] (03PS1) 10Marostegui: es2 hosts: Change them to standalone [puppet] - 10https://gerrit.wikimedia.org/r/576549 (https://phabricator.wikimedia.org/T246072) [07:08:55] (03CR) 10Marostegui: [C: 04-2] "Wait until https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/576286/ is pushed and hosts are set to read only and replicati" [puppet] - 10https://gerrit.wikimedia.org/r/576549 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [07:09:05] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Set es2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576286 (https://phabricator.wikimedia.org/T246072) [07:13:39] (03CR) 10Marostegui: [C: 03+2] install_server: Simplify non-srv format recipe [puppet] - 10https://gerrit.wikimedia.org/r/576548 (owner: 10Marostegui) [07:14:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1098:3316 and db1098:3317 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10607 and previous config saved to /var/cache/conftool/dbconfig/20200304-071443-marostegui.json [07:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:48] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [07:19:11] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:14] (03PS1) 10Vgutierrez: Release 8.0.6-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/576553 [07:19:26] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.6-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/576553 (owner: 10Vgutierrez) [07:20:11] PROBLEM - DPKG on deploy1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [07:21:40] (03Abandoned) 10Vgutierrez: Release 8.0.6-rc1-1wm1 [debs/trafficserver] (8.0.6) - 10https://gerrit.wikimedia.org/r/573221 (owner: 10Vgutierrez) [07:23:09] (03PS2) 10Vgutierrez: Release 8.0.6-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/576553 [07:23:19] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.6-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/576553 (owner: 10Vgutierrez) [07:28:55] 10Operations, 10Performance-Team (Radar): eqiad: (1) misc single cpu server allocation for performance browser testing - https://phabricator.wikimedia.org/T204589 (10Peter) Adding @dpifke since he is the one that we have been waiting on joining, so you have the history :) [07:31:17] (03PS3) 10Vgutierrez: Release 8.0.6-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/576553 [07:32:01] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.6-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/576553 (owner: 10Vgutierrez) [07:32:35] <_joe_> deploy1001 is me [07:33:39] RECOVERY - DPKG on deploy1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [07:37:09] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:37:12] (03PS1) 10Marostegui: install_server: Allow reimage db1103 [puppet] - 10https://gerrit.wikimedia.org/r/576590 (https://phabricator.wikimedia.org/T246604) [07:37:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully 
repool db1098:3316 and db1098:3317 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10608 and previous config saved to /var/cache/conftool/dbconfig/20200304-073721-marostegui.json [07:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:27] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [07:41:26] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy: remove from puppet [puppet] - 10https://gerrit.wikimedia.org/r/576591 [07:51:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> @akosiaris Does it make sense to start with the expansion before replacing existing servers?" [puppet] - 10https://gerrit.wikimedia.org/r/576406 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [07:51:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] "PCC is at https://puppet-compiler.wmflabs.org/compiler1003/21231/" [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [07:52:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "nrpe: Don't set PrivateTmp=True" [puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [07:57:09] (03CR) 10Jcrespo: "Be careful, this also allows reimage of db1100. If not wanted, change db110[1-2] into db110[0-2]." [puppet] - 10https://gerrit.wikimedia.org/r/576590 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [07:59:35] any issue so far with icinga? [08:01:02] (03CR) 10Marostegui: "> Be careful, this also allows reimage of db1100. 
If not wanted," [puppet] - 10https://gerrit.wikimedia.org/r/576590 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [08:02:01] (03PS2) 10Marostegui: install_server: Allow reimage db1103 [puppet] - 10https://gerrit.wikimedia.org/r/576590 (https://phabricator.wikimedia.org/T246604) [08:04:35] (03CR) 10Jcrespo: [C: 03+1] install_server: Allow reimage db1103 [puppet] - 10https://gerrit.wikimedia.org/r/576590 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [08:06:59] (03PS1) 10Elukey: profile::swap: add auto_ferm settings for rsync [puppet] - 10https://gerrit.wikimedia.org/r/576614 [08:13:30] !log START warm cache for db1111 & db1126 for Q20-25 million T219123 (pass 1 today) [08:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:35] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [08:14:15] (03CR) 10Elukey: [C: 03+2] profile::swap: add auto_ferm settings for rsync [puppet] - 10https://gerrit.wikimedia.org/r/576614 (owner: 10Elukey) [08:15:07] akosiaris: o/ - I was about to puppet-merge and I saw your puppet-disable msg, can I proceed or better to wait? [08:16:03] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:16:12] elukey: I 'd rather you waited a bit. 
I can prioritise the hosts you are interested in though [08:16:57] akosiaris: nono the change is really low priority, no problem to wait [08:19:37] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:20:19] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:20:42] <_joe_> XioNoX: ^^ [08:20:50] <_joe_> this seems like a serious issue [08:21:07] looking [08:21:21] <_joe_> elukey: we're about to get paged btw [08:21:34] <_joe_> err s/elukey/Xionox/ [08:21:46] <_joe_> I'll prepare an eqsin depool [08:22:02] +1 [08:22:04] to prepare [08:22:14] I don't see anything wrong so far on both routers [08:22:18] still looking [08:23:07] (03PS1) 10Giuseppe Lavagetto: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/576621 [08:23:10] v4 and v6 mtr to text-lb.eqsin work fine from here [08:23:22] _joe_: what's the issue? [08:23:31] <_joe_> it just recovered it seems [08:23:44] <_joe_> we had timeouts in connecting to upload-lb in eqsin [08:23:53] <_joe_> both from monitoring and my home connection [08:24:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:24:03] <_joe_> but it recovered now, as did the bgp alert [08:26:35] _joe_: was it v4 or v6? 
[08:26:41] 10Operations, 10Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (10MoritzMuehlenhoff) [08:26:48] https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas shows an increase of latency for v6 [08:28:37] 10Operations, 10Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (10Joe) a:03RLazarus yes they do, they also released 1.12.3 I think we can move to 1.13 and slowly rollout the change. Assigning to our envoy-build expert in residence :P [08:29:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::services_proxy: remove from puppet [puppet] - 10https://gerrit.wikimedia.org/r/576591 (owner: 10Giuseppe Lavagetto) [08:30:55] PROBLEM - Check systemd state on db1098 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:34] (03PS1) 10Marostegui: install_server: Reimage db1103 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576624 (https://phabricator.wikimedia.org/T246604) [08:33:51] ACKNOWLEDGEMENT - Check systemd state on db1098 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Marostegui This is being fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/576398/ https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/513201 (owner: 10Dzahn) [08:34:42] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1103 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576624 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [08:35:07] _joe_ elukey good to merge your changes? 
[08:36:22] marostegui: I am ok, I was waiting for akosiaris to complete his maintenance for nrpe [08:36:33] ah ok [08:36:35] (puppet was disabled on puppet master) [08:36:35] same here then [08:36:41] (03CR) 10Muehlenhoff: "Or we simply postpone until lists1001 is live and fermium has been decommed?" [puppet] - 10https://gerrit.wikimedia.org/r/576333 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [08:36:43] My change can wait [08:36:57] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [08:37:14] all known ^, I am proceeding with codfw in parts btw [08:37:35] akosiaris: fyi, my change is good to be merged anytime [08:37:36] hoping to be done in about 15mins. The transient eqsin issue did not help ofc [08:37:43] ok, good to know, thanks! [08:41:05] !log running puppet on first db host after merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464601/, db2086, rescheduling icinga checks as well [08:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:26] !log running puppet on first es host after merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464601/, db2019, rescheduling icinga checks as well [08:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:35] !log running puppet on first es host after merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464601/, es2019, rescheduling icinga checks as well (correction) [08:41:37] dbs should be ok, it is other services I worry [08:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:00] jynus: still, it doesn't hurt to be careful [08:42:08] o, don't disagree [08:42:18] I will also want to restart some services to confirm [08:42:43] PROCS CRITICAL: 0 processes with 
regex args '^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site' [08:42:46] this one is really weird [08:42:58] it's on gerrit1002 and shows up and disappears every now and then [08:43:06] that was alerting before [08:43:16] I belive it may be in progress [08:43:16] it's a damn easy check.. what on earth. It's also not related, I haven't touched eqiad yet [08:43:35] because it had alerts silenced [08:43:40] <_joe_> akosiaris: I suggest you enable puppet everywhere and we'll manage whatever breaks [08:43:50] _joe_: I am pretty close to that threshold [08:43:57] <_joe_> the risk is just a shower of alerts in case right? [08:44:03] <_joe_> we can live with that AIUI [08:44:04] more or less yes [08:44:20] <_joe_> akosiaris: try one mw host, I fear the check we added with hugh uses /tmp [08:44:43] I am on those hosts right now :-) [08:45:44] !log running puppet on first mw host after merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464601/, mw2269, rescheduling icinga checks as well [08:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:27] I am also doing entire clusters as well btw, just not logging them as they are really not that interesting [08:49:31] (03PS1) 10KartikMistry: Add apertium-pol-szl package [debs/contenttranslation/apertium-pol-szl] - 10https://gerrit.wikimedia.org/r/576628 (https://phabricator.wikimedia.org/T202276) [08:50:34] jynus: ● gerrit.service - Gerrit code review tool [08:50:34] Loaded: loaded (/lib/systemd/system/gerrit.service; enabled; vendor preset: enabled) [08:50:34] Active: active (running) since Wed 2020-03-04 08:50:21 UTC; 1s ago [08:50:40] couldn't help myself, I just had a look [08:50:45] it is being constantly restarted [08:50:47] mutante: ^^ [08:51:00] that's on gerrit1002 btw [08:51:19] 1002 is the non active one, I think [08:51:40] yeah but whether it's active or not the service should be 
restarted every couple of secs [08:51:48] should NOT * [08:51:53] oh, I agree [08:51:57] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [08:52:11] just calming things user-impact wise [08:52:49] I am ~5mins from enabling puppet across the fleet I think [08:54:10] !log START warm cache for db1111 & db1126 for Q20-25 million T219123 (pass 2 today) [08:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:15] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [08:54:47] akosiaris: Cool, we have a maintenance window scheduled for setting external store in read only, I will wait for you to be done though. No rush, we have time for it [08:55:11] es2 only, to be clear [08:55:23] ok, will be on time [08:55:30] no rush, we have enough time [08:55:34] the window is large enough [08:57:01] oh no worries. I think that I 've tested enough to avoid any large alert storms. If anything alerts it will be small things [08:57:24] that can be handled on their own at our leisure [08:58:27] !log release Giant Puppet Lock across the fleet. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464601/ has made it's way to all PoPs and most of codfw without issues, will make it in the rest of the fleet in the next 30mins [08:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:04] elukey: marostegui: _joe_: I 've just merged all of your changes [08:59:16] \o/ [08:59:31] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [08:59:37] I 'll be monitoring it during the next 1h but it looks pretty ok up to now [08:59:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:00:04] marostegui and jynus: Your horoscope predicts another unfortunate es2 database read-only deployment deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T0900). [09:00:17] Thanks for the trust jouncebot [09:00:23] another? [09:00:32] how many times do we plan to do it???? [09:00:35] :-) [09:00:35] haha [09:00:36] 42 [09:00:40] :P [09:00:57] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:01:04] So the idea is to merge https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/576286/ and right before deploying set read_only on mysql first and then deploy [09:01:07] to avoid any race condition [09:01:08] one sec [09:01:21] how is icinga deployment going? [09:01:48] {{done}} for now? [09:02:03] I believe so, per akosiaris last comment? [09:02:07] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:02:24] jynus: you are clear to go. 
my lock in released [09:02:27] is* [09:02:28] ok [09:02:32] just to be 100% sure [09:02:37] I have an issue with typos today it seems [09:02:40] jynus: please confirm the host is getting read_only on mysql would be es1015 [09:02:47] es2 master [09:02:48] we should wait a bit before doing that [09:02:52] after deploy [09:03:03] confirm no connections writing/binlog [09:03:12] as it may take some seconds [09:03:25] On the other hand, we could have race conditions [09:03:39] well, writing extra is not a big deal [09:03:47] as long as it stopps eventually [09:03:57] but failing writes is losing content [09:04:09] how would you coordinate puppet changes? [09:04:10] yeah, but what if you write to es2 and then attempt to read that same content from es4? [09:04:18] no that cannot happen [09:04:26] it stores the location on metadata [09:04:30] ah cool [09:04:34] then we are good [09:04:45] (03CR) 10Marostegui: db-eqiad,db-codfw.php: Set es2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576286 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:04:46] it stores on text table "clusterXX" [09:04:57] Then, let's take a last look at https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/576286/ [09:04:59] that is why I would want to wait so metadata + data are in sync [09:05:04] makes sense yeah [09:05:25] after all, es is way more limited on what write type it has [09:05:28] compared to metadata [09:05:36] (only edits) [09:06:09] what about puppet, how to coordinate read only change with alert + puppet? 
[09:06:27] (I know that is later) [09:06:28] That's not a big deal, we can disable the alert for 5 minutes [09:06:32] ok [09:06:48] Set read only, disconnect replication and all that within that time [09:07:00] (03CR) 10Jcrespo: [C: 03+1] db-eqiad,db-codfw.php: Set es2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576286 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:07:12] let's go then for the MW deployment [09:07:14] I think doing codfw first should be ok [09:07:17] yep [09:07:21] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Set es2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576286 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:07:27] not in a rush for each step, I think [09:07:40] vs a metadata switchover [09:07:40] yeah [09:07:43] exactly [09:07:57] let's monitor closely the binlog before indeed setting it to read_only [09:08:02] hopefully it won't take hours :) [09:08:18] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Set es2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576286 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:09:41] waiting for scap now with metrics and logs [09:09:58] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Set es2 as RO - T246072 (duration: 01m 14s) [09:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:03] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [09:10:03] Going to browse the site with mwdebug2001 [09:10:27] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:30] ok [09:10:41] (03PS1) 10Elukey: role::swap: list all hosts allowed to rsync home dirs [puppet] - 10https://gerrit.wikimedia.org/r/576632 [09:11:35] es and en looking fine, going to browse some other projects [09:12:08] we can mybe find a revision from the old host, give a second [09:12:10] (03CR) 10Filippo Giunchedi: "See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [09:12:37] (03CR) 10Elukey: [C: 03+2] "Andrew: I am inclined to just use this fix for the moment, since ferm rules seem to not like wildcards. When we'll merge swap with stat bo" [puppet] - 10https://gerrit.wikimedia.org/r/576632 (owner: 10Elukey) [09:12:54] I am browsing very old revision from commons main page [09:13:09] <_joe_> !log removing nginx from servers where it was just used for service proxying. [09:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:11] PROBLEM - DPKG on deploy2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:14:24] <_joe_> that is me probably [09:15:00] cluster 24 is the one going read only, right? [09:15:03] commons and wiktionary looking good to [09:15:05] jynus: correct [09:16:27] RECOVERY - DPKG on deploy2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:16:50] arg, content_address is not indexed, so it takes a lot to go backwards [09:17:07] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:17:24] we'll just say it is ok [09:17:34] deploy on codfw [09:17:45] I think it is ok, I haven't found anything strange on my browsing or logstash [09:18:14] let's deploy in eqiad? 
[09:20:17] ok [09:20:27] ok, deploying [09:21:21] Monitoring es1015 connections and binlog [09:21:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set es2 as RO - T246072 (duration: 01m 04s) [09:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:43] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [09:22:00] no more wikisuer connections for now [09:22:44] well, connections should be around, ah you mean on master? [09:22:49] Monitoring binlog after position 749789971 [09:22:52] yeah, master ones gone [09:22:53] yeah, I mean the master [09:23:00] but replicas should still connect [09:23:06] yes, of course [09:23:26] those are working fine [09:23:33] no more writes so far in binlog [09:23:44] i think the jobqueue cannot create edits [09:23:51] so less issues in this case [09:24:11] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10fgiunchedi) Curious indeed, thanks for investigating! I took a quick look and it looks like prometheus2004 didn't even know about lvs2007 before today at ~6.23, so I'm suspec... 
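(Editor's sketch, not part of the channel log.) The binlog watch discussed above, where a position is captured after the deploy (749789971 at 09:22:49) and later checked for new writes, could be expressed roughly as follows. This is only an illustration under stated assumptions: the coordinates are taken to be (file, position) pairs as shown by `SHOW MASTER STATUS`, the binlog file name `es1015-bin.001234` is hypothetical, and the function names are invented for this example. It also assumes binlog file suffixes are zero-padded to the same width, so plain tuple comparison orders them correctly. Heartbeat writes would still advance the position, so a real check would have to filter those out or run once pt-heartbeat is stopped.

```python
from typing import Iterable, Optional, Tuple

# (binlog file name, byte offset), e.g. as reported by SHOW MASTER STATUS.
BinlogCoord = Tuple[str, int]

# Hypothetical file name; the offset is the one quoted in the log.
BASELINE: BinlogCoord = ("es1015-bin.001234", 749789971)


def has_advanced(baseline: BinlogCoord, sample: BinlogCoord) -> bool:
    """True if `sample` is strictly past `baseline`.

    Assumes zero-padded binlog suffixes of equal width, so that comparing
    (file, position) tuples matches the real binlog ordering.
    """
    return sample > baseline


def first_write_after(
    baseline: BinlogCoord, samples: Iterable[BinlogCoord]
) -> Optional[BinlogCoord]:
    """Return the first sampled coordinate past the baseline, or None if idle."""
    for sample in samples:
        if has_advanced(baseline, sample):
            return sample
    return None
```

Feeding it successive samples taken every few seconds gives either `None` (nothing written since the baseline) or the first coordinate at which the master wrote again.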
[09:24:13] no errors on metrics, log [09:24:18] let me check edit rates, etc [09:24:30] also, try editing [09:25:36] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=es1&var-shard=es2&var-shard=es3&var-shard=es4&var-role=All [09:25:39] edit count is stable [09:25:45] this is looking logical, es2 decreasing writes [09:25:48] but reads are the same [09:26:30] we need more time to check if save timing got affected [09:26:46] yes, let's give it more minutes [09:26:52] binlog looking clean too [09:27:38] (03PS1) 10Muehlenhoff: Remove system::role from role::prometheus::ops [puppet] - 10https://gerrit.wikimedia.org/r/576634 [09:30:07] last enwiki edit to use cluster24 was over 1000 edits ago, and counting [09:30:33] yeah, there is nothing on binlogs after the deployment was finished and I captured the position [09:30:36] I would give you the timestamp, but as I said before, I lack the indexes to backtrack efficiently [09:30:37] just heartbeat [09:31:37] we also have no errors or anything apparently [09:31:44] the rest are 50/50 between cluster25 and 26 [09:32:32] that's good [09:33:45] in the last 24 hours [09:33:51] I see 8 errors on fetchblob [09:34:02] all from mwmaint1002 [09:34:26] that's probably from yesterday's test? [09:34:37] no, wikidata, 21-1am [09:34:40] ah [09:34:44] unrelated I guess [09:35:01] definitely not user impacting, but I wonder who was testing? [09:35:27] /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/rebuildItemTerms.php was run [09:36:01] ^addshore do you know who could have ran that? 
[09:36:12] me [09:36:22] * addshore reads up [09:37:07] (03PS3) 10Alexandros Kosiaris: facilities:monitor_pdu_service: Add types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/576390 [09:37:09] (03PS1) 10Alexandros Kosiaris: discovery: Add eventgate-analytics-external to check [puppet] - 10https://gerrit.wikimedia.org/r/576636 (https://phabricator.wikimedia.org/T233629) [09:37:33] it failed, and had no consequences, but I wonder if that may be broken because of our changes? [09:37:35] that script has been running constantly for a rather long time, it is the item term store batch migration script [09:37:44] * addshore goes to check the logs [09:37:56] but if it restarts, it should have gotten the right config [09:38:02] if it failed for a few pages / entities then that is fine :) [09:38:12] yeah, not worried about consequences [09:38:15] ack! it will restart at the end of each 1 million items which can take a bit of time [09:38:19] I am just bringing it up [09:38:35] thanks! should all be okay as we will do a second pass over any holes anyway! thanks for the heads up! [09:38:45] !log START warm cache for db1111 & db1126 for Q20-25 million T219123 (pass 3 today) [09:38:46] because of our change [09:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:51] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [09:39:14] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) Also sent https://salsa.debian.org/openstack-team/services/neutron-dynamic-routing/-/merge_requests/2 [09:39:17] marostegui: so read_only dynamically and on puppet + topology changes? [09:39:44] jynus: yeah, can you double check this? 
while I downtime the hosts and confirm hostnames and all that: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/576549/ [09:39:58] (03PS1) 10Alexandros Kosiaris: lvs: Switch eventgate-analytics-external to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/576637 (https://phabricator.wikimedia.org/T233629) [09:41:45] (03CR) 10Jcrespo: [C: 03+1] es2 hosts: Change them to standalone [puppet] - 10https://gerrit.wikimedia.org/r/576549 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:41:57] After the read only, I will kill pt-heartbeat and grab the coordinates just in case [09:41:57] hots and changes there are ok [09:42:05] I wonder if there is something we will be missing [09:42:07] (03CR) 10Marostegui: es2 hosts: Change them to standalone [puppet] - 10https://gerrit.wikimedia.org/r/576549 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:42:32] (03PS1) 10Addshore: Read from the new term store up to Q25 mill everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576638 (https://phabricator.wikimedia.org/T219123) [09:42:33] update zarcillo a bit [09:42:47] it would require no update no? [09:42:59] anyways, we can check zarcillo and tendril later [09:43:15] so, setting read_only on es1015, confirm it is the right host? 
[09:43:20] es1015: set global read_only=1; [09:43:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Switch eventgate-analytics-external to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/576637 (https://phabricator.wikimedia.org/T233629) (owner: 10Alexandros Kosiaris) [09:43:51] also making sure pt-heatbeat stops [09:43:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] discovery: Add eventgate-analytics-external to check [puppet] - 10https://gerrit.wikimedia.org/r/576636 (https://phabricator.wikimedia.org/T233629) (owner: 10Alexandros Kosiaris) [09:43:56] !log Set es1015 (es2 master) on read_only - T246072 [09:43:59] (03CR) 10Marostegui: [C: 03+2] es2 hosts: Change them to standalone [puppet] - 10https://gerrit.wikimedia.org/r/576549 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:01] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [09:44:04] checking logs [09:44:19] done [09:44:26] going to stop pt-heartbeat [09:44:54] !log installing python-bleach security updates [09:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:11] (03PS2) 10Alexandros Kosiaris: discovery: Add eventgate-analytics-external to check [puppet] - 10https://gerrit.wikimedia.org/r/576636 (https://phabricator.wikimedia.org/T233629) [09:45:13] (03PS2) 10Alexandros Kosiaris: lvs: Switch eventgate-analytics-external to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/576637 (https://phabricator.wikimedia.org/T233629) [09:45:17] pt-heartbeat stopped [09:45:28] binlog not advancing [09:45:44] did puppet run with the new stuff already? 
[09:45:48] yes [09:46:11] let me run it on icinga too [09:46:22] (03PS1) 10Muehlenhoff: Remove system::role from role::logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/576639 [09:46:38] pasted the positions: https://phabricator.wikimedia.org/T246072#5940429 [09:46:53] yeah, that could be useful [09:46:57] thanks [09:47:10] going to run puppet on the other hosts too [09:47:58] warnings that "lag was starting to build up" :-) [09:48:05] (03PS1) 10Elukey: elasticsearch: add https:// to relforge endpoints [software/spicerack] - 10https://gerrit.wikimedia.org/r/576641 [09:48:07] hehe [09:48:13] should be gone by now, right? [09:48:27] when replication gets resetted [09:48:46] I don't think we differenciate standalone host regarding replication checks [09:48:54] maybe we should [09:49:51] (03PS1) 10Volans: debmonitor: make icinga check desc unique [puppet] - 10https://gerrit.wikimedia.org/r/576642 [09:49:54] ok, confirm the following: es1011, es1013, es2016, es2014, es2016: stop slave; reset slave all; [09:50:07] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: include pubkey for osbpo repository [puppet] - 10https://gerrit.wikimedia.org/r/576643 (https://phabricator.wikimedia.org/T246671) [09:50:38] puppet finished running on all es2 hosts [09:51:09] i saw some read only errors, but they were because spureous lag on commons metadata, not es [09:51:54] rows written went to absolute 0 on grafana: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=es1&var-shard=es2&var-shard=es3&var-shard=es4&var-shard=es5&var-role=All&from=1583311895862&to=1583315495862&panelId=7&fullscreen [09:52:10] makes sense, after pt-heartbeat stopped [09:53:49] let's reset replication? 
[09:53:52] yep [09:53:57] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/21255/" [puppet] - 10https://gerrit.wikimedia.org/r/576634 (owner: 10Muehlenhoff) [09:54:03] ok, confirm this please: es1011, es1013, es2016, es2014, es2016: stop slave; reset slave all; [09:54:13] let me double check the hosts [09:54:17] yup [09:54:23] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.52:4692]) https://wikitech.wikimedia.org/wiki/PyBal [09:54:51] hosts ok, command ok [09:55:04] cool [09:55:15] !log Reset replication on es2 hosts - T246072 [09:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:20] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [09:55:29] ok, done [09:55:42] show slave status is now empty for all of them [09:56:25] (03PS1) 10Alexandros Kosiaris: lvs: Switch eventgate-analytics-external to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/576646 (https://phabricator.wikimedia.org/T233629) [09:56:27] (03CR) 10Volans: [C: 03+1] "LGTM unless there is something special to that cluster that I'm missing" [software/spicerack] - 10https://gerrit.wikimedia.org/r/576641 (owner: 10Elukey) [09:56:31] let me re-schedule a lag check on one of the hosts [09:56:35] wait [09:56:46] es2016, es2014, es2016 [09:56:57] all that is ok, but one is repeated and one missing [09:56:58] fixing that, should be es2015 [09:57:19] done [09:57:58] also let's increase weight of old master [09:58:04] (at some point) [09:58:06] doing it [09:58:15] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/21256/" [puppet] - 10https://gerrit.wikimedia.org/r/576639 (owner: 10Muehlenhoff) [09:58:24] (03PS1) 10Alexandros Kosiaris: lvs: Switch eventgate-analytics-external to production [puppet] - 10https://gerrit.wikimedia.org/r/576647 
(https://phabricator.wikimedia.org/T233629) [09:58:24] will do 50, and then 100 [09:58:47] +1 [09:59:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give some weight to es2 master es1015 and es2016, now standalone - T246072', diff saved to https://phabricator.wikimedia.org/P10609 and previous config saved to /var/cache/conftool/dbconfig/20200304-095919-marostegui.json [09:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:33] es1011 lag alert cleared [09:59:48] same for es1013 [10:00:09] (03CR) 10Volans: [C: 03+1] "Ship it!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576464 (owner: 10RLazarus) [10:00:21] all codfw hosts also clear [10:00:35] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:01:10] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove system::role from role::prometheus::ops [puppet] - 10https://gerrit.wikimedia.org/r/576634 (owner: 10Muehlenhoff) [10:01:40] (03PS1) 10Elukey: elasticsearch: return the cluster name in __str__ for ElasticsearchCluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/576650 [10:03:56] (03CR) 10Volans: [C: 04-1] "I think 2 values got swapped between them." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [10:04:01] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10elukey) a:05elukey→03Jclark-ctr Assigning to @Jclark-ctr since the info have already been filled (thanks!) 
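(Editor's sketch, not part of the channel log.) The host list confirmed at 09:54 contained es2016 twice and omitted es2015, which was only noticed at 09:56 after the command had run. A small pre-flight check along these lines could flag that kind of slip before execution. The function name and the "expected" list are assumptions made for illustration; the expected hosts are simply the corrected list implied by the conversation, not an authoritative inventory.

```python
from collections import Counter
from typing import Dict, Iterable, List


def check_host_list(typed: Iterable[str], expected: Iterable[str]) -> Dict[str, List[str]]:
    """Compare a hand-typed host list against the expected one.

    Returns the hosts that are repeated, missing, or unexpected,
    each as a sorted list so the output is stable.
    """
    counts = Counter(typed)
    expected_set = set(expected)
    return {
        "repeated": sorted(h for h, n in counts.items() if n > 1),
        "missing": sorted(expected_set - set(counts)),
        "unexpected": sorted(set(counts) - expected_set),
    }


# The list as pasted in the channel, and the corrected list (assumed).
TYPED = ["es1011", "es1013", "es2016", "es2014", "es2016"]
EXPECTED = ["es1011", "es1013", "es2014", "es2015", "es2016"]
```

With the lists above, `check_host_list(TYPED, EXPECTED)` reports es2016 as repeated and es2015 as missing, which matches the correction made in the channel.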
[10:04:17] (03CR) 10Muehlenhoff: [C: 03+2] Remove system::role from role::prometheus::ops [puppet] - 10https://gerrit.wikimedia.org/r/576634 (owner: 10Muehlenhoff) [10:04:51] jynus: I think we are good [10:04:58] I am going to check zarcillo and tendril to see what'd need update [10:05:07] I think nothing on tendril [10:05:17] but maybe a master on zarcillo or something [10:05:31] I will check [10:05:34] Going to close the window then [10:05:52] !log es2 maintenance window over T246072 [10:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:57] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [10:06:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/21258/" [puppet] - 10https://gerrit.wikimedia.org/r/576643 (https://phabricator.wikimedia.org/T246671) (owner: 10Arturo Borrero Gonzalez) [10:06:18] Thanks for your help :) [10:06:36] maybe removing that one line from "masters" table? 
[10:06:53] but no big deal [10:07:03] yeah, going to check how is it with es1 [10:07:38] I think if there is no master, it is reported as a replica for prometheus [10:08:05] 2 lines actually [10:08:14] one for eqiad and one for codfw [10:08:49] on tendril we just need to change the display to 0 [10:08:52] on the shards table [10:10:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/576642 (owner: 10Volans) [10:10:45] (03CR) 10Vgutierrez: [C: 04-1] netbox: Add framework for exposing scripts to internal services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [10:10:47] !log Update shards table to set es2 display=0 - T246072 [10:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:44] (03PS3) 10Jbond: role::lists: use mod_cgid on buster instead for mod_cgi [puppet] - 10https://gerrit.wikimedia.org/r/576333 (https://phabricator.wikimedia.org/T242910) [10:12:23] (03PS2) 10Jcrespo: prometheus-mysqld-exporter: Workaround upstream package regression [puppet] - 10https://gerrit.wikimedia.org/r/576398 (https://phabricator.wikimedia.org/T242702) [10:12:25] (03PS1) 10Jcrespo: prometheus-mysqld-exporter: Add es2 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576651 (https://phabricator.wikimedia.org/T246072) [10:12:42] (03CR) 10Volans: [C: 03+2] debmonitor: make icinga check desc unique [puppet] - 10https://gerrit.wikimedia.org/r/576642 (owner: 10Volans) [10:12:59] (03CR) 10Vgutierrez: [C: 04-1] netbox: Add framework for exposing scripts to internal services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [10:13:02] marostegui: it is hardcoded for now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/576651 [10:13:19] but we can think a way of putting it on the db at a later time [10:13:33] (03CR) 10Jbond: 
"> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576333 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [10:16:56] (03PS2) 10Jcrespo: prometheus-mysqld-exporter: Add es2 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576651 (https://phabricator.wikimedia.org/T246072) [10:17:25] jynus: Ah I see, yeah, we can do that later indeed [10:17:27] (03PS2) 10Elukey: elasticsearch: return the cluster name in __str__ for ElasticsearchCluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/576650 [10:17:44] (03CR) 10Marostegui: [C: 03+1] prometheus-mysqld-exporter: Add es2 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576651 (https://phabricator.wikimedia.org/T246072) (owner: 10Jcrespo) [10:17:53] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] "builds as expected on boron" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/576553 (owner: 10Vgutierrez) [10:18:12] marostegui: I already sent a patch [10:18:20] for es2 and another for es3 [10:18:40] it doesn't affect monitoring, just the classification on the drop down menus [10:18:46] so very low priority [10:18:59] jynus: and already +1ed! :) [10:19:00] (03PS1) 10Jcrespo: prometheus-mysqld-exporter: Add es3 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576655 (https://phabricator.wikimedia.org/T246072) [10:19:09] ^that for tomorrow?? :-D [10:19:17] haha no way! 
[10:19:33] well, when you want it, it is already there [10:19:35] I was thinking about doing es5 tuesday and es3 wednesday next week, just like today [10:19:45] ok to me [10:19:50] I will create invites and all that [10:19:57] going to clean up masters table on zarcillo [10:20:05] I can see a small unseen issue on obscure wikis nobody edits [10:20:10] *happening [10:20:15] not see it at the moment [10:20:18] so ok to wait [10:20:39] !log Remove es2 eqiad and codfw from zarcillo.masters table - T246072 [10:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:44] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [10:21:58] (03CR) 10Volans: [C: 03+1] "From a quick chat with elukey and seeing the output seems sane to me, unless the info() call is for some reason expensive (I hope not)." [software/spicerack] - 10https://gerrit.wikimedia.org/r/576650 (owner: 10Elukey) [10:22:00] (03PS3) 10Jcrespo: prometheus-mysqld-exporter: Add es2 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576651 (https://phabricator.wikimedia.org/T246072) [10:22:02] (03CR) 10Muehlenhoff: "The fix looks fine in general, but I think we can do it much simpler/cleaner, see comments inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [10:23:27] (03CR) 10Jcrespo: [C: 03+2] prometheus-mysqld-exporter: Add es2 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576651 (https://phabricator.wikimedia.org/T246072) (owner: 10Jcrespo) [10:23:37] (03PS4) 10Jcrespo: prometheus-mysqld-exporter: Add es2 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576651 (https://phabricator.wikimedia.org/T246072) [10:29:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me (running this against prod via Cumin will probably show a few more edge cases until it can be deployed anyway, but looks " [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [10:32:16] (03PS2) 10Jcrespo: prometheus-mysqld-exporter: Add es3 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576655 (https://phabricator.wikimedia.org/T246072) [10:32:18] (03PS1) 10Jbond: gradle: upgrade cas software to cas 6.1.5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576656 [10:33:00] (03CR) 10Jcrespo: [C: 04-2] "Not before next week, after deploy." [puppet] - 10https://gerrit.wikimedia.org/r/576655 (https://phabricator.wikimedia.org/T246072) (owner: 10Jcrespo) [10:33:15] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [10:33:34] 10Operations, 10Packaging, 10puppet-compiler, 10User-jbond: PCC always has an ERROR when compiling for servers with profile::redis::slave - https://phabricator.wikimedia.org/T228266 (10hashar) [10:33:41] marostegui: is your slot all done? [10:33:41] that's me, fixing, sorry about that [10:33:45] addshore: yep! [10:33:47] (the icinga one) [10:33:49] great! 
[10:34:01] (03CR) 10Addshore: [C: 03+2] Read from the new term store up to Q25 mill everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576638 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [10:34:04] jouncebot: now [10:34:04] For the next 0 hour(s) and 25 minute(s): es2 database read-only deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T0900) [10:35:04] (03Merged) 10jenkins-bot: Read from the new term store up to Q25 mill everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576638 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [10:36:51] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q25M for the new term store everywhere (was Q20M) + warm db1126 & db1111 caches (T219123) (duration: 01m 05s) [10:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:56] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [10:38:03] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q25M for the new term store everywhere (was Q20M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 01m 04s) [10:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:53] !log upload trafficserver 8.0.6-1wm1 to apt.wm.o (buster) [10:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:14] (03PS1) 10Volans: icinga: add check_https_redirect missing command [puppet] - 10https://gerrit.wikimedia.org/r/576657 [10:41:46] !log START warm cache for db1111 & db1126 for Q25-30 million T219123 (pass 1) [10:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/576657 (owner: 10Volans) [10:42:40] (03CR) 10Volans: [C: 03+2] icinga: add check_https_redirect missing 
command [puppet] - 10https://gerrit.wikimedia.org/r/576657 (owner: 10Volans) [10:42:47] let's see if I fix ti :D [10:42:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: remove special configuration for kubernetes on proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/576469 (https://phabricator.wikimedia.org/T214513) (owner: 10Bstorm) [10:44:05] (03PS5) 10Arturo Borrero Gonzalez: toolforge: remove special configuration for kubernetes on proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/576469 (https://phabricator.wikimedia.org/T214513) (owner: 10Bstorm) [10:49:50] ACKNOWLEDGEMENT - ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - commonswiki_content_1582399079(65gb) Elukey T246882 https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [10:52:08] !log upgrading ATS to version 8.0.6 on ulsfo [10:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:49] (03PS1) 10DCausse: [cirrus] use 2 shards for commonswiki_content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576659 (https://phabricator.wikimedia.org/T246882) [10:54:48] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [10:56:31] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-pol-szl] - 10https://gerrit.wikimedia.org/r/576628 (https://phabricator.wikimedia.org/T202276) (owner: 10KartikMistry) [10:56:34] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 400 Bad Request https://wikitech.wikimedia.org/wiki/Debmonitor [10:56:41] (03CR) 10jerkins-bot: [V: 04-1] Add apertium-pol-szl package [debs/contenttranslation/apertium-pol-szl] - 10https://gerrit.wikimedia.org/r/576628 (https://phabricator.wikimedia.org/T202276) (owner: 10KartikMistry) [10:57:23] (03PS1) 10Addshore: Write to 
the new terms store up to Q 86 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576660 (https://phabricator.wikimedia.org/T219123) [10:57:27] jouncebot: now [10:57:28] For the next 0 hour(s) and 2 minute(s): es2 database read-only deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T0900) [10:57:31] jouncebot: next [10:57:31] In 1 hour(s) and 2 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T1200) [10:57:42] (03CR) 10Addshore: [C: 03+2] Write to the new terms store up to Q 86 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576660 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [10:58:05] * volans checking the new debmonitor check [10:58:44] (03Merged) 10jenkins-bot: Write to the new terms store up to Q 86 million [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576660 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [10:59:15] (03PS2) 10DCausse: [cirrus] use 2 shards for commonswiki_content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576659 (https://phabricator.wikimedia.org/T246882) [11:00:31] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Write to new term store up to Q86 million, was 84 (T219123) (duration: 01m 04s) [11:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:37] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [11:01:40] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Write to new term store up to Q86 million, was 84 (T219123) cache bust (duration: 01m 03s) [11:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:05] * addshore is done for now [11:05:17] (03PS2) 10KartikMistry: Add apertium-pol-szl package [debs/contenttranslation/apertium-pol-szl] - 10https://gerrit.wikimedia.org/r/576628 
(https://phabricator.wikimedia.org/T202276) [11:06:08] (03PS1) 10Arturo Borrero Gonzalez: role: openstack: eqiad1: net: cleanup old comments [puppet] - 10https://gerrit.wikimedia.org/r/576661 [11:06:55] (03PS2) 10Jbond: gradle: upgrade cas software to cas 6.1.5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576656 [11:08:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] role: openstack: eqiad1: net: cleanup old comments [puppet] - 10https://gerrit.wikimedia.org/r/576661 (owner: 10Arturo Borrero Gonzalez) [11:09:09] (03CR) 10jerkins-bot: [V: 04-1] Add apertium-pol-szl package [debs/contenttranslation/apertium-pol-szl] - 10https://gerrit.wikimedia.org/r/576628 (https://phabricator.wikimedia.org/T202276) (owner: 10KartikMistry) [11:12:09] (03PS1) 10Volans: icinga: allow to specify port for https redirects [puppet] - 10https://gerrit.wikimedia.org/r/576662 [11:13:28] (03CR) 10jerkins-bot: [V: 04-1] icinga: allow to specify port for https redirects [puppet] - 10https://gerrit.wikimedia.org/r/576662 (owner: 10Volans) [11:13:46] (03PS1) 10Jbond: cli: filter the hosts array to remove empty elements [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 [11:14:12] (03CR) 10jerkins-bot: [V: 04-1] cli: filter the hosts array to remove empty elements [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 (owner: 10Jbond) [11:14:23] (03PS2) 10Volans: icinga: allow to specify port for https redirects [puppet] - 10https://gerrit.wikimedia.org/r/576662 [11:15:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/576662 (owner: 10Volans) [11:16:25] (03PS3) 10Muehlenhoff: gradle: upgrade cas software to cas 6.1.5 and Tomcat 9.0.31 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576656 (owner: 10Jbond) [11:16:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576656 (owner: 10Jbond) [11:17:29] (03CR) 10Volans: [C: 03+2] 
icinga: allow to specify port for https redirects [puppet] - 10https://gerrit.wikimedia.org/r/576662 (owner: 10Volans) [11:17:56] (03PS4) 10Jbond: gradle: upgrade cas software to cas 6.1.5 and tomcat to 9.0.31 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576656 [11:18:48] (03CR) 10Jbond: [V: 03+2 C: 03+2] gradle: upgrade cas software to cas 6.1.5 and tomcat to 9.0.31 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576656 (owner: 10Jbond) [11:19:51] !log upgrading ATS to version 8.0.6 on eqsin [11:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:32] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 400 Bad Request https://wikitech.wikimedia.org/wiki/Debmonitor [11:20:59] yeah yeah I know icinga, if you weren't so slow to run puppet you'd already got the fix ;) [11:21:05] lol [11:21:33] (03PS1) 10KartikMistry: Apertium: Update to new upstream release 3.6.1 [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/576664 (https://phabricator.wikimedia.org/T234182) [11:21:42] Notice: Applied catalog in 60.15 seconds [11:22:59] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 505 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [11:23:12] (03Abandoned) 10Muehlenhoff: Add profile::prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/572892 (owner: 10Muehlenhoff) [11:23:46] 10Operations, 10fundraising-tech-ops, 10netops: DHCP routing issue with civi2001 - https://phabricator.wikimedia.org/T246812 (10ayounsi) Configuration looks good on the switch and router. This will need to be live troubleshooted, when the client is sending DHCP requests. First step is to run `show ethernet... 
[11:31:10] (03CR) 10jerkins-bot: [V: 04-1] Apertium: Update to new upstream release 3.6.1 [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/576664 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [11:31:11] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 505 bytes in 0.239 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [11:35:42] (03PS2) 10KartikMistry: Apertium: Update to new upstream release 3.6.1 [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/576664 (https://phabricator.wikimedia.org/T234182) [11:42:30] (03CR) 10jerkins-bot: [V: 04-1] Apertium: Update to new upstream release 3.6.1 [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/576664 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [11:44:36] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: introduce filtering for neutron BGP addresses - https://phabricator.wikimedia.org/T246887 (10aborrero) [11:51:38] (03CR) 10Jbond: [C: 03+1] Add cas-server-core-util to Gradle dependencies [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576387 (owner: 10Muehlenhoff) [11:51:57] !log START warm cache for db1111 & db1126 for Q25-30 million T219123 (pass 2) [11:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:02] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [11:53:15] (03PS1) 10Muehlenhoff: Bump meta package for new ABI in 4.9.210 for jessie [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/576796 [11:55:50] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump meta package for new ABI in 4.9.210 for jessie [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/576796 (owner: 10Muehlenhoff) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 
patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T1200). Please do the needful. [12:00:05] Urbanecm and kostajh: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:14] I can SWAT today! [12:00:31] (03CR) 10Urbanecm: [C: 03+2] Add throttle exempt for 2020-03-07 GenderGap Event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576437 (https://phabricator.wikimedia.org/T246813) (owner: 10RhinosF1) [12:01:04] kostajh: you around? [12:01:06] hi Urbanecm [12:01:30] (03Merged) 10jenkins-bot: Add throttle exempt for 2020-03-07 GenderGap Event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576437 (https://phabricator.wikimedia.org/T246813) (owner: 10RhinosF1) [12:02:22] kostajh: +2'ed your backport, will ping you once it's ready [12:04:08] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 85a5c05: Add throttle exempt for 2020-03-07 GenderGap Event (T246813) (duration: 01m 05s) [12:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:15] T246813: Add throttle exception for Gendergap event - Utrecht writing day on 2020-03-07 - https://phabricator.wikimedia.org/T246813 [12:06:52] (03PS2) 10Jbond: cli: filter the hosts array to remove empty elements [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 [12:07:25] (03PS4) 10Urbanecm: IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) (owner: 10RhinosF1) [12:07:26] Urbanecm: thx [12:08:09] (03CR) 10Zoranzoki21: [C: 04-1] IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) (owner: 10RhinosF1) [12:08:37] (03PS5) 
10Urbanecm: IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) (owner: 10RhinosF1) [12:08:57] (03CR) 10Urbanecm: IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) (owner: 10RhinosF1) [12:10:20] (03CR) 10Zoranzoki21: [C: 03+1] "Looks good for me." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) (owner: 10RhinosF1) [12:10:36] (03CR) 10Urbanecm: [C: 03+2] IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) (owner: 10RhinosF1) [12:11:48] (03Merged) 10jenkins-bot: IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) (owner: 10RhinosF1) [12:11:52] !log imported linux-meta 1.23 to apt.wikimedia.org for jessie-wikimedia [12:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:34] (03PS3) 10KartikMistry: Apertium: Update to new upstream release 3.6.1 [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/576664 (https://phabricator.wikimedia.org/T234182) [12:14:30] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 1fa9dda: IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) (T246832) (duration: 01m 06s) [12:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:34] T246832: Temporary lift of IP cap for 134.155.* for 2020-04-01 - https://phabricator.wikimedia.org/T246832 [12:16:09] kostajh: ready for testing at mwdebug1001 [12:16:17] Urbanecm: thx, looking [12:17:01] Urbanecm: all good, thanks 
[12:17:34] ack, syncing [12:19:00] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/GrowthExperiments/includes/HelpPanel/QuestionStore.php: SWAT: d495f4c: Replace loadRevisionFromId which has been removed in I0c8fe834da79c (duration: 01m 06s) [12:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:06] kostajh: done. Anything else? :-) [12:19:07] !log installing 4.9.210-1~deb8u1 kernel on jessie hosts (no reboots, just the upgrade) [12:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:16] (03CR) 10jerkins-bot: [V: 04-1] Apertium: Update to new upstream release 3.6.1 [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/576664 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [12:20:31] Urbanecm: that's it for today, thanks! [12:20:37] happy to help! [12:20:41] !log EU SWAT done [12:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:59] 10Operations, 10ops-eqiad, 10DC-Ops: audit/rebalance power in a5-eqiad - https://phabricator.wikimedia.org/T245655 (10ayounsi) Done! Those two are alerting as well: * https://librenms.wikimedia.org/graphs/to=1583324100/id=9043/type=sensor_power/from=1583237700/ * https://librenms.wikimedia.org/graphs/to=158... [12:21:40] (03CR) 10Volans: [C: 04-1] "LGTM, one detail to be fixed, see inline, and I think it's good to go from my side." (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [12:23:24] !log add flowspec rule on cr3-knams - T243482 [12:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:30] (03CR) 10Hnowlan: "Happy to go ahead with this once I figure out how to deploy mediawiki-config :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [12:27:39] (03CR) 10Hashar: "sounds good. 
While at it you could add a test to puppet_compiler/tests/test_controller.py :]" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 (owner: 10Jbond) [12:29:04] (03PS4) 10KartikMistry: Apertium: Update to new upstream release 3.6.1 [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/576664 (https://phabricator.wikimedia.org/T234182) [12:29:42] (03PS1) 10Jbond: base::check_puppetrun: fix comparison error [puppet] - 10https://gerrit.wikimedia.org/r/576816 [12:30:48] (03CR) 10Jbond: "implemented and tested on cloudcontrol2003-dev" [puppet] - 10https://gerrit.wikimedia.org/r/576816 (owner: 10Jbond) [12:31:15] _joe_: ^^ fix to the puppet check [12:33:35] (03PS1) 10KartikMistry: lttoolbox: Update to new upstream release 3.5.1 [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/576817 (https://phabricator.wikimedia.org/T234182) [12:37:19] (03CR) 10Hnowlan: "This change isn't going to be useful until T246389 is done." [puppet] - 10https://gerrit.wikimedia.org/r/576301 (https://phabricator.wikimedia.org/T243096) (owner: 10Hnowlan) [12:38:09] (03CR) 10jerkins-bot: [V: 04-1] lttoolbox: Update to new upstream release 3.5.1 [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/576817 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [12:38:33] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add cas-server-core-util to Gradle dependencies [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576387 (owner: 10Muehlenhoff) [12:39:41] (03PS3) 10Jbond: cli: filter the hosts array to remove empty elements [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 [12:40:10] (03CR) 10Jbond: "> Patch Set 2:" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 (owner: 10Jbond) [12:40:45] (03PS2) 10KartikMistry: lttoolbox: Update to new upstream release 3.5.1 [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/576817 
(https://phabricator.wikimedia.org/T234182) [12:42:23] <_joe_> jbond42: <3 [12:42:47] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/576816 (owner: 10Jbond) [12:43:09] <_joe_> now when you merge it and 500 servers fire up, it's probably my fault [12:43:21] (03CR) 10Jbond: [C: 03+2] base::check_puppetrun: fix comparison error [puppet] - 10https://gerrit.wikimedia.org/r/576816 (owner: 10Jbond) [12:44:00] _joe_: well, the currently disabled ones won't get the fix until they are enabled; I don't think there is an easy way around that unless I push the fix out manually with cumin [12:45:29] (03PS5) 10KartikMistry: hfst: New upstream release 3.15.1 [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697) [12:46:30] (03Abandoned) 10Jbond: puppet-merge: possible idea to add some atomic behavior to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/524331 (https://phabricator.wikimedia.org/T221529) (owner: 10Jbond) [12:50:17] (03PS1) 10Urbanecm: Add new throttle rule for WikiGap Göteborg 2020-03-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576820 (https://phabricator.wikimedia.org/T246888) [12:52:02] (03CR) 10Urbanecm: [C: 03+2] "last time throttle rule" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576820 (https://phabricator.wikimedia.org/T246888) (owner: 10Urbanecm) [12:53:28] (03Merged) 10jenkins-bot: Add new throttle rule for WikiGap Göteborg 2020-03-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576820 (https://phabricator.wikimedia.org/T246888) (owner: 10Urbanecm) [12:55:37] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: 37db2a1: Add new throttle rule for WikiGap Göteborg 2020-03-06 (T246888) (duration: 01m 04s) [12:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:42] T246888: Temporary lift of IP cap for WikiGap Göteborg 2020-03-06 - https://phabricator.wikimedia.org/T246888 
[12:55:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: Clean cloud-puppetmaster hiera up and catch up with reality [puppet] - 10https://gerrit.wikimedia.org/r/576450 (https://phabricator.wikimedia.org/T235218) (owner: 10Alex Monk) [12:59:41] (03CR) 10jerkins-bot: [V: 04-1] hfst: New upstream release 3.15.1 [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697) (owner: 10KartikMistry) [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T1300) [13:06:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Switch eventgate-analytics-external to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/576646 (https://phabricator.wikimedia.org/T233629) (owner: 10Alexandros Kosiaris) [13:07:47] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: introduce filtering for neutron BGP addresses - https://phabricator.wikimedia.org/T246887 (10aborrero) This is my initial proposal: `lang=diff diff -u router.org router.new --color --- router.org 2020-03-04 12:45:50.139827978 +0100 +++ route... 
[13:09:55] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: introduce filtering for neutron BGP addresses - https://phabricator.wikimedia.org/T246887 (10aborrero) a:03aborrero [13:14:09] !log Drop fixcopyrightwiki from sanitarium hosts (db1112, db2074) to avoid getting the data alert - T246055 [13:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:14] T246055: Drop DB tables for now-deleted fixcopyrightwiki from production - https://phabricator.wikimedia.org/T246055 [13:19:48] !log upgrading ATS to version 8.0.6 on esams [13:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:01] RECOVERY - Check systemd state on db1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:21] (03PS1) 10Alexandros Kosiaris: changeprop: Add nutcracker sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/576827 (https://phabricator.wikimedia.org/T213193) [13:29:20] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Gehel) 05Stalled→03Declined We are in the process of significantly changing the architecture of WDQS. We will address a better def... [13:30:36] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473 (10Gehel) Some work has been done to standardize SSL termination around envoy. I'm not sure if that has been applied to WDQS. We need to check, but this mi... 
[13:33:58] !log disable puppet on install1002 to test partman on theemin [13:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:24] (03PS6) 10KartikMistry: hfst: New upstream release 3.15.1 [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697) [13:38:57] (03CR) 10Elukey: [C: 03+2] Ensure readability settings for home dirs of Analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [13:44:17] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:45:52] (03CR) 10Elukey: [C: 03+2] "PEBCAK: I changed the name of the script and forgot to re-include it, sigh" [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [13:46:41] (03PS1) 10Alexandros Kosiaris: nutcracker: Add entrypoint and user directives [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/576829 (https://phabricator.wikimedia.org/T213193) [13:46:57] jouncebot: now [13:46:57] For the next 0 hour(s) and 13 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T1300) [13:47:10] jouncebot: next [13:47:10] In 0 hour(s) and 12 minute(s): Mediawiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T1400) [13:47:31] !log START warm cache for db1111 & db1126 for Q25-30 million T219123 (pass 3) [13:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:36] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [13:47:39] (03PS2) 10Alexandros Kosiaris: nutcracker: Add entrypoint and user directives 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/576829 (https://phabricator.wikimedia.org/T213193) [13:47:58] nutcracker in docker? /o\ [13:47:58] (03PS1) 10Elukey: profile::analytics::client: add missing script [puppet] - 10https://gerrit.wikimedia.org/r/576831 [13:48:20] (I am joking I know it is needed :D) [13:48:31] (03PS2) 10Alexandros Kosiaris: changeprop: Add nutcracker sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/576827 (https://phabricator.wikimedia.org/T213193) [13:48:48] (03CR) 10Elukey: [C: 03+2] profile::analytics::client: add missing script [puppet] - 10https://gerrit.wikimedia.org/r/576831 (owner: 10Elukey) [13:49:02] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] nutcracker: Add entrypoint and user directives [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/576829 (https://phabricator.wikimedia.org/T213193) (owner: 10Alexandros Kosiaris) [13:55:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] "If this works, fine by me" [puppet] - 10https://gerrit.wikimedia.org/r/574524 (https://phabricator.wikimedia.org/T246017) (owner: 10BryanDavis) [13:58:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] Set eventgate-*-to-delete LVS services to state: service_setup [puppet] - 10https://gerrit.wikimedia.org/r/576402 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [13:59:39] (03PS1) 10KartikMistry: cg3: Update to new upstream release 1.3.1 [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/576833 (https://phabricator.wikimedia.org/T234182) [14:00:03] (03CR) 10jerkins-bot: [V: 04-1] cg3: Update to new upstream release 1.3.1 [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/576833 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [14:00:04] liw and Brennen: Time to snap out of that daydream and deploy Mediawiki train - European+American Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T1400). 
[14:01:14] (03PS2) 10Alexandros Kosiaris: Set eventgate-*-to-delete LVS services to state: service_setup [puppet] - 10https://gerrit.wikimedia.org/r/576402 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [14:01:49] (03CR) 10jerkins-bot: [V: 04-1] Set eventgate-*-to-delete LVS services to state: service_setup [puppet] - 10https://gerrit.wikimedia.org/r/576402 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [14:03:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] "14:01:40 + git pull --quiet zuul production" [puppet] - 10https://gerrit.wikimedia.org/r/576402 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [14:03:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/576402 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [14:03:35] (03PS1) 10Lars Wirzenius: group1 wikis to 1.35.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576835 [14:03:38] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.35.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576835 (owner: 10Lars Wirzenius) [14:03:41] (03PS1) 10Elukey: Revert "profile::analytics::client: add missing script" [puppet] - 10https://gerrit.wikimedia.org/r/576836 [14:03:55] (03PS2) 10KartikMistry: cg3: Update to new upstream release 1.3.1 [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/576833 (https://phabricator.wikimedia.org/T234182) [14:04:01] (03CR) 10Elukey: [C: 03+2] Revert "profile::analytics::client: add missing script" [puppet] - 10https://gerrit.wikimedia.org/r/576836 (owner: 10Elukey) [14:05:00] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576835 (owner: 10Lars Wirzenius) [14:05:19] (03PS1) 10Elukey: Revert "Ensure readability settings for home dirs of Analytics clients" [puppet] - 10https://gerrit.wikimedia.org/r/576837 [14:05:40] (03CR) 10Elukey: [C: 03+2] Revert "Ensure readability settings for 
home dirs of Analytics clients" [puppet] - 10https://gerrit.wikimedia.org/r/576837 (owner: 10Elukey) [14:05:54] * elukey plays sad_trombone.wav [14:07:16] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.22 [14:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:21] !log liw@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.22 (duration: 01m 04s) [14:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:28] (03CR) 10jerkins-bot: [V: 04-1] cg3: Update to new upstream release 1.3.1 [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/576833 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [14:10:41] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db1103 [puppet] - 10https://gerrit.wikimedia.org/r/576590 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [14:10:44] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.42:4192, 10.2.1.45:4292, 10.2.1.45:32192, 10.2.1.42:31192]) https://wikitech.wikimedia.org/wiki/PyBal [14:11:28] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.45:4292, 10.2.2.45:32192]) https://wikitech.wikimedia.org/wiki/PyBal [14:11:40] marostegui: I'm on install1002 with puppet disabled, LMK if that's blocking you re: db1103 [14:11:48] akosiaris: is that you? [14:11:50] godog: nope, no problem [14:12:11] marostegui: ack! [14:12:16] marostegui: I noticed the load on db1126 shoot up a bit after that train sync fyi [14:12:18] * addshore is watching [14:12:22] hm, there's an immediate spike in DB related errors in logstash after I promoted group1 [14:12:26] yeah I am watching it too, addshore [14:12:44] addshore: Wasn't that supposed to decrease today? 
:) [14:12:53] vgutierrez: yeah cleaning up eventgate services for ottomata [14:12:56] liw: yeah, probably related to db1126, I am checking [14:12:57] ack [14:13:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:13:06] marostegui: we didnt get to that bit yet [14:13:24] marostegui, thank you for checking, I'll hold off on filing an issue or rolling back the train [14:13:25] !log cache warming stopped on db1126 and db1111 [14:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:37] addshore: lots of deadlocks on the master: Query: INSERT IGNORE INTO `wbt_text` (wbx_text) VALUES ('Helen Margery Acton-Adams') [14:13:37] Function: Wikibase\Lib\Store\Sql\Terms\Util\ReplicaMasterAwareRecordIdsAcquirer::insertNonExistingRecordsIntoMaster [14:14:00] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.42:4192, 10.2.1.45:4292, 10.2.1.45:32192, 10.2.1.42:31192]) https://wikitech.wikimedia.org/wiki/PyBal [14:14:02] right, i think we should rollback and then investigate [14:14:07] +1 [14:14:13] rtolling back just wikidatawiki would be fine too [14:14:16] *rolling back [14:14:19] liw: ^ [14:15:27] marostegui, ack; I haven't done that before, looking for instructions now [14:15:53] liw: I would say that whatever is faster to get us back to a normal state, but up to you [14:15:59] yupp! [14:16:26] I see instructions for rolling back everything, so I'll do that [14:16:32] ack! [14:16:48] errors gone now? 
[14:17:17] no [14:17:18] I see the process list on the master decreasing in size [14:18:05] I see them gone [14:18:24] scap sync-versions still running [14:18:32] last one was around 2 minutes ago from what I can see [14:18:52] !log cleanup old LVS eventgate services. T245203 [14:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:56] T245203: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 [14:19:02] addshore: any idea what could've created that spike? [14:19:05] it stopped around :15 [14:19:06] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:19:08] !log liw@deploy1001 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.35.0-wmf.21 [14:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:03] marostegui: can the only thing I know we changed in this branch is moving some cleanup logic from a post request data update into a job [14:20:24] (03PS1) 10Lars Wirzenius: Revert "group1 wikis to 1.35.0-wmf.22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576839 [14:20:27] (03CR) 10Lars Wirzenius: [C: 03+2] Revert "group1 wikis to 1.35.0-wmf.22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576839 (owner: 10Lars Wirzenius) [14:20:36] That could have resulted in the jobs queueing up to start with and then more running at once? let me dig in grafana [14:21:24] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.35.0-wmf.22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576839 (owner: 10Lars Wirzenius) [14:21:58] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:22:02] addshore, marostegui, is there a known phab task for this issue that I can add as a train blocker? 
[14:22:20] liw: I don't know that :( [14:22:45] not that i know of for this specific case [14:22:47] I'll file a new one then, better to have two than none [14:23:49] thanks! [14:24:23] (03PS2) 10Filippo Giunchedi: install_server: hwraid-1dev partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/574661 (https://phabricator.wikimedia.org/T156955) [14:24:25] (03PS1) 10Filippo Giunchedi: install_server: use buster for theemin [puppet] - 10https://gerrit.wikimedia.org/r/576840 (https://phabricator.wikimedia.org/T215301) [14:24:50] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:25:32] !log upgrading ATS to version 8.0.6 on codfw [14:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:51] https://phabricator.wikimedia.org/T246898 [14:28:25] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) I tried to add a script that periodically enforces proper home dir permissions, but I had to revert since the admin module is of course already... [14:28:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:29:36] I note that logstash quieted down after the rollback [14:31:26] ty [14:32:54] addshore, marostegui, are you looking into that or should I attract more attention to it? 
I don't know what to do, so as the train conductor I feel quite helpless now [14:33:18] liw: I believe addshore is taking a look [14:33:24] yup, its on my plate [14:33:28] * addshore looks at timestamps [14:34:14] addshore, marostegui, thank you [14:34:18] https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&from=1583330381919&to=1583331632624 [14:34:33] the train happened to coincide with a large spike in edits and page creations [14:34:43] that matches yeah [14:34:48] 1.6k EPM is quite something [14:35:38] but it doesnt look like there was much more activity on the new term storage table at the time, so looking at code changes now [14:37:16] (03PS2) 10Alexandros Kosiaris: lvs: Switch eventgate-analytics-external to production [puppet] - 10https://gerrit.wikimedia.org/r/576647 (https://phabricator.wikimedia.org/T233629) [14:38:18] (03PS1) 10Elukey: admin: deprecate two old analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/576845 (https://phabricator.wikimedia.org/T246578) [14:38:19] question, are you inserting every time, even if you could know that item exists? [14:39:09] So, it could be totally unrelated, but that is a new all time high EPM for wikidata I think :P [14:39:21] because in other context that could be ok, but here, with the potential rate, plus the fact that many inserts of the same string are likely to be toghether [14:39:31] it could cause issues [14:39:32] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:49] checking --^ [14:40:00] (03PS2) 10Elukey: admin: deprecate two old analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/576845 (https://phabricator.wikimedia.org/T246578) [14:40:10] after normalizing tables the locking mechanism has to shift a bit, like it did with comment normalization work [14:41:22] Could it just be coincidence with the train? 
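The question raised here (whether every edit issues an INSERT even when the row is already known to exist) can be illustrated with a read-before-insert pattern. This is only a minimal sketch, not the Wikibase code: the function name in the deadlock report suggests Wikibase already consults replicas first. The table name `wbt_text` and column `wbx_text` come from the logged query; the `wbx_id` column, the SQLite backend, and both helper names are assumptions made for illustration.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory stand-in for the term text table (schema assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wbt_text ("
        "wbx_id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "wbx_text TEXT NOT NULL UNIQUE)"
    )
    return conn


def acquire_text_ids(conn, texts):
    """Return {text: id}, writing only the texts that are not stored yet."""
    texts = list(dict.fromkeys(texts))  # de-duplicate while keeping order
    if not texts:
        return {}
    placeholders = ",".join("?" for _ in texts)
    select_sql = (
        "SELECT wbx_text, wbx_id FROM wbt_text "
        f"WHERE wbx_text IN ({placeholders})"
    )
    # Step 1: a plain read, which takes no write locks.
    found = dict(conn.execute(select_sql, texts).fetchall())
    missing = [t for t in texts if t not in found]
    if missing:
        # Step 2: write only the rows the read did not find.
        with conn:
            conn.executemany(
                "INSERT OR IGNORE INTO wbt_text (wbx_text) VALUES (?)",
                [(t,) for t in missing],
            )
        # Step 3: re-read so rows inserted by a concurrent writer are included.
        found = dict(conn.execute(select_sql, texts).fetchall())
    return found
```

With many editors inserting the same strings close together, the read in step 1 means most calls never reach the write path at all, which is the property the question is probing for.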
[14:41:27] After all it did stop after a few minutes [14:41:28] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:32] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [14:41:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/576845 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [14:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:50] https://usercontent.irccloud-cdn.com/file/9XZfegSW/image.png [14:42:02] just happened again [14:42:08] looks like it was not the train, but instead the edit rate [14:42:11] so maybe indeed coincidence? [14:42:20] the rate is a lot lower though [14:42:48] I have a bit of train window time left if you want me to promote to group1 again [14:42:57] but at a much smaller rate [14:43:25] liw: I would be up for trying the idea [14:43:41] back in a few, bio break [14:43:51] even if the edit rate was similar [14:44:03] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:30] addshore: https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&from=now-1h&to=now [14:45:05] maybe a bot? [14:45:14] Its the quickstatements tool [14:45:14] vs: https://logstash.wikimedia.org/goto/659efb1d3d199c9f21db52a8e37c993d [14:45:44] I am not worried about errors, but the higher rate of them [14:45:57] back; shall I promote train to group1 again? 
[14:47:10] spike of edits happen every time wdqs lag is 0 [14:47:19] jynus: indeed [14:47:40] but previous batch of high edit rate didn't cause errors, or they were very few [14:47:56] liw: I would be up for trying it, worst comes to the worst we have to roll back once more [14:47:59] so I wonder if something is making the edits more taxing [14:48:07] addshore, ack, rolling forward [14:48:44] I have a hunch if it happens again we will have to split our cleanup logic up into a few more transactions, which was already on the TODO list, but I wasn't expecting to have to do it this week [14:48:44] (03PS1) 10Lars Wirzenius: group1 wikis to 1.35.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576848 [14:48:46] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.35.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576848 (owner: 10Lars Wirzenius) [14:48:58] * addshore watches [14:49:09] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Xqt) >>! In T243701#5884801, @Ladsgroup wrote: > I have been thinking about this and I think I have a s... [14:49:14] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] lvs: Switch eventgate-analytics-external to production [puppet] - 10https://gerrit.wikimedia.org/r/576647 (https://phabricator.wikimedia.org/T233629) (owner: 10Alexandros Kosiaris) [14:49:23] (03CR) 10Ottomata: "Is this missing from the LVS instructions on wikitech?" 
[puppet] - 10https://gerrit.wikimedia.org/r/576636 (https://phabricator.wikimedia.org/T233629) (owner: 10Alexandros Kosiaris) [14:49:40] jynus: another difference this time moving forward is I am not warming the caches, which would have added load to the replicas [14:50:10] extra load on the replicas would = slower query responses for the cleanup code, which I believe is what is holding the locks that we saw the data updates run into [14:50:48] * liw waits for Jenkins [14:51:42] marostegui: again, with that spike it was interesting to see db1126 load increase higher than db1111. 1111 still has a higher share of the traffic right now though? [14:52:02] addshore: I am about to get into an interview, but yes, db1126 is 350 and db1111 is 400 [14:52:16] interesting! have a fun interview! [14:53:31] * liw still waits for Jenkins [14:53:34] :D [14:54:34] (03PS3) 10Alexandros Kosiaris: changeprop: Add nutcracker sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/576827 (https://phabricator.wikimedia.org/T213193) [14:54:50] Hello, Zuul no starts tests for https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/BrickipediaExtra/+/576849/ [14:54:52] Why? [14:54:58] (03CR) 10Herron: [C: 03+1] "ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/576333 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [14:55:11] Zoranzoki21: I think you want #wikimedia-releng ! :D [14:55:43] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-analytics-external [14:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:53] addshore: Oh, right. Thanks! [14:56:24] (03CR) 10Jbond: [C: 03+2] role::lists: use mod_cgid on buster instead for mod_cgi [puppet] - 10https://gerrit.wikimedia.org/r/576333 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [14:58:50] liw: did jerkins die? 
:P [14:59:14] Looks so [14:59:47] addshore, I'm worried [14:59:52] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/576848/ by liw have +2 but gate-and-submit no works needed tests :/ [15:00:01] RIP jenkins [15:00:02] (03CR) 10CDanis: [C: 03+1] install_server: use buster for theemin [puppet] - 10https://gerrit.wikimedia.org/r/576840 (https://phabricator.wikimedia.org/T215301) (owner: 10Filippo Giunchedi) [15:00:26] (03PS4) 10Alexandros Kosiaris: changeprop: Add nutcracker sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/576827 (https://phabricator.wikimedia.org/T213193) [15:01:16] gate-and-submit-1_31 works still LMAO :) [15:02:38] * thcipriani looking [15:03:04] (03PS5) 10Alexandros Kosiaris: changeprop: Add nutcracker sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/576827 (https://phabricator.wikimedia.org/T213193) [15:03:17] (03PS1) 10Giuseppe Lavagetto: envoyproxy: support tcp fast open [puppet] - 10https://gerrit.wikimedia.org/r/576851 [15:03:19] (03PS1) 10Giuseppe Lavagetto: mediawiki::webserver: split tls proxy config out of profile [puppet] - 10https://gerrit.wikimedia.org/r/576852 [15:03:21] (03PS1) 10Giuseppe Lavagetto: mwdebug: switch tls termination from nginx to envoy [puppet] - 10https://gerrit.wikimedia.org/r/576853 [15:08:02] train window is over, but still waiting for the group1 promotion [15:08:15] :D [15:09:02] the US morning SWAT starts in two hours, so there's time [15:10:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Typo inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576431 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [15:11:50] !log restarting zuul [15:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:34] liw: any luck? 
[15:20:48] addshore, not yet; CI is being fixed [15:20:52] :D [15:24:18] 10Operations, 10Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (10RLazarus) I'm pretty sure our envoy-build expert in residence just assigned this to me, but I'm happy to give this a shot anyway. [15:25:55] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active, ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:26:25] XioNoX: ASunknown? [15:26:46] it's the NaN version of this check? [15:27:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Looks pretty fine to me, aside from a minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576485 (owner: 10RLazarus) [15:29:20] dunno [15:29:47] !log upgrading ATS to version 8.0.6 on eqiad [15:29:49] I don't see any cricital AS down on that router [15:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:09] bblack: ? ^ (see the BGP icigna alert) can be related to your change? [15:30:50] XioNoX: that's the existing as-unknown part, I'm pretty sure [15:31:04] maybe should've used a different word than 'unknown' for the new parts [15:31:27] (03PS4) 10RLazarus: httpbb: Replace apache-fast-test with httpbb in deploy_apache_change. [puppet] - 10https://gerrit.wikimedia.org/r/576485 [15:31:30] my $asn = $bgp->get( $peer, 'PeerRemoteAs' ) || "unknown"; [15:31:33] [...] [15:32:27] it's claiming that at least one BGP session, with no reported AS number, is in Active state rather than estab [15:32:31] (03CR) 10RLazarus: httpbb: Replace apache-fast-test with httpbb in deploy_apache_change. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576485 (owner: 10RLazarus) [15:32:38] yeah that's what I don't understand [15:32:55] https://www.irccloud.com/pastebin/rIotp1pj/ [15:33:20] all the peers will have a remote AS, as it's required to configure a BGP session [15:33:27] right [15:34:41] (03CR) 10Lars Wirzenius: [C: 03+2] "Try +2 again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576848 (owner: 10Lars Wirzenius) [15:34:54] jenkins seems to like it this time! [15:35:17] liw: can we maybe wait a few mins for the sync? [15:35:40] just to try and avoid an edit spike [15:35:54] addshore, I am waiting at the command prompt for jenkins to merge the changes, so yes [15:35:59] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576848 (owner: 10Lars Wirzenius) [15:36:00] bblack: running it manually with no --ascrit I get "BGP WARNING - AS58601/IPv6: Active (for 9d16h), AS58601/IPv4: Active (for 9d16h)" as expected [15:36:09] (03PS1) 10Elukey: Add kerberos credentials/config for Superset Staging [puppet] - 10https://gerrit.wikimedia.org/r/576861 (https://phabricator.wikimedia.org/T239903) [15:36:12] yeah that's what I saw on the Icinga frontend as well XioNoX [15:36:21] yeah I did the same [15:36:31] maybe a one-shot fluke in the snmp polling? [15:37:19] https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=cr1-eqsin&service=BGP+status [15:37:24] addshore, change is merge; waiting for go-ahead before it gets synced -- where can I check if there's an edit spike going on? 
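The severity rule that this BGP discussion turns on (a session that is not established is CRITICAL when its AS number is unknown or listed as critical, and a WARNING otherwise) can be sketched as follows. This is a hypothetical Python restatement, not the Perl check itself: the state strings, the function names, and the example AS numbers (taken from the documentation range) are all assumptions.

```python
SEVERITY_ORDER = ["OK", "WARNING", "CRITICAL"]


def classify_peer(asn, state, critical_asns):
    """Severity for one BGP session under the rule described in the channel."""
    if state == "Established":
        return "OK"
    if asn is None:
        # No AS number reported for the peer: treated as critical.
        return "CRITICAL"
    if asn in critical_asns:
        return "CRITICAL"
    # A known, non-critical AS that is down only warns.
    return "WARNING"


def worst_status(peers, critical_asns=frozenset()):
    """Overall check result: the most severe status among (asn, state) pairs."""
    worst = "OK"
    for asn, state in peers:
        status = classify_peer(asn, state, critical_asns)
        if SEVERITY_ORDER.index(status) > SEVERITY_ORDER.index(worst):
            worst = status
    return worst
```

Under this rule a single poll that fails to return the remote AS for a session already in Active state would be enough to flip the result from WARNING to CRITICAL, which is consistent with the "one-shot fluke" reading offered in the channel.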
[15:37:32] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: use buster for theemin [puppet] - 10https://gerrit.wikimedia.org/r/576840 (https://phabricator.wikimedia.org/T215301) (owner: 10Filippo Giunchedi) [15:37:40] seems like Icinga doesn't show it here when it goes CRITICAL -> WARNING [15:37:41] (03PS2) 10Filippo Giunchedi: install_server: use buster for theemin [puppet] - 10https://gerrit.wikimedia.org/r/576840 (https://phabricator.wikimedia.org/T215301) [15:37:43] liw: you can check https://grafana.wikimedia.org/d/000000170/wikidata-edits buit it is a bit delayed [15:37:44] yeah [15:37:50] so I thought it was still on CRITICAL [15:38:06] liw: so I'm actually waiting for https://test.wikidata.org/w/api.php?action=query&prop=revisions&titles=Main%20Page&maxlag=-1 to report over 5 maxlag due to the query service, then a sync would be good [15:38:08] so the script does all of that 'intentionally', but we could question or change its intent! [15:38:25] yeah I'll call that a temporary fluke, and see if it happen again [15:38:33] (an unknown ASN is automatically CRIT, whereas otherwise any non-critical ASN that's a real number is a warning) [15:38:45] liw: should be around 5 mins if that is okay [15:38:46] we could make unknown-asn a warning too [15:39:15] Or could be a unknown [15:39:15] addshore, sure; I'll ping in 5 mins if you haven't said go/no-go before then [15:39:26] but yeah, let's see if/when it happen again :) [15:40:02] (03CR) 10RLazarus: "> Please either test it if you have a host that matches requirements or coordinate with the next reimages to test it." 
[puppet] - 10https://gerrit.wikimedia.org/r/576464 (owner: 10RLazarus) [15:40:08] (03CR) 10RLazarus: [C: 03+2] cumin: Replace apache-fast-test with httpbb in reimage scripts [puppet] - 10https://gerrit.wikimedia.org/r/576464 (owner: 10RLazarus) [15:40:12] ok [15:41:07] (03CR) 10Elukey: [C: 03+2] Add kerberos credentials/config for Superset Staging [puppet] - 10https://gerrit.wikimedia.org/r/576861 (https://phabricator.wikimedia.org/T239903) (owner: 10Elukey) [15:41:29] elukey: okay to merge yours? [15:41:35] rlazarus: thanks! [15:42:34] done 👍 [15:42:56] gooood [15:44:01] (03CR) 10Herron: [C: 03+1] "LGTM! obligatory optional nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576639 (owner: 10Muehlenhoff) [15:44:27] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) @chasemp I've created an administrator account for you and you should've received a password reset email. Please sign in at console.j... 
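The maxlag probe mentioned in the channel (an API request with `maxlag=-1`, which makes the API answer with a maxlag error carrying the current lag) could be read programmatically along these lines. A minimal sketch, assuming the standard MediaWiki maxlag error body with `code`, `lag`, and `type` fields; the function names, the sample payload, and the 5-second threshold wiring are illustrative, and fetching the response is left out.

```python
import json


def replication_lag(api_response_text):
    """Return (lag_seconds, lag_type) from a maxlag error body, else None."""
    body = json.loads(api_response_text)
    error = body.get("error", {})
    if error.get("code") != "maxlag":
        return None
    return float(error["lag"]), error.get("type")


def sync_window_open(api_response_text, threshold=5.0):
    """True when reported lag exceeds the threshold, i.e. bots should be backing off."""
    result = replication_lag(api_response_text)
    return result is not None and result[0] > threshold


# Made-up sample payload in the documented maxlag error shape.
SAMPLE = json.dumps(
    {"error": {"code": "maxlag", "info": "Waiting for a host: 6.2 seconds lagged.",
               "lag": 6.2, "type": "db"}}
)
```

The reasoning in the channel is that well-behaved tools pause while maxlag is above 5, so a sync during that window is less likely to coincide with an edit spike.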
[15:44:45] liw: lets try now [15:44:53] addshore, ack [15:45:07] syncing in progress [15:45:29] * addshore watches "all the graphs" [15:45:57] (03CR) 10RLazarus: "Updated PCC: https://puppet-compiler.wmflabs.org/compiler1002/21263/" [puppet] - 10https://gerrit.wikimedia.org/r/576485 (owner: 10RLazarus) [15:46:06] (03PS1) 10Elukey: Add Presto Kerberos settings to Superset Staging [puppet] - 10https://gerrit.wikimedia.org/r/576863 (https://phabricator.wikimedia.org/T239903) [15:46:12] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.22 [15:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:34] * addshore keeps watching [15:47:16] !log liw@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.22 (duration: 01m 03s) [15:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:26] (03PS3) 10Volans: scripts: add decommission device script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576461 (https://phabricator.wikimedia.org/T244315) [15:48:09] (03CR) 10Elukey: [C: 03+2] Add Presto Kerberos settings to Superset Staging [puppet] - 10https://gerrit.wikimedia.org/r/576863 (https://phabricator.wikimedia.org/T239903) (owner: 10Elukey) [15:48:17] (03CR) 10Vgutierrez: [C: 03+1] cache: map logstash-next.wikimedia.org and cas-logstash to kibana-next lvs [puppet] - 10https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:48:48] liw: looking good [15:49:15] the deadlocks seem to be unrelated to the train, and instead related to these edit spikes [15:49:23] next ask, figure out how to control those spikes.. 
[15:49:26] *task [15:53:08] (03PS1) 10Ottomata: Remove unsed eventgate*-to-delete LVS service declarations [puppet] - 10https://gerrit.wikimedia.org/r/576865 (https://phabricator.wikimedia.org/T245203) [15:53:28] next spike will happen at 17:0X [15:54:16] (03PS3) 10Herron: cache: map logstash-next.wikimedia.org and cas-logstash to kibana-next lvs [puppet] - 10https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) [15:54:20] 17:05 or so [15:54:27] no bueno [15:55:56] need a way to control the edit rate of an oauth tool [15:55:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Feel free to merge" [puppet] - 10https://gerrit.wikimedia.org/r/576865 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [15:56:19] 10Operations, 10fundraising-tech-ops, 10netops: DHCP routing issue with civi2001 - https://phabricator.wikimedia.org/T246812 (10ayounsi) a:03Papaul At bootup that MAC shows up on ge-1/0/9 ` ayounsi@fasw-c-codfw> show ethernet-switching table | match 4c:d9:8f:aa:77:b4 frack-payments-codfw 4c:d9:8f:aa:7... [15:57:15] the train moving forward should actually make these deadlocks less likely to happen (which I why I was so confused when they started happening when the train first moved forward) [15:57:34] (03CR) 10Herron: [C: 03+2] cache: map logstash-next.wikimedia.org and cas-logstash to kibana-next lvs [puppet] - 10https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:57:43] (03CR) 10Ottomata: [C: 03+2] Remove unsed eventgate*-to-delete LVS service declarations [puppet] - 10https://gerrit.wikimedia.org/r/576865 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [15:58:12] herron: :) [15:58:14] ok to merge? [15:58:21] puppet-mergee ? [15:58:39] * addshore reads https://www.mediawiki.org/wiki/Manual:Edit_throttling [15:58:47] heya, yes please! 
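For the stated need to control the edit rate of a single tool, one generic option is a token bucket. The sketch below is an assumption-laden illustration of that technique, not MediaWiki's own throttling mechanism and not something the channel decided on; the class name, parameters, and injectable clock are all hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `burst` edits at once, refilled at `rate_per_minute`."""

    def __init__(self, rate_per_minute, burst, clock=time.monotonic):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Consume one token if available; return whether the edit may proceed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never above the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket like this caps the sustained rate while still permitting short bursts, which matters here because the spikes reportedly start as soon as query service lag drops to 0.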
[15:58:53] k [16:00:09] 10Operations, 10Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10ayounsi) Great, thanks, are you going to take care of moving the files over? [16:00:21] (03CR) 10Jhedden: [C: 03+1] codesearch: Prevent ferm from deleting Docker iptables rules [puppet] - 10https://gerrit.wikimedia.org/r/574524 (https://phabricator.wikimedia.org/T246017) (owner: 10BryanDavis) [16:02:04] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --sleep 1 --batch-size=50 # T244115 [16:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:10] T244115: Investigate & Fix holes for aliases in new term tables (take 3) - https://phabricator.wikimedia.org/T244115 [16:04:32] (03PS2) 10Muehlenhoff: Add new DNS entries for logstash-next plus the CAS counter part [dns] - 10https://gerrit.wikimedia.org/r/575530 [16:04:57] akosiaris: gerrit1002 is only a temp machine for releng to test the upgrade to 2.16. it is neither prod nor failover. yea, the icinga checks should have been in permanent downtime though. maybe should have called it gerrit-test, but it was nothing to worry about [16:05:17] !log destroying unused eventgate-main 'main' and eventgate-analytics 'analytics' helm releases - installed: false [16:05:20] gah [16:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:23] the processes appearing and disappearing would have been people working on it [16:05:40] mutante: oh, I wasn't worried. not at all. But it was weird seing systemd restarting gerrit so much [16:06:22] mutante: I don't think is was humans doing it though [16:06:30] yea, indeed. i think it was expired downtime and we did 1 month in early Feb or so [16:06:31] e.g. 
I am logged alone in that machine [16:06:31] and [16:06:34] Active: active (running) since Wed 2020-03-04 16:06:03 UTC; 1s ago [16:06:42] 1s ago, so it gets restarted a lot [16:06:49] * akosiaris on the move [16:06:50] (03PS3) 10Herron: Add new DNS entries for logstash-next plus the CAS counter part [dns] - 10https://gerrit.wikimedia.org/r/575530 (owner: 10Muehlenhoff) [16:07:44] (03PS1) 10Ottomata: Remove unused 'main' and 'analytics' releases from eventgate helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/576871 (https://phabricator.wikimedia.org/T245203) [16:07:47] hmm. ok. *nod*, though people used this for testing before. fixing icinga and forwarding info [16:08:29] (03CR) 10Jforrester: [C: 03+1] "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [16:11:32] (03CR) 10Ottomata: [C: 03+2] Remove unused 'main' and 'analytics' releases from eventgate helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/576871 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [16:11:42] (03CR) 10Herron: [C: 03+2] Add new DNS entries for logstash-next plus the CAS counter part [dns] - 10https://gerrit.wikimedia.org/r/575530 (owner: 10Muehlenhoff) [16:11:45] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 8.292 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:14:19] looking into that indexing errors alert now [16:15:29] (03PS1) 10Ottomata: Remove old eventgate-analytics LVS port from Analyitcs VLAN firewall [homer/public] - 10https://gerrit.wikimedia.org/r/576873 (https://phabricator.wikimedia.org/T233629) [16:15:42] looks like this is gradually increasing since hours ago https://logstash.wikimedia.org/goto/f4de6424805cbc0a69eb646d35ad72d9 [16:16:42] (03PS2) 10Muehlenhoff: Remove 
system::role from role::logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/576639 [16:17:15] (03CR) 10Muehlenhoff: Remove system::role from role::logstash::collector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576639 (owner: 10Muehlenhoff) [16:17:50] 10Operations, 10ops-eqiad, 10User-jbond, 10cloud-services-team (Hardware): drain cloudvirt1006 for battery replacement - https://phabricator.wikimedia.org/T246908 (10Andrew) [16:17:53] (03CR) 10Cwhite: [C: 03+1] "LGTM as is. optional nit inline." (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 (owner: 10Jbond) [16:19:38] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/576639 (owner: 10Muehlenhoff) [16:19:54] this is looking like the worst offender: error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [user_id] of type [long] in document with id 'l_tVpnABWhS8m0eij0ij'. Preview of field's value: 'logged-out'", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"For input string: \"logged-out\""}. [16:19:58] https://logstash.wikimedia.org/goto/cafeac68f1f96de2c6dc597332f4d678 [16:21:52] these are _type=>"restbase" [16:21:56] ugh, index rolled over and the first instance of user_id was numeric :/ [16:22:21] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10Cmjohnson) [16:25:12] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10Cmjohnson) @dzahn or whoever needs these, all of them with the exception of mw1403 is ready for service implementation. mw1403 is not installing and I a... 
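The indexing failure reported here comes from a field mapped as `long` (because the first `user_id` seen after the index rolled over was numeric) later receiving the string 'logged-out'. One conceivable way to sidestep it, sketched purely as an assumption and not as what the pipeline does, is to normalize the field before indexing: keep numeric values as integers and omit anything else. The function name and event shape are hypothetical.

```python
def normalize_user_id(event):
    """Return a copy of `event` whose user_id is an int, or absent if not numeric."""
    cleaned = dict(event)
    if "user_id" not in cleaned:
        return cleaned
    try:
        cleaned["user_id"] = int(cleaned["user_id"])
    except (TypeError, ValueError):
        # Values such as 'logged-out' cannot be stored in a long field.
        del cleaned["user_id"]
    return cleaned
```

Dropping the field loses the explicit 'logged-out' marker, so a variant could record that state in a separate keyword or boolean field instead.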
[16:25:20] (03CR) 10Dzahn: [C: 03+2] cross-validate-accounts: also check wmde group against admins [puppet] - 10https://gerrit.wikimedia.org/r/513201 (owner: 10Dzahn) [16:25:29] (03PS3) 10Dzahn: cross-validate-accounts: also check wmde group against admins [puppet] - 10https://gerrit.wikimedia.org/r/513201 [16:26:04] (03CR) 10Muehlenhoff: [C: 03+2] Remove system::role from role::logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/576639 (owner: 10Muehlenhoff) [16:26:21] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [16:26:58] seems that should be ok, since user_id is numeric with the exception of "logged-out" [16:27:24] wonder if "logged-out" is useful, or if that field could simply be dropped when "logged-out" [16:28:28] (03PS2) 10Ottomata: Add check_eventgate_analyltics_external_cluster [puppet] - 10https://gerrit.wikimedia.org/r/576431 (https://phabricator.wikimedia.org/T233629) [16:30:04] (03CR) 10Dzahn: [C: 04-2] "> There is 1 complication, needing to reimage those servers with a slightly different partman recipe. 
The one used at the beginning seems " [puppet] - 10https://gerrit.wikimedia.org/r/576406 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [16:31:26] (03CR) 10Ottomata: [C: 03+2] Add check_eventgate_analyltics_external_cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576431 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [16:31:35] (03CR) 10Hnowlan: [C: 04-1] changeprop: Add nutcracker sidecar (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/576827 (https://phabricator.wikimedia.org/T213193) (owner: 10Alexandros Kosiaris) [16:31:56] (03PS1) 10Jbond: ssosessions: enable the sso sessions end point [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576884 (https://phabricator.wikimedia.org/T233938) [16:32:07] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [16:32:46] (03PS1) 10Ayounsi: Add support for multiple SNMP communities [homer/mock-private] - 10https://gerrit.wikimedia.org/r/576885 (https://phabricator.wikimedia.org/T246890) [16:32:58] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1403.eqiad.wmnet ` The log can be found in `... 
[16:33:37] (03PS1) 10Ayounsi: Add support for multiple BGP communities [homer/public] - 10https://gerrit.wikimedia.org/r/576886 (https://phabricator.wikimedia.org/T246890) [16:33:43] (03PS2) 10Ottomata: Add discovery for eventgate-analytics-external [dns] - 10https://gerrit.wikimedia.org/r/573367 (https://phabricator.wikimedia.org/T233629) [16:34:29] (03PS1) 10Dzahn: netboot: fix typo in ganeti partman recipe selector [puppet] - 10https://gerrit.wikimedia.org/r/576887 [16:36:10] (03CR) 10Ayounsi: [C: 03+1] Remove old eventgate-analytics LVS port from Analyitcs VLAN firewall [homer/public] - 10https://gerrit.wikimedia.org/r/576873 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [16:38:59] (03PS1) 10Krinkle: multiversion: Update copy of SiteConfiguration to match current MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576889 [16:39:01] (03PS1) 10Krinkle: tests: Remove SiteConfiguration, use src/StaticSiteConfiguration instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576890 [16:40:13] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576499 (https://phabricator.wikimedia.org/T241289) (owner: 10CRusnov) [16:43:03] (03CR) 10Dzahn: "This bracket looks wrong... and given the seemingly wrong partman recipe for new servers..." [puppet] - 10https://gerrit.wikimedia.org/r/576887 (owner: 10Dzahn) [16:44:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576884 (https://phabricator.wikimedia.org/T233938) (owner: 10Jbond) [16:44:56] (03PS1) 10Kosta Harlan: Switch kowiki and viwiki to use ORES for suggested edits topics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576892 (https://phabricator.wikimedia.org/T246171) [16:45:51] (03PS1) 10Muehlenhoff: Adapt CAS vhost name for Kibana 7 [puppet] - 10https://gerrit.wikimedia.org/r/576893 [16:45:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [16:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:04] (03CR) 10Herron: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/576893 (owner: 10Muehlenhoff) [16:49:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:39] !log otto@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-analytics-external [16:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:11] (03PS2) 10Dzahn: netboot/partman: add ganeti101[3-8] and fix typo in selector [puppet] - 10https://gerrit.wikimedia.org/r/576887 (https://phabricator.wikimedia.org/T228924) [16:51:52] (03PS3) 10Ottomata: Add discovery for eventgate-analytics-external [dns] - 10https://gerrit.wikimedia.org/r/573367 (https://phabricator.wikimedia.org/T233629) [16:52:27] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/576886 (https://phabricator.wikimedia.org/T246890) (owner: 10Ayounsi) [16:52:45] 10Operations, 10Cleanup, 10Release-Engineering-Team-TODO, 10Traffic, and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10MusikAnimal) > [x] Remove references from the CentralAuth database. This item is checked but I still see it in `meta_p.wiki` on the Toolforge... 
[16:52:49] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/576885 (https://phabricator.wikimedia.org/T246890) (owner: 10Ayounsi) [16:53:10] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/21264/" [puppet] - 10https://gerrit.wikimedia.org/r/576893 (owner: 10Muehlenhoff) [16:53:15] (03CR) 10Muehlenhoff: [C: 03+2] Adapt CAS vhost name for Kibana 7 [puppet] - 10https://gerrit.wikimedia.org/r/576893 (owner: 10Muehlenhoff) [16:53:17] (03CR) 10Ottomata: [C: 03+2] Add discovery for eventgate-analytics-external [dns] - 10https://gerrit.wikimedia.org/r/573367 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [16:53:19] (03PS8) 10CRusnov: netbox: Add framework for exposing scripts to internal services [puppet] - 10https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) [16:53:31] (03CR) 10Jforrester: [C: 03+1] multiversion: Update copy of SiteConfiguration to match current MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576889 (owner: 10Krinkle) [16:53:45] (03PS3) 10Dzahn: netboot/partman: add new ganeti servers and fix typo in selector [puppet] - 10https://gerrit.wikimedia.org/r/576887 (https://phabricator.wikimedia.org/T228924) [16:54:12] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1403.eqiad.wmnet'] ` and were **ALL** successful. 
[16:54:21] (03CR) 10jerkins-bot: [V: 04-1] netbox: Add framework for exposing scripts to internal services [puppet] - 10https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [16:55:27] (03PS1) 10Cmjohnson: Adding mgmt dns for logstash102[6-9] [dns] - 10https://gerrit.wikimedia.org/r/576897 (https://phabricator.wikimedia.org/T240881) [16:55:33] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for logstash102[6-9] [dns] - 10https://gerrit.wikimedia.org/r/576897 (https://phabricator.wikimedia.org/T240881) (owner: 10Cmjohnson) [16:56:24] (03CR) 10Hnowlan: "> I can do that for you, or pair to demostrate how to do it if you're curious" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [16:56:31] (03PS9) 10CRusnov: netbox: Add framework for exposing scripts to internal services [puppet] - 10https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) [16:56:39] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [16:59:20] 10Operations, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10Cmjohnson) a:05Cmjohnson→03jijiki @jijiki all servers are now ready for implementation. I am removing the ops-eqiad tag and assigned to you [17:01:29] (03PS1) 10Dzahn: cloud: add parsoid cluster in cloud Hiera [puppet] - 10https://gerrit.wikimedia.org/r/576898 (https://phabricator.wikimedia.org/T246854) [17:02:37] 10Operations, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10Dzahn) a:05jijiki→03None Thanks @Cmjohnson! I'll take that as jijiki is currently away. 
[17:02:48] 10Operations, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10Dzahn) a:03Dzahn [17:05:40] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.7542 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [17:06:32] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/576493 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [17:07:47] 10Operations, 10Cleanup, 10Release-Engineering-Team-TODO, 10Traffic, and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) >>! In T238803#5941793, @MusikAnimal wrote: >> [x] Remove references from the CentralAuth database. > > This item is checked... [17:09:07] (03PS2) 10Dzahn: site: add ganeti role to all new ganeti servers [puppet] - 10https://gerrit.wikimedia.org/r/576406 (https://phabricator.wikimedia.org/T228924) [17:09:38] (03CR) 10Dzahn: [C: 04-2] "stalled until after reinstall" [puppet] - 10https://gerrit.wikimedia.org/r/576406 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [17:10:55] (03CR) 10Jforrester: [C: 03+1] cloud: add parsoid cluster in cloud Hiera [puppet] - 10https://gerrit.wikimedia.org/r/576898 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [17:11:47] (03PS3) 10Dzahn: site: add installserver::light role on new install servers [puppet] - 10https://gerrit.wikimedia.org/r/572394 (https://phabricator.wikimedia.org/T224576) [17:12:02] 10Operations, 10Traffic: switch to irate() instead of rate() for traffic graphs - https://phabricator.wikimedia.org/T246902 (10DannyS712) Guessing this is about traffic? 
[17:12:13] (03Abandoned) 10Dzahn: site: add installserver::light role on new install servers [puppet] - 10https://gerrit.wikimedia.org/r/572394 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [17:14:01] (03PS1) 10Jhedden: openstack: switch cloudvirt-wdqs servers to Ceph [puppet] - 10https://gerrit.wikimedia.org/r/576903 (https://phabricator.wikimedia.org/T221631) [17:20:10] (03CR) 10Dzahn: installserver: add parameter for DHCP interface (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [17:21:40] (03PS2) 10Cmjohnson: Adding mgmt dns for logstash102[6-9] [dns] - 10https://gerrit.wikimedia.org/r/576897 (https://phabricator.wikimedia.org/T240881) [17:21:50] (03PS6) 10Dzahn: installserver: ensure interface for DHCP server is configured [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) [17:22:05] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for logstash102[6-9] [dns] - 10https://gerrit.wikimedia.org/r/576897 (https://phabricator.wikimedia.org/T240881) (owner: 10Cmjohnson) [17:22:25] (03PS3) 10Cmjohnson: Adding mgmt dns for logstash102[6-9] [dns] - 10https://gerrit.wikimedia.org/r/576897 (https://phabricator.wikimedia.org/T240881) [17:22:42] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for logstash102[6-9] [dns] - 10https://gerrit.wikimedia.org/r/576897 (https://phabricator.wikimedia.org/T240881) (owner: 10Cmjohnson) [17:24:26] (03CR) 10jerkins-bot: [V: 04-1] installserver: ensure interface for DHCP server is configured [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [17:26:23] (03CR) 10Bstorm: [C: 03+2] toolforge: remove special configuration for kubernetes on proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/576469 (https://phabricator.wikimedia.org/T214513) (owner: 10Bstorm) [17:26:56] jouncebot: next [17:26:56] In 1 hour(s) and 33 minute(s): Morning SWAT(Max 
6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T1900) [17:29:06] (03PS1) 10Dwisehaupt: Fix frpm2001 ip [dns] - 10https://gerrit.wikimedia.org/r/576905 (https://phabricator.wikimedia.org/T242269) [17:32:39] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939 (10Mholloway) 05Open→03Stalled Moving this out of Tracking and to the Backlog since PI engineers are actually involved with it. Changing the status t... [17:32:42] (03PS1) 10Elukey: profile::superset: allow to deploy Presto TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/576907 (https://phabricator.wikimedia.org/T239903) [17:33:43] (03Abandoned) 10Cmjohnson: Adding mgmt dns for logstash102[6-9] [dns] - 10https://gerrit.wikimedia.org/r/576897 (https://phabricator.wikimedia.org/T240881) (owner: 10Cmjohnson) [17:34:23] (03CR) 10Jgreen: [C: 03+2] Fix frpm2001 ip [dns] - 10https://gerrit.wikimedia.org/r/576905 (https://phabricator.wikimedia.org/T242269) (owner: 10Dwisehaupt) [17:35:03] (03CR) 10jerkins-bot: [V: 04-1] profile::superset: allow to deploy Presto TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/576907 (https://phabricator.wikimedia.org/T239903) (owner: 10Elukey) [17:35:26] (03PS1) 10Cwhite: profile: add restbase filter and coerce err key to string [puppet] - 10https://gerrit.wikimedia.org/r/576908 (https://phabricator.wikimedia.org/T239090) [17:36:32] (03PS2) 10Cwhite: profile: add restbase filter and coerce err key to string [puppet] - 10https://gerrit.wikimedia.org/r/576908 (https://phabricator.wikimedia.org/T239090) [17:37:06] (03PS1) 10Cmjohnson: Adding mgmt dns for logstash102[6-9] [dns] - 10https://gerrit.wikimedia.org/r/576909 (https://phabricator.wikimedia.org/T240881) [17:39:37] (03PS1) 10Cwhite: profile: coerce mediawiki user_id field to string in logstash [puppet] - 10https://gerrit.wikimedia.org/r/576910 (https://phabricator.wikimedia.org/T239458) 
[17:41:01] !log stop item term rebuild at Q Q60345318 as I generate more lists (T219123) [17:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:06] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [17:41:49] (03PS2) 10Elukey: profile::superset: allow to deploy Presto TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/576907 (https://phabricator.wikimedia.org/T239903) [17:43:38] (03PS7) 10Dzahn: installserver: ensure interface for DHCP server is configured [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) [17:44:37] (03PS1) 10Bstorm: toolforge: clean up maintain_kubeusers and legacy proxy puppet code. [puppet] - 10https://gerrit.wikimedia.org/r/576911 (https://phabricator.wikimedia.org/T246689) [17:44:41] (03CR) 10Elukey: [C: 03+2] profile::superset: allow to deploy Presto TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/576907 (https://phabricator.wikimedia.org/T239903) (owner: 10Elukey) [17:49:18] 10Operations, 10Traffic: switch to irate() instead of rate() for traffic graphs - https://phabricator.wikimedia.org/T246902 (10CDanis) Yes, but about Observability too. This is/was a placeholder task so I don't forget to write a proper description later :) [17:49:44] 10Operations, 10netops: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364 (10ayounsi) 05Open→03Resolved a:03ayounsi I don't think there is much value anymore for this task (it was for last year). The spreadsheet for next FY capex has a 10G column. [17:52:14] (03CR) 10Bstorm: "This seems to do what it should do. I am not obsessing over cleanup for tools-k8s-master-01 because that thing will be deleted. 
The proxy " [puppet] - 10https://gerrit.wikimedia.org/r/576911 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [17:54:33] (03PS1) 10Hnowlan: jobrunner: Standard mediawiki webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) [18:01:10] (03PS8) 10Dzahn: installserver: ensure interface for DHCP server is configured [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) [18:01:59] (03CR) 10Dzahn: [C: 03+2] cloud: add parsoid cluster in cloud Hiera [puppet] - 10https://gerrit.wikimedia.org/r/576898 (https://phabricator.wikimedia.org/T246854) (owner: 10Dzahn) [18:02:10] (03PS8) 10CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) [18:04:15] (03CR) 10jerkins-bot: [V: 04-1] installserver: ensure interface for DHCP server is configured [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [18:06:47] (03CR) 10Bstorm: "Just a note: I'm uncherry-picking this from toolsbeta for now because it is causing problems for cleaning up the old Kubernetes cluster. " [puppet] - 10https://gerrit.wikimedia.org/r/566491 (https://phabricator.wikimedia.org/T218427) (owner: 10Legoktm) [18:07:00] (03PS9) 10Dzahn: installserver: ensure interface for DHCP server is configured [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) [18:14:02] (03PS4) 10Jbond: cli: filter the hosts array to remove empty elements [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 [18:14:10] (03CR) 10Bstorm: [C: 03+2] toolforge: clean up maintain_kubeusers and legacy proxy puppet code. [puppet] - 10https://gerrit.wikimedia.org/r/576911 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [18:16:16] (03CR) 10BryanDavis: "Two +1's don't quite make a +2. 
:) Someone with root needs to actually merge, but maybe it would be a good idea to coordinate with Legoktm" [puppet] - 10https://gerrit.wikimedia.org/r/574524 (https://phabricator.wikimedia.org/T246017) (owner: 10BryanDavis) [18:18:41] (03PS5) 10Jbond: cli: filter the hosts array to remove empty elements [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 [18:18:59] (03CR) 10Jbond: "updated and expanded thanks" (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 (owner: 10Jbond) [18:20:16] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/21267/install1002.wikimedia.org/change.install1002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [18:20:39] (03PS10) 10Dzahn: installserver: ensure interface for DHCP server is configured [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) [18:24:04] (03CR) 10Dzahn: [C: 03+2] "now it works as expected. noop on prod servers. https://puppet-compiler.wmflabs.org/compiler1001/21268/" [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [18:25:58] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Abban Dunne to the ldap/wmde group - https://phabricator.wikimedia.org/T246664 (10RStallman-legalteam) The NDA is signed and on file. Thank you! 
[18:27:27] (03PS2) 10Hnowlan: jobrunner: Standard mediawiki webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) [18:27:38] (03CR) 10Dzahn: "works: INTERFACESv4="ens5" # Managed by puppet" [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [18:28:16] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:28:17] (03CR) 10Bstorm: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/566491 (https://phabricator.wikimedia.org/T218427) (owner: 10Legoktm) [18:28:24] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:28:30] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:36] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:28:38] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:28:42] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:28:54] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:29:28] PROBLEM - Disk space on 
notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [18:30:04] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:30:21] ^ sigh, we could do self-healing for that with eventhandlers, it's so common [18:30:36] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:30:42] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:46] !log notebook1003 - restarted nagios-nrpe-server [18:30:48] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:52] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:30:56] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:31:37] mutante: yeah I know sorry :( I think I fixed the problems with stat100X with a global limit of the systemd user.slice, but notebooks are using the system.slice that currently has no limits [18:31:42] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [18:31:52] (I 
believe that nagios is under it too) [18:32:06] so I'll try tomorrow to add something, thanks for following up [18:32:09] elukey: ah, but that's cool for stats servers :) [18:32:18] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:32:19] oh, thank you [18:32:42] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:34:54] sbassett: mind a PM? [18:35:21] I responded on phab but i can elaborate if needed [18:36:52] RhinosF1: PM fine [18:36:56] I'll look at comment too [18:37:35] (03CR) 10Dzahn: "noop on prod DHCP (install1002/2002)" [puppet] - 10https://gerrit.wikimedia.org/r/576479 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [18:40:53] (03PS1) 10Muehlenhoff: Remove cas-logstash-next from IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/576921 [18:43:02] (03PS3) 10Hnowlan: jobrunner: Standard mediawiki webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) [18:43:06] (03PS1) 10Ottomata: Blacklist mediawiki_job_CleanTermsIfUnused from refinement [puppet] - 10https://gerrit.wikimedia.org/r/576922 [18:43:08] !log starting new DHCP servers to confirm they work and letting puppet immediately stop them again to clear systemd status [18:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:30] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:48:00] (03CR) 10Ottomata: [C: 03+2] Blacklist mediawiki_job_CleanTermsIfUnused from refinement [puppet] - 10https://gerrit.wikimedia.org/r/576922 (owner: 10Ottomata) [18:53:38] (03PS4) 10Hnowlan: jobrunner: Standard mediawiki webserver 
configuration [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) [18:57:21] (03CR) 10Dzahn: [C: 03+1] "lgtm, compiler results look fine and change catalogs show directories are only removed on inactive server. https://puppet-compiler.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/576323 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T1900). [19:00:04] ottomata: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:15] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) [19:00:39] here [19:01:17] for whoever is doing swat, I have one backport [19:01:23] i know how to swat config changes via scap sync file [19:01:34] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) Management interface network config had some issues, fixed now. [19:01:36] but not code changes (via whateever else? just scap deploy?) [19:01:56] this one https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/576923 [19:03:00] addshore: the tool as a whole, as opposed to individual users? [19:04:51] probably not possible out of the box, but should be easy to add, file a task [19:06:09] Niharika: are you swatting? [19:06:22] 10Operations, 10Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10Dzahn) >>! 
In T242602#5941498, @ayounsi wrote: > Great, thanks, are you going to take care of moving the files over? Yes, it's fully puppetized. rsync of /srv/ from the primary ser... [19:08:06] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:08:09] RoanKattouw: swat? [19:08:16] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:08:20] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:26] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:08:30] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:08:32] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:09:18] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [19:09:53] ottomata: sorry I'm at a conference so it's not a great time [19:09:56] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient 
[19:10:28] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:10:42] ok! np not sure how these swat peoplee get scheduled [19:10:44] Urbanecm: ? [19:10:45] swat? [19:11:01] ottomata: no one else's around? [19:11:04] well this time notebook1003 is actually down [19:11:13] and not just nagios-nrpe-server.. it looks [19:11:29] ottomata: I can SWAT if you want [19:11:55] or help you to sync this out if you want [19:12:00] (03CR) 10Jforrester: [C: 03+1] tests: Remove SiteConfiguration, use src/StaticSiteConfiguration instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576890 (owner: 10Krinkle) [19:12:16] Urbanecm: would love some instructions, i have deploy perms just not sure of the right way for code backports [19:12:29] (03PS1) 10Andrew Bogott: keystone: port some custom .py files to python3 [puppet] - 10https://gerrit.wikimedia.org/r/576927 [19:12:31] (03PS1) 10Andrew Bogott: neutron: update l3_agent hacks for Queens [puppet] - 10https://gerrit.wikimedia.org/r/576928 [19:12:42] I think there are some on Wikitech but not sure how updated are they. [19:12:46] ottomata: instructions are at https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#mediawiki/extensions_and_mediawiki/skins [19:13:09] feel free to ask if you have any questions [19:13:11] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_the_deployment_host [19:13:11] ? 
[19:13:20] ok [19:13:30] PROBLEM - SSH on notebook1003 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:13:47] cool doing that [19:13:48] (03CR) 10jerkins-bot: [V: 04-1] neutron: update l3_agent hacks for Queens [puppet] - 10https://gerrit.wikimedia.org/r/576928 (owner: 10Andrew Bogott) [19:14:02] ottomata: the one you linked also works [19:15:40] RECOVERY - SSH on notebook1003 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:15:58] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [19:16:28] (03PS2) 10Andrew Bogott: neutron: update l3_agent hacks for Queens [puppet] - 10https://gerrit.wikimedia.org/r/576928 [19:16:36] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [19:16:42] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:17:00] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:17:10] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:17:20] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:17:26] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:17:28] 
RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:19:15] (03CR) 10Andrew Bogott: [C: 03+2] keystone: port some custom .py files to python3 [puppet] - 10https://gerrit.wikimedia.org/r/576927 (owner: 10Andrew Bogott) [19:20:08] tgr: yt? [19:20:21] ottomata: o/ [19:20:30] hello! am trying to test your change on test.wikipedia.org using mwdebug1001 [19:20:54] uh, which change is that? [19:21:02] oh, right, the error logging one [19:21:05] ya [19:21:06] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/576540 [19:21:25] it is live there [19:21:34] I haven't tested TBH, but seemed like a trivial fix [19:21:38] aye [19:21:44] am doing a mw.track('global.error', ...) but not seeing much change [19:21:58] is there way to check what handlers are installed for a topic? [19:22:11] not without a debugger [19:22:20] I can check, give me a sec [19:22:25] ok ty [19:23:22] ottomata: you sure you're testing the correct change at correct wiki? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/576923 seems to be for wmf.21, per https://tools.wmflabs.org/versions, group0 (=testwikis) are at wmf.22 [19:23:33] .... [19:23:36] OH RIGHT BECAUSE OF TRAIN [19:23:37] doh [19:23:55] wait...right. ok making another brancih patch [19:27:19] 10Operations, 10ops-eqiad, 10DC-Ops: audit/rebalance power in a5-eqiad - https://phabricator.wikimedia.org/T245655 (10wiki_willy) Thanks @ayounsi , the ones you pasted should have a threshold set to 3.44kw (or 3440 watts). Actually, would it be possible setting all the PDUs we have in just eqiad and codfw (... [19:29:40] ok tgr heh, should be deployed properly on mwdebug1001 right now [19:33:47] 10Operations, 10fundraising-tech-ops, 10netops: DHCP routing issue with civi2001 - https://phabricator.wikimedia.org/T246812 (10Papaul) @ayounsi thanks for the troubleshooting I will look into this tomorrow. 
[19:34:30] 10Operations, 10fundraising-tech-ops, 10netops: DHCP routing issue with civi2001 - https://phabricator.wikimedia.org/T246812 (10Papaul) p:05Triage→03Medium [19:34:58] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:11] (03CR) 10Herron: [C: 03+1] "let's try it! IMO worth disabling puppet on the collectors for a controlled deploy, just in case" [puppet] - 10https://gerrit.wikimedia.org/r/576908 (https://phabricator.wikimedia.org/T239090) (owner: 10Cwhite) [19:36:54] (03CR) 10Herron: [C: 03+1] "let's try it! IMO worth disabling puppet on the collectors for a controlled deploy, just in case" [puppet] - 10https://gerrit.wikimedia.org/r/576910 (https://phabricator.wikimedia.org/T239458) (owner: 10Cwhite) [19:37:16] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:37:22] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:37:24] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [19:37:52] PROBLEM - SSH on notebook1003 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:38:06] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [19:38:46] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection 
refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [19:39:08] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [19:39:20] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:39:46] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:43:11] !log restart logstash on logstash2005 -- testing field type mismatch mitigation [19:43:47] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10chasemp) @hmarcus works as intended. Thanks. [19:45:09] ottomata: sorry, I got distracted [19:45:12] np [19:45:16] seems to work fine, though [19:45:21] oh? [19:45:27] how can you tell? [19:45:33] I see a network request to https://intake-logging.wikimedia.org/v1/events?hasty=true [19:45:38] OH [19:45:39] you do? [19:45:40] that is good.... [19:45:50] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for mw2350 to mw2365 [dns] - 10https://gerrit.wikimedia.org/r/576934 [19:45:56] OH i do too! [19:46:05] ok, then it is something on the server side [19:46:07] sometimes you need to wait 5 mins after deploying a JS change, due to ResourceLoader caching [19:46:09] thank you! ok will investigate [19:46:11] ahhh [19:46:12] interesting [19:46:13] ok [19:46:22] then swat over thank you! 
[19:46:23] :) [19:46:44] RECOVERY - SSH on notebook1003 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:50:37] (03CR) 10Jhedden: [C: 03+2] openstack: switch cloudvirt-wdqs servers to Ceph [puppet] - 10https://gerrit.wikimedia.org/r/576903 (https://phabricator.wikimedia.org/T221631) (owner: 10Jhedden) [19:51:38] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [19:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:22] !log otto@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: SWAT: [[gerrit:576931|Fix callback parameters for client error logging (T246030)]] (duration: 01m 07s) [19:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:26] T246030: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 [20:00:04] liw and Brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - European+American Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T2000). 
[20:16:22] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [20:16:46] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:16:58] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:17:00] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:17:06] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:17:12] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:17:14] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:17:58] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [20:21:04] 10Operations, 10Phabricator, 10Traffic: Phabricator is inaccessible from Egypt - https://phabricator.wikimedia.org/T246923 (10Aklapper) Hi, this might be intended if you use certain providers / IP addresses which were also used by a persistent vandal (non-public reference: T218589#5033515 ). [20:22:50] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1003 is OK: OK: synced at Wed 2020-03-04 20:22:49 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [20:29:16] (03PS1) 10RobH: adding R640 skus [software] - 10https://gerrit.wikimedia.org/r/576944 [20:29:57] !log otto@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: [[gerrit:576942|Include required url in mediawiki/client/error event (T246030)]] (duration: 01m 05s) [20:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:02] T246030: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 [20:33:48] tgr: another q for you. can you think of a good way to make the mw.errorLogger window.onerror handler fire? [20:34:09] mw.track('global.error' is working! [20:34:17] now i want mw.errorLogger to make it happen [20:47:07] Hi! CPT will need to request a new wiki for the Developer Portal we're working on. I want to make sure I'm following the right steps as I have not done this before. I'm following along with [20:47:07] https://wikitech.wikimedia.org/wiki/Add_a_wiki [20:47:07] But it's unclear to me if once I file the tickets from the Notify section I can proceed directly to submitting patches for the next steps. [20:50:50] jouncebot: now [20:50:50] For the next 0 hour(s) and 9 minute(s): Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T2000) [20:50:58] wdoran: your best bet is to ask the clinic duty person in the channel topic to get specific answers [20:51:15] but they are like UTC+5 or something I think so may have to wait a day [20:51:23] groovy tak [21:00:04] cscott, arlolra, subbu, halfak, and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200304T2100). 
[21:06:48] (03PS1) 10EBernhardson: [cirrus] Configuration for glent m0 AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576952 (https://phabricator.wikimedia.org/T246947) [21:07:20] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] Configuration for glent m0 AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576952 (https://phabricator.wikimedia.org/T246947) (owner: 10EBernhardson) [21:07:48] wdoran: as I often participate in creating wikis, creting a task against wiki-setup (create) with details what should your wiki be configured like is enough, and you just need to watch the ticket for question asked :-). The actual preparation for creation is DNS, apache and language (if applicable), described in the Preparation section of https://wikitech.wikimedia.org/wiki/Add_a_wiki, plus mediawiki config is usually done [21:07:48] beforehand. The rest is done during the actual wiki creation, done by a deployer (usually, more pending wikis are created at once). Hope that helps! [21:09:06] urbanecm: that does, thank you [21:09:16] 10Operations, 10Phabricator, 10Traffic: Phabricator is inaccessible from Egypt - https://phabricator.wikimedia.org/T246923 (10Krenair) If it is that I would not expect HTTP 501 responses. [21:09:22] happy to help! [21:11:06] (03CR) 10Ottomata: "This should be ready to go! Let's merge tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/573369 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [21:15:20] (03PS3) 10Andrew Bogott: neutron: update l3_agent hacks for Queens [puppet] - 10https://gerrit.wikimedia.org/r/576928 [21:15:22] (03PS1) 10Andrew Bogott: Neutron: add manifests for queens [puppet] - 10https://gerrit.wikimedia.org/r/576953 [21:15:24] (03PS1) 10Andrew Bogott: nova: add openstack queens manifests [puppet] - 10https://gerrit.wikimedia.org/r/576954 [21:15:26] (03PS1) 10Andrew Bogott: glance: add queens service manifest [puppet] - 10https://gerrit.wikimedia.org/r/576955 [21:15:28] (03PS1) 10Andrew Bogott: keystone: add Queens service manifests [puppet] - 10https://gerrit.wikimedia.org/r/576956 [21:15:30] (03PS1) 10Andrew Bogott: cloud-vps client packages: add Queens manifests [puppet] - 10https://gerrit.wikimedia.org/r/576957 [21:15:34] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for logstash102[6-9] [dns] - 10https://gerrit.wikimedia.org/r/576909 (https://phabricator.wikimedia.org/T240881) (owner: 10Cmjohnson) [21:16:05] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [21:16:05] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[21:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:25] 10Operations, 10Cleanup, 10Release-Engineering-Team-TODO, 10Traffic, and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10bd808) [21:16:31] (03PS1) 10CDanis: LVS: add alert on CPU saturation, which causes pkt drops [puppet] - 10https://gerrit.wikimedia.org/r/576958 [21:17:10] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash, 10Patch-For-Review: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10Cmjohnson) [21:17:22] 10Operations, 10Cleanup, 10Release-Engineering-Team-TODO, 10Traffic, and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) [21:17:43] (03CR) 10jerkins-bot: [V: 04-1] Neutron: add manifests for queens [puppet] - 10https://gerrit.wikimedia.org/r/576953 (owner: 10Andrew Bogott) [21:18:19] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Cmjohnson) a:03Jgreen Removing ops-eqiad tag and assigning to @Jgreen [21:18:44] (03CR) 10jerkins-bot: [V: 04-1] keystone: add Queens service manifests [puppet] - 10https://gerrit.wikimedia.org/r/576956 (owner: 10Andrew Bogott) [21:19:28] (03PS2) 10CDanis: LVS: add alert on CPU saturation, which causes pkt drops [puppet] - 10https://gerrit.wikimedia.org/r/576958 [21:21:26] (03PS2) 10Andrew Bogott: Neutron: add manifests for queens [puppet] - 10https://gerrit.wikimedia.org/r/576953 [21:21:31] (03PS2) 10Andrew Bogott: nova: add openstack queens manifests [puppet] - 10https://gerrit.wikimedia.org/r/576954 [21:21:58] (03PS2) 10Andrew Bogott: glance: add queens service manifest [puppet] - 10https://gerrit.wikimedia.org/r/576955 [21:22:19] (03PS2) 10Andrew Bogott: keystone: add Queens service manifests 
[puppet] - 10https://gerrit.wikimedia.org/r/576956 [21:23:08] (03PS2) 10Andrew Bogott: cloud-vps client packages: add Queens manifests [puppet] - 10https://gerrit.wikimedia.org/r/576957 [21:23:10] (03PS4) 10Andrew Bogott: neutron: update l3_agent hacks for Queens [puppet] - 10https://gerrit.wikimedia.org/r/576928 [21:24:18] (03CR) 10jerkins-bot: [V: 04-1] keystone: add Queens service manifests [puppet] - 10https://gerrit.wikimedia.org/r/576956 (owner: 10Andrew Bogott) [21:28:22] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [21:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:11] (03CR) 10CDanis: [C: 03+2] "Query tested by hand with icinga command expander + check_prometheus" [puppet] - 10https://gerrit.wikimedia.org/r/576958 (owner: 10CDanis) [21:35:18] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [21:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:02] (03PS1) 10Urbanecm: Add gewikimedia to special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576965 [21:39:28] (03PS1) 10RLazarus: site: Assign mw14{05,07,09,11,13} as appservers. 
[puppet] - 10https://gerrit.wikimedia.org/r/576966 (https://phabricator.wikimedia.org/T241849) [21:39:49] (03PS1) 10Herron: elasticsearch: add max_clause_count setting [puppet] - 10https://gerrit.wikimedia.org/r/576967 (https://phabricator.wikimedia.org/T234854) [21:39:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:41:18] (03CR) 10MarcoAurelio: [C: 03+1] Add gewikimedia to special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576965 (owner: 10Urbanecm) [21:41:20] (03CR) 10Urbanecm: [C: 03+2] Add gewikimedia to special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576965 (owner: 10Urbanecm) [21:41:22] (03CR) 10Herron: [C: 04-1] "wip for now" [puppet] - 10https://gerrit.wikimedia.org/r/576967 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:41:24] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add max_clause_count setting [puppet] - 10https://gerrit.wikimedia.org/r/576967 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:42:59] (03PS2) 10Herron: elasticsearch: add max_clause_count setting [puppet] - 10https://gerrit.wikimedia.org/r/576967 (https://phabricator.wikimedia.org/T234854) [21:43:05] !log urbanecm@deploy1001 Synchronized dblists/special.dblist: 8decd01: Add gewikimedia to special wikis (duration: 01m 06s) [21:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:53] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576968 [21:43:55] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576968 (owner: 10Urbanecm) [21:44:59] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/576968 (owner: 10Urbanecm) [21:46:08] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 25s) [21:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:17] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [21:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:26] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [21:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:15] (03CR) 10CDanis: "OK, looks like I finally got this right: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=saturated" [puppet] - 10https://gerrit.wikimedia.org/r/576958 (owner: 10CDanis) [21:47:23] 10Operations, 10Phabricator, 10Traffic: Phabricator is inaccessible from Egypt - https://phabricator.wikimedia.org/T246923 (10ahmad) Shpuldn't this be a 403/Forbidden response? 
[21:48:43] 10Operations, 10Phabricator, 10Traffic: Phabricator is inaccessible from Egypt: HTTP 501 error - https://phabricator.wikimedia.org/T246923 (10Aklapper) [21:49:35] (03CR) 10Andrew Bogott: [C: 03+2] Neutron: add manifests for queens [puppet] - 10https://gerrit.wikimedia.org/r/576953 (owner: 10Andrew Bogott) [21:49:49] (03CR) 10Andrew Bogott: [C: 03+2] nova: add openstack queens manifests [puppet] - 10https://gerrit.wikimedia.org/r/576954 (owner: 10Andrew Bogott) [21:50:00] (03CR) 10Andrew Bogott: [C: 03+2] glance: add queens service manifest [puppet] - 10https://gerrit.wikimedia.org/r/576955 (owner: 10Andrew Bogott) [21:50:25] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] keystone: add Queens service manifests [puppet] - 10https://gerrit.wikimedia.org/r/576956 (owner: 10Andrew Bogott) [21:50:42] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps client packages: add Queens manifests [puppet] - 10https://gerrit.wikimedia.org/r/576957 (owner: 10Andrew Bogott) [21:57:40] (03PS5) 10Andrew Bogott: neutron: update l3_agent hacks for Queens [puppet] - 10https://gerrit.wikimedia.org/r/576928 [21:57:42] (03PS1) 10Andrew Bogott: cloudservices: update designate to openstack Queens [puppet] - 10https://gerrit.wikimedia.org/r/576972 (https://phabricator.wikimedia.org/T242766) [22:00:38] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [22:00:40] 10Operations, 10Phabricator, 10Traffic: Phabricator is inaccessible from Egypt: HTTP 501 error - https://phabricator.wikimedia.org/T246923 (10Urbanecm) >>! In T246923#5943004, @Krenair wrote: > If it is that I would not expect HTTP 501 responses. >>! In T246923#5943149, @ahmad wrote: > Shpuldn't this be a 4... 
[22:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:55] (03PS1) 10Dzahn: site/conftool: add new appservers in eqiad row B [puppet] - 10https://gerrit.wikimedia.org/r/576973 (https://phabricator.wikimedia.org/T241849) [22:02:58] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [22:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:17] (03PS2) 10Dzahn: site/conftool: add new appservers in eqiad row B [puppet] - 10https://gerrit.wikimedia.org/r/576973 (https://phabricator.wikimedia.org/T241849) [22:05:52] (03CR) 10RLazarus: [C: 03+1] site/conftool: add new appservers in eqiad row B [puppet] - 10https://gerrit.wikimedia.org/r/576973 (https://phabricator.wikimedia.org/T241849) (owner: 10Dzahn) [22:09:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:10:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:09] (03CR) 10Dzahn: [C: 03+2] site/conftool: add new appservers in eqiad row B [puppet] - 10https://gerrit.wikimedia.org/r/576973 (https://phabricator.wikimedia.org/T241849) (owner: 10Dzahn) [22:11:17] 10Operations, 10serviceops, 10Patch-For-Review: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 7 host(s) and their services with reason: new_install ` 
mw[1393-1399].eqi... [22:11:20] (03PS3) 10Dzahn: site/conftool: add new appservers in eqiad row B [puppet] - 10https://gerrit.wikimedia.org/r/576973 (https://phabricator.wikimedia.org/T241849) [22:11:21] 10Operations, 10serviceops, 10Patch-For-Review: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install ` mw[1400-1404].eqi... [22:11:26] 10Operations, 10ops-eqiad, 10User-jbond, 10cloud-services-team (Hardware): drain cloudvirt1006 for battery replacement - https://phabricator.wikimedia.org/T246908 (10Andrew) I've announced that this draining will happen on Friday at 15:00UTC (9AM my time) [22:12:56] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [22:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:20] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [22:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:57] Urbanecm: Please stop. [22:17:03] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [22:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:23] James_F: with...? I'm not doing anything now [22:17:44] Urbanecm: You just re-opened a task which is fixed, wrongly, and then duped a different task into said open task. [22:18:49] James_F: No, I didn't. Krinkle did https://gerrit.wikimedia.org/r/#/c/553228/, which I (without the knowledge of T239301) kinda reverted in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/576965. 
[22:18:50] T239301: Ensure all wikis are configured to be in exactly one "family" (wikipedia/wiktionary/special/…) - https://phabricator.wikimedia.org/T239301 [22:19:19] No one cares about interwikis. [22:19:30] what do you mean? [22:19:40] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices: update designate to openstack Queens [puppet] - 10https://gerrit.wikimedia.org/r/576972 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [22:20:28] (03PS1) 10Jforrester: Revert "Add gewikimedia to special wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576974 [22:21:09] and regarding T183549... Isn't that about "arbcom_* wikis are in both wikipedia and special", which, thanks to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/553226 (linked to T239301), is no longer true? [22:21:10] T183549: Arbcom wikis are in both wikipedia.dblist and special.dblist - https://phabricator.wikimedia.org/T183549 [22:21:32] I think there are some which are still duped? Need to check. [22:22:00] But we're re-doing everything about dblists right now. Making changes in this area is very unhelpful. :-) [22:23:06] James_F: Seems that no arbcom_* wiki is currently in special now. [22:24:01] James_F: I'm curious, if https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/576974 gets merged, how are they going to use the interwikis then? [22:24:11] (03PS1) 10Dzahn: add fake certs for mw1393 through mw1404 [labs/private] - 10https://gerrit.wikimedia.org/r/576975 (https://phabricator.wikimedia.org/T241849) [22:24:11] Badly. [22:24:44] Interwiki links are such a low-value, high-cost feature. Maybe we should just kill them. [22:24:55] (03PS1) 10Krinkle: tests: Re-enable 'family' dblist test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576977 (https://phabricator.wikimedia.org/T239301) [22:25:11] Thanks, Krinkle. 
[22:25:26] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake certs for mw1393 through mw1404 [labs/private] - 10https://gerrit.wikimedia.org/r/576975 (https://phabricator.wikimedia.org/T241849) (owner: 10Dzahn) [22:26:11] (03CR) 10jerkins-bot: [V: 04-1] tests: Re-enable 'family' dblist test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576977 (https://phabricator.wikimedia.org/T239301) (owner: 10Krinkle) [22:26:18] James_F: Does that mean we're going to keep stuff intentionally broken? [22:27:26] (03PS2) 10RLazarus: site: Assign mw14{05,07,09,11,13} as appservers. [puppet] - 10https://gerrit.wikimedia.org/r/576966 (https://phabricator.wikimedia.org/T241849) [22:27:49] Urbanecm: It's MediaWiki. [22:27:51] (03CR) 10jerkins-bot: [V: 04-1] site: Assign mw14{05,07,09,11,13} as appservers. [puppet] - 10https://gerrit.wikimedia.org/r/576966 (https://phabricator.wikimedia.org/T241849) (owner: 10RLazarus) [22:28:11] (03PS3) 10RLazarus: site: Assign appservers and API servers in eqiad row C. [puppet] - 10https://gerrit.wikimedia.org/r/576966 (https://phabricator.wikimedia.org/T241849) [22:28:42] (03CR) 10jerkins-bot: [V: 04-1] site: Assign appservers and API servers in eqiad row C. [puppet] - 10https://gerrit.wikimedia.org/r/576966 (https://phabricator.wikimedia.org/T241849) (owner: 10RLazarus) [22:28:48] (03PS2) 10Jforrester: tests: Re-enable 'family' dblist test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576977 (https://phabricator.wikimedia.org/T239301) (owner: 10Krinkle) [22:29:06] (03PS3) 10Krinkle: tests: Re-enable 'family' dblist test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576977 (https://phabricator.wikimedia.org/T239301) [22:29:53] James_F: could you please explain to me what the next steps on this issue are? 
[22:30:13] (and why T183549 isn't a duplicate of T239301) [22:30:14] T183549: Arbcom wikis are in both wikipedia.dblist and special.dblist - https://phabricator.wikimedia.org/T183549 [22:30:14] T239301: Ensure all wikis are configured to be in exactly one "family" (wikipedia/wiktionary/special/…) - https://phabricator.wikimedia.org/T239301 [22:30:16] (03CR) 10jerkins-bot: [V: 04-1] tests: Re-enable 'family' dblist test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576977 (https://phabricator.wikimedia.org/T239301) (owner: 10Krinkle) [22:30:53] Urbanecm: The immediate next steps are to revert your breakage. Then we'll have to think about whether it's possible to fix interwikis or not. [22:31:03] (03PS4) 10RLazarus: site: Assign appservers and API servers in eqiad row C. [puppet] - 10https://gerrit.wikimedia.org/r/576966 (https://phabricator.wikimedia.org/T241849) [22:31:25] Urbanecm: 2 minutes is not long enough to leave a commit in mw-config for me to review it. :-) [22:31:53] (03CR) 10Dzahn: [C: 03+1] site: Assign appservers and API servers in eqiad row C. [puppet] - 10https://gerrit.wikimedia.org/r/576966 (https://phabricator.wikimedia.org/T241849) (owner: 10RLazarus) [22:32:22] Urbanecm: I guess the interwiki config is influenced by something other than site config, I reviewed the diff at the time and it made no meaningful changes. [22:32:30] Might need a fix in the interwiki maintenance script indeed. [22:32:33] We can look at that later [22:32:49] Please file a task with the issue that gewikimedia experienced and what they'd like the behaviour to be instead. 
[22:33:08] (03PS4) 10Krinkle: tests: Re-enable 'family' dblist test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576977 (https://phabricator.wikimedia.org/T239301) [22:33:18] (03CR) 10Dzahn: [C: 03+1] "just don't forget to make the mcrouter certs first like i just did on my merge :)" [puppet] - 10https://gerrit.wikimedia.org/r/576966 (https://phabricator.wikimedia.org/T241849) (owner: 10RLazarus) [22:33:36] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:33:42] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:33:59] Krinkle: I guess T239173 can do the job here? 🙂 [22:34:00] T239173: gewikimedia's w interwiki links to (nonexistent) gewiki - https://phabricator.wikimedia.org/T239173 [22:34:11] (or should i re-file that as a new task?) [22:34:32] James_F: if you could explain what the change breaks, I'd really appreciate it [22:34:34] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) [22:35:41] Urbanecm: When I work it out, I'll tell you. [22:35:54] Urbanecm: It means that InitialiseSettings.php is unpredictable. What happens if a wg setting is true for 'wikimedia' and false for 'special'? [22:36:22] (03PS1) 10C. Scott Ananian: Update scandium (parsoid testing machine) to reflect new Parsoid configuration [puppet] - 10https://gerrit.wikimedia.org/r/576979 (https://phabricator.wikimedia.org/T240055) [22:36:41] But yes, in general, that. [22:36:58] Krinkle: hmm, what would happen if a setting is true for small and false for wikimedia? 
Or is that the same issue? [22:36:59] I did not anticipate that interwiki map also looks at family. We'll need to figure out a way to change that. Possibly with a separate dblist for that purpose unrelated to wgConf, we already have many dblists like this. [22:37:13] Krinkle: Eurgh. But yeah. [22:37:25] Urbanecm: Yes, that is effectively the same issue. A single setting must not use multiple tags that overlap. [22:37:45] cscott: i can merge that if ready [22:38:13] This can be reasonably done in code review if you know that family is 1:1. But if a wiki is in wikimania + special, we can never use them safely. [22:38:33] It is okay that we cannot use small/wikimedia mixed. It is not okay that we can't use special/wikimania/wikipedia anywhere in wmf-config. [22:39:10] Krinkle: thanks for the explanation. [22:39:56] (03PS5) 10Krinkle: tests: Re-enable 'family' dblist test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576977 (https://phabricator.wikimedia.org/T239301) [22:41:13] Krinkle: just one small question, do you think we should reopen T239173, or fill a new task? Can do either of those things, just unsure which fits better [22:41:14] T239173: gewikimedia's w interwiki links to (nonexistent) gewiki - https://phabricator.wikimedia.org/T239173 [22:41:41] Urbanecm: that seems fine to re-open yeah. [22:41:59] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/576663 (owner: 10Jbond) [22:42:13] Urbanecm: Sorry for shouting. 
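An illustrative aside on the family-overlap problem Krinkle describes above (a setting that is true for 'wikimedia' and false for 'special' when one wiki is in both lists). This is only a minimal Python sketch under stated assumptions: the dblist contents, the example setting, and the helper names `family_conflicts` and `candidate_values` are all hypothetical. It is not the actual 'family' dblist test from mediawiki-config, nor the real setting-resolution logic behind InitialiseSettings.php.

```python
# Hypothetical family dblists, for illustration only.
# The real lists are files such as dblists/special.dblist.
FAMILY_DBLISTS = {
    "wikipedia": {"enwiki", "dewiki"},
    "wikimedia": {"gewikimedia", "nowikimedia"},
    "special": {"metawiki", "gewikimedia"},  # gewikimedia appears twice on purpose
}


def family_conflicts(dblists):
    """Map each wiki listed in more than one family to its sorted families."""
    seen = {}
    for family, wikis in dblists.items():
        for wiki in wikis:
            seen.setdefault(wiki, []).append(family)
    return {wiki: sorted(fams) for wiki, fams in seen.items() if len(fams) > 1}


def candidate_values(setting, wiki, dblists):
    """Collect every value a per-family setting could take for one wiki."""
    return {
        value
        for family, value in setting.items()
        if wiki in dblists.get(family, set())
    }


conflicts = family_conflicts(FAMILY_DBLISTS)

# Hypothetical setting: true for 'wikimedia', false for 'special'.
example_setting = {"wikimedia": True, "special": False}
ambiguous = candidate_values(example_setting, "gewikimedia", FAMILY_DBLISTS)

print(conflicts)
print(ambiguous)
```

In this sketch a wiki that sits in exactly one family yields a single candidate value, while the doubly listed wiki yields two, which is the unpredictability the discussion is about. A check of this shape fails as soon as any wiki shows up in `conflicts`.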
[22:42:27] (03PS1) 10Volans: gitignore: add paths used for local testing [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576984 [22:42:29] (03PS1) 10Volans: dns: convert Netbox data gathering into a class [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576985 (https://phabricator.wikimedia.org/T233183) [22:42:31] (03PS1) 10Volans: dns: convert records management in classes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576986 (https://phabricator.wikimedia.org/T233183) [22:42:33] (03PS1) 10Volans: dns: fix sub/24 IPv4 netmasks file generation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576987 (https://phabricator.wikimedia.org/T233183) [22:42:58] (03CR) 10jerkins-bot: [V: 04-1] dns: convert Netbox data gathering into a class [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576985 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [22:43:25] Krinkle: done, thanks. [22:44:53] James_F: thanks. It's hard to understand what actually is wrong with just limited information from your messages :-) [22:45:02] Thanks again Krinkle for explaining the issue. 
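Editor's sketch of the rule Krinkle states above ("A single setting must not use multiple tags that overlap"). This is not code from mediawiki-config. The function name `find_overlapping_tags`, the dblist contents, and the setting values are hypothetical, chosen only to mirror the 'wikimedia'/'special' case from the discussion. It assumes each tag maps to a set of wiki database names and reports every pair of tags used by one setting that share at least one wiki.

```python
from itertools import combinations


def find_overlapping_tags(setting_values, dblists):
    """Return (tag_a, tag_b, shared_wikis) for each pair of tags that one
    setting uses and that have at least one wiki in common.

    setting_values: dict mapping tag name -> configured value (hypothetical shape)
    dblists: dict mapping tag name -> set of wiki database names
    """
    overlaps = []
    # Keys such as 'default' are not backed by a dblist, so skip them.
    tags = sorted(tag for tag in setting_values if tag in dblists)
    for tag_a, tag_b in combinations(tags, 2):
        shared = dblists[tag_a] & dblists[tag_b]
        if shared:
            overlaps.append((tag_a, tag_b, sorted(shared)))
    return overlaps


# Made-up data: 'gewikimedia' sits in both 'wikimedia' and 'special'.
dblists = {
    "wikimedia": {"gewikimedia", "nowikimedia"},
    "special": {"gewikimedia", "metawiki"},
    "wikinews": {"enwikinews"},
}
setting = {"default": False, "wikimedia": True, "special": False, "wikinews": True}

overlaps = find_overlapping_tags(setting, dblists)
for tag_a, tag_b, shared in overlaps:
    print(f"ambiguous: '{tag_a}' and '{tag_b}' both match {shared}")
```

A check along these lines could run once per setting in a test suite and fail on any non-empty result. Whether overlaps with identical values should be tolerated is a design choice this sketch does not settle; it flags every overlap.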
[22:45:30] (03PS2) 10Volans: dns: convert Netbox data gathering into a class [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576985 (https://phabricator.wikimedia.org/T233183) [22:45:32] (03PS2) 10Volans: dns: convert records management in classes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576986 (https://phabricator.wikimedia.org/T233183) [22:45:34] (03PS2) 10Volans: dns: fix sub/24 IPv4 netmasks file generation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576987 (https://phabricator.wikimedia.org/T233183) [22:46:02] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) @ArielGlenn Hey, do you still need a separate puppetmaster (deployment-dumps-puppetmaster02) for deployment-snapshot01, distinct from the usu... [22:46:27] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) [22:47:22] jouncebot: now [22:47:22] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [22:47:24] jouncebot: next [22:47:24] In 1 hour(s) and 12 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200305T0000) [22:49:00] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:49:01] (03PS1) 10C. 
Scott Ananian: Remove Parsoid node service; replace with git checkout on RT testing server [puppet] - 10https://gerrit.wikimedia.org/r/576990 (https://phabricator.wikimedia.org/T240055) [22:50:04] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576984 (owner: 10Volans) [22:51:04] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22049 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:51:13] (03CR) 10C. Scott Ananian: "This might be overly-ambitious -- all I really want is for a git checkout to end up in /srv/parsoid-testing" [puppet] - 10https://gerrit.wikimedia.org/r/576990 (https://phabricator.wikimedia.org/T240055) (owner: 10C. Scott Ananian) [22:51:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:51:24] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:53:02] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:57] (03CR) 10Subramanya Sastry: "I would think there are uses of this class in the codebase (besides rt testing) that need to be inspected to make sure this does the right" [puppet] - 10https://gerrit.wikimedia.org/r/576990 (https://phabricator.wikimedia.org/T240055) (owner: 10C. 
Scott Ananian) [22:55:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:42] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:48] 10Operations, 10serviceops, 10Patch-For-Review: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 7 host(s) and their services with reason: new_install ` mw[1393-1399].eqi... [22:57:50] (03CR) 10Jforrester: "Looks probably right." [puppet] - 10https://gerrit.wikimedia.org/r/576990 (https://phabricator.wikimedia.org/T240055) (owner: 10C. Scott Ananian) [22:59:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:17] 10Operations, 10serviceops, 10Patch-For-Review: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install ` mw[1400-1404].eqi... [23:16:11] (03CR) 10Dzahn: [C: 03+2] Update scandium (parsoid testing machine) to reflect new Parsoid configuration [puppet] - 10https://gerrit.wikimedia.org/r/576979 (https://phabricator.wikimedia.org/T240055) (owner: 10C. 
Scott Ananian) [23:16:20] (03PS1) 10Bstorm: toolforge: remove monitoring for old k8s cluster nodes and flannel etcd [puppet] - 10https://gerrit.wikimedia.org/r/576992 (https://phabricator.wikimedia.org/T246689) [23:16:47] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:19:07] (03PS3) 10Jforrester: [WiP] Provide infrastructure to create InitialiseSettings.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576514 [23:23:35] Krinkle: I'd appreciate you eyeballing the output of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/576514/ before I write a patch on top of it to wire it up in CommonSettings.php [23:24:02] (03CR) 10Dzahn: [C: 04-1] "this would break prod wtp servers and not just affect scandium. also fails because:" [puppet] - 10https://gerrit.wikimedia.org/r/576990 (https://phabricator.wikimedia.org/T240055) (owner: 10C. Scott Ananian) [23:24:31] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/21274/" [puppet] - 10https://gerrit.wikimedia.org/r/576990 (https://phabricator.wikimedia.org/T240055) (owner: 10C. 
Scott Ananian) [23:25:27] PROBLEM - Disk space on stat1007 is CRITICAL: DISK CRITICAL - free space: /srv 279197 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [23:26:26] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/CirrusSearch/includes/: T245303 (duration: 01m 02s) [23:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:32] T245303: CirrusSearch logging logs with reserved parameter message - https://phabricator.wikimedia.org/T245303 [23:26:39] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Dwisehaupt) Base OS installed. Puppet runs aren't clean yet due to the private repo needing sync. Will catch up with that tomorrow. [23:26:46] (03CR) 10Krinkle: [C: 03+1] Revert "Add gewikimedia to special wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576974 (owner: 10Jforrester) [23:26:52] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Dwisehaupt) [23:27:06] (03PS2) 10Krinkle: multiversion: Update copy of SiteConfiguration to match current MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576889 [23:27:15] (03CR) 10Krinkle: [C: 03+2] multiversion: Update copy of SiteConfiguration to match current MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576889 (owner: 10Krinkle) [23:27:22] (03PS2) 10Krinkle: tests: Remove SiteConfiguration, use src/StaticSiteConfiguration instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576890 [23:27:26] (03CR) 10Krinkle: [C: 03+2] tests: Remove SiteConfiguration, use src/StaticSiteConfiguration instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576890 (owner: 10Krinkle) [23:28:00] (03PS2) 10Krinkle: [WIP] 
MWConfigCacheGenerator: Stop reading most wiki-family dblist files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576490 (https://phabricator.wikimedia.org/T169821) [23:28:04] (03CR) 10Dzahn: [C: 04-1] "So scandium uses "role(parsoid::testing)" and wtp prod servers use "role(parsoid)". It's probably easier if you change what is included in" [puppet] - 10https://gerrit.wikimedia.org/r/576990 (https://phabricator.wikimedia.org/T240055) (owner: 10C. Scott Ananian) [23:28:13] (03PS3) 10Krinkle: MWConfigCacheGenerator: Stop reading most wiki-family dblist files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576490 (https://phabricator.wikimedia.org/T169821) [23:28:17] (03Merged) 10jenkins-bot: multiversion: Update copy of SiteConfiguration to match current MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576889 (owner: 10Krinkle) [23:28:29] (03Merged) 10jenkins-bot: tests: Remove SiteConfiguration, use src/StaticSiteConfiguration instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576890 (owner: 10Krinkle) [23:28:39] * Krinkle pulling on deploy1001 but no deploy as it's build/test only [23:29:04] meh, might as well sync so we don't forget in the future just in case [23:29:20] Krinkle: Yeah, it'd be a pain. :-) [23:29:20] (03CR) 10Bstorm: "PCC for toolschecker: https://puppet-compiler.wmflabs.org/compiler1001/21276/tools-checker-03.tools.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/576992 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [23:30:15] (03CR) 10Jforrester: [C: 04-1] "Not currently empty; this is a '+steward' equivalent." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/575390 (https://phabricator.wikimedia.org/T237890) (owner: 10Jforrester) [23:30:43] !log krinkle@deploy1001 Synchronized src/: Ic344b48a1f8 - creates StaticSiteConfiguration.php (build-only) (duration: 01m 03s) [23:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:49] James_F: RE IS.json, will take a look. Note that a sprint to flesh out the wider YAML and possible build step or commit is scheduled for the upcoming quarter (was discussed earlier this week with Greg and confirmed, yay!) [23:30:59] Excellent. [23:31:06] I suppose you know that already, but just confirming Perf is on-board [23:31:09] in terms of resourcing [23:31:15] Yeah, I didn't know if you'd be free. [23:31:24] has anyone seen puppet-master logging *everything* to /var/log/debug before? [23:31:24] OK, will pretend I've not worked on it for the next four weeks. ;-) [23:31:49] Krenair: -sre might answer if not here. [23:31:56] good point [23:32:27] (03CR) 10Krinkle: [C: 04-1] "Well, now what. It still produces an interesting diff.." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576490 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [23:32:33] James_F: another bug? [23:32:45] (03CR) 10Dzahn: [C: 03+1] DNS: Add mgmt and production DNS for mw2350 to mw2365 [dns] - 10https://gerrit.wikimedia.org/r/576934 (owner: 10Papaul) [23:33:20] Woah, that's surprising [23:33:48] ah. 
crap, well, it's genuine and I think in this case the bad result was intentional [23:33:57] there is a clash between 'commonsuploads' and 'wikinews' [23:34:13] I'm not sure why but it seems that 'wikinews' is meant to win there [23:34:21] 'commonsuploads' => '//commons.wikimedia.org/wiki/Special:UploadWizard?uselang=$lang', [23:34:21] 'wikinews' => '//commons.wikimedia.org/wiki/Special:UploadWizard', [23:36:45] (03PS1) 10Bstorm: toolforge: remove old k8s client material for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/576995 (https://phabricator.wikimedia.org/T246689) [23:39:20] Krinkle: Yeah, I guess? [23:39:30] git-blamed to: [23:39:31] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/88072/ [23:39:34] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/245945/ [23:40:02] I'm OK with wikinews regressing. [23:40:27] It's not a major feature; will only affect people using the upload link in a non-default interface language. [23:41:00] yeah, if anything I think it was unintentional to leave wikinews unchanged. [23:41:13] Let's over-ride and proceed. [23:41:19] need to check for any wikinews wikis that have subdomains that are not MW language codes [23:41:36] There aren't any. [23:41:44] There are only a handful of wikinewses. [23:46:06] hm.. 
indeed, there are only wikinewses with 2-letter db prefixes, and bhwiki has no wikinews counterpart [23:47:35] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for mw2350 to mw2365 [dns] - 10https://gerrit.wikimedia.org/r/576934 (owner: 10Papaul) [23:47:42] (03PS2) 10Papaul: DNS: Add mgmt and production DNS for mw2350 to mw2365 [dns] - 10https://gerrit.wikimedia.org/r/576934 [23:47:47] (03CR) 10Papaul: [V: 03+2 C: 03+2] DNS: Add mgmt and production DNS for mw2350 to mw2365 [dns] - 10https://gerrit.wikimedia.org/r/576934 (owner: 10Papaul) [23:49:47] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) [23:55:00] (03PS1) 10Krinkle: Fix wgUploadNavigationUrl conflict between 'commonsuploads' and 'wikinews' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576999 [23:55:07] James_F: ugh, there are two other settings that also have the same overlap between the two [23:55:22] There are actually 24 wikinews wikis not in 'commonsuploads' [23:55:33] which includes permission settings [23:55:56] once we fix this ambiguity, it would be good to have a unit test to hard-reject configs that introduce such ambiguity [23:56:25] e.g. liwikinews has wgEnableUploads=false [23:56:35] easy to substitute for now though, but surprising indeed [23:56:51] PROBLEM - Check systemd state on mw1397 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:03] PROBLEM - Check systemd state on mw1394 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:23] PROBLEM - Check systemd state on mw1399 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:41] PROBLEM - Check systemd state on mw1398 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:59] PROBLEM - Check systemd state on mw1395 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:15] PROBLEM - Check systemd state on mw1396 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:41] those are new servers. i got it [23:58:53] PROBLEM - Check systemd state on mw1393 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:59] new but known issue with envoyproxy [23:59:13] PROBLEM - Check systemd state on mw1400 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state