[00:06:38] it's funny that jouncebot does the parsing, I made a thing that does the calendaring *from* json to wikitext so I don't have to generate the calendar by hand [00:07:15] which is to say: deployment calendar as a wikipage means we make weird things to generate it and weird things to read it [00:09:27] !log jforrester@deploy1002 Started deploy [integration/docroot@63b6fb6]: Sync with CI updates (no-op) [00:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:35] !log jforrester@deploy1002 Finished deploy [integration/docroot@63b6fb6]: Sync with CI updates (no-op) (duration: 00m 08s) [00:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:32] Actual automated conch management/pinging/deployment tooling would be very nice. [00:13:15] thcipriani: I guess around these parts we measure cyclomatic complexity by the number of times data transforms into and out of wikitext. [00:13:33] :D [00:16:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:23:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:26:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1018.wikimedia.org', 'cloudcephosd1016.wikimedia.org', 'clo...
[00:26:55] (03PS1) 10BryanDavis: Reparse deploy page before announcing an event [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/680033 (https://phabricator.wikimedia.org/T243394) [00:30:57] !log Delete old data at doc1001:/srv/doc/cover/PasswordBlacklist (ref T254799) [00:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:06] T254799: Rename wikimedia/password-blacklist library - https://phabricator.wikimedia.org/T254799 [00:32:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:33:40] Krinkle thanks [00:38:02] it's gone from https://doc.wikimedia.org/cover/ but still exists at https://doc.wikimedia.org/cover/mediawiki-libs-PasswordBlacklist/ [00:38:48] DannyS712: cdn cache [00:38:54] will expire on its own [00:38:59] okay [00:39:03] https://doc.wikimedia.org/cover/mediawiki-libs-PasswordBlacklist/?_not_here [00:39:36] the main page also now lists CLDRPluralRuleParser, which I don't think was there earlier - going to go write some tests for that...
[00:41:12] RECOVERY - WDQS SPARQL on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.109 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [00:41:44] RECOVERY - Query Service HTTP Port on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [00:42:26] "Generated … Fri Mar 26 2021" [00:42:42] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:43] in that case, I guess I just missed it :) [00:43:18] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:40] (03PS1) 10Bstorm: gridengine: set additional grid-configurator source files to new domain [puppet] - 10https://gerrit.wikimedia.org/r/680038 (https://phabricator.wikimedia.org/T277653) [00:45:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:50:56] (03PS2) 10Bstorm: gridengine: set additional grid-configurator source files to new domain [puppet] - 10https://gerrit.wikimedia.org/r/680038 (https://phabricator.wikimedia.org/T277653) [00:53:32] (03CR) 10Bstorm: [C: 03+2] gridengine: set additional grid-configurator source files to new domain [puppet] - 10https://gerrit.wikimedia.org/r/680038 (https://phabricator.wikimedia.org/T277653) (owner: 10Bstorm) [01:06:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:17:40] PROBLEM - Prometheus 
jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:40:14] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [01:47:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:52:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:05:00] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:42:55] (03PS1) 10Ryan Kemper: wdqs: int can't take in float as string [cookbooks] - 10https://gerrit.wikimedia.org/r/680095 (https://phabricator.wikimedia.org/T280108) [02:47:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:53:10] (03PS1) 10Andrew Bogott: Update trove.conf as per Ussuri release notes [puppet] - 10https://gerrit.wikimedia.org/r/680102 (https://phabricator.wikimedia.org/T212595) [02:54:16] (03CR) 10Andrew Bogott: [C: 03+2] Update trove.conf as per Ussuri release notes [puppet] - 10https://gerrit.wikimedia.org/r/680102 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [02:54:59] (03CR) 10Krinkle: [C: 03+2] [Beta Cluster] mc: Remove unused 
mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677733 (owner: 10Krinkle) [02:55:09] (03CR) 10jerkins-bot: [V: 04-1] [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677733 (owner: 10Krinkle) [02:55:50] (03PS4) 10Krinkle: [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677733 [02:56:01] (03CR) 10Krinkle: [C: 03+2] [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677733 (owner: 10Krinkle) [02:56:38] (03PS4) 10Krinkle: mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677734 [02:56:54] (03Merged) 10jenkins-bot: [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677733 (owner: 10Krinkle) [03:04:24] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [03:04:25] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [03:04:29] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [03:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:21] !log T267927 Last round of `data-transfer`s finished successfully, proceeding to next round [03:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:29] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [03:09:21] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [03:09:22] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [03:09:24] !log 
ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [03:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:42] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [03:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:50] !log T267927 kicked off next round of `data-transfer`s: `wdqs1004`->`wdqs1007`, `wdqs2001`->`wdqs2003`, `wdqs1003`->`wdqs1008`, `wdqs2008`->`wdqs2004` [03:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:16:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:17:36] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1013.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:18:00] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1013.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [03:18:06] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:20:58] ^ My mistake, forgot the cookbook in its current state is not repooling [03:22:22] !log T267927 Pooled `wdqs1006` and `wdqs2002` [03:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:31] T267927: Reload wikidata journal from fresh dumps - 
https://phabricator.wikimedia.org/T267927 [03:22:42] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:24:22] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [03:24:44] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:25:39] Looks like `wdqs1013` was down for a bit (unsure) and the fact that I hadn't yet repooled `wdqs1006` meant pybal did not automatically de-pool it because we already had 2 of the 6 hosts down for the `data-transfer` before accounting for `wdqs1013` briefly dropping offline [03:25:49] s/(unsure)/(unsure why)/ [03:26:41] !log T267927 Pooled `wdqs2001` [03:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:04] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:31:50] !log [wdqs] `ryankemper@wdqs1013:~$ sudo systemctl restart wdqs-blazegraph` [03:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:34:38] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:05:20] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 19255 bytes in 4.951 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [04:05:36] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 19254 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [04:05:40] PROBLEM - 
Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 19253 bytes in 0.213 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [04:33:24] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Ladsgroup) There was one thing I missed in my comment on autolader and Joe pointed out in IRC that I did the test on mwdebug (where you can do xhgui... [05:17:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:25:27] (03PS1) 10Legoktm: [WIP] lists: Notify when exclude_backups list is out of sync [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) [05:26:36] (03CR) 10jerkins-bot: [V: 04-1] [WIP] lists: Notify when exclude_backups list is out of sync [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm) [05:27:19] (03PS2) 10Legoktm: [WIP] lists: Notify when exclude_backups list is out of sync [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) [05:30:43] (03PS1) 10Marostegui: install_server: Reimage db2094,db2095 to buster [puppet] - 10https://gerrit.wikimedia.org/r/680162 (https://phabricator.wikimedia.org/T275112) [05:31:02] 10ops-eqiad, 10Analytics: an-worker1100 disk swap required - https://phabricator.wikimedia.org/T280313 (10elukey) [05:32:25] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2094,db2095 to buster [puppet] - 10https://gerrit.wikimedia.org/r/680162 (https://phabricator.wikimedia.org/T275112) (owner: 10Marostegui) [05:35:11] 10ops-eqiad, 10Analytics: an-worker1100 disk swap required - 
https://phabricator.wikimedia.org/T280313 (10elukey) [05:35:13] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10elukey) [05:36:04] (03PS3) 10Legoktm: [WIP] lists: Notify when exclude_backups list is out of sync [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) [05:36:47] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29090/console" [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm) [05:37:25] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (10elukey) @razzi ` elukey@an-worker1100:~$ cat /proc/mounts | grep /var/lib/hadoop/data /dev/sdx1 /var/lib/hadoop/data/w ext4 rw,relatime 0 0 /dev/sdl1 /var/lib/hadoop/data/k ext4 ro,relatime 0 0 <============... [05:37:38] (03PS4) 10Legoktm: lists: Notify when exclude_backups list is out of sync [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) [05:38:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-labs site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:42:16] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics-tool1001.eqiad.wmnet [05:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:05] last vm using cloudera CDH packages --^ [05:43:46] I am going to send another patch to clean up our repos very soon [05:45:54] (03CR) 10Elukey: [V: 03+1 C: 03+2] Decommission analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/679892 (https://phabricator.wikimedia.org/T280262) (owner: 10Elukey) [05:45:59] (03PS3) 10Elukey: Decommission analytics-tool1001 [puppet] - 10https://gerrit.wikimedia.org/r/679892 
(https://phabricator.wikimedia.org/T280262) [05:48:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2094.codfw.wmnet with reason: REIMAGE [05:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:32] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:50:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2094.codfw.wmnet with reason: REIMAGE [05:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:55] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:54:40] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics-tool1001.eqiad.wmnet [05:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:39] (03PS1) 10Marostegui: db2094: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/680166 (https://phabricator.wikimedia.org/T275112) [06:05:16] (03CR) 10Marostegui: [C: 03+2] db2094: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/680166 (https://phabricator.wikimedia.org/T275112) (owner: 10Marostegui) [06:06:45] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [06:09:10] (03PS1) 10Elukey: Remove cloudera-related packages [puppet] - 10https://gerrit.wikimedia.org/r/680167 (https://phabricator.wikimedia.org/T280262) [06:09:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-labs site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:09:20] moritzm: --^ the day has finally come! :D [06:10:16] (03CR) 10jerkins-bot: [V: 04-1] Remove cloudera-related packages [puppet] - 10https://gerrit.wikimedia.org/r/680167 (https://phabricator.wikimedia.org/T280262) (owner: 10Elukey) [06:12:11] (03PS2) 10Elukey: Remove cloudera-related packages [puppet] - 10https://gerrit.wikimedia.org/r/680167 (https://phabricator.wikimedia.org/T280262) [06:13:27] (03CR) 10Legoktm: [C: 03+1] Fix error message if MWScript.php is run without arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679517 (owner: 10Ahmon Dancy) [06:16:27] (03CR) 10Legoktm: lists: Add option to enable mailman3 on lists (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [06:16:36] (03PS5) 10Legoktm: lists: Add option to enable mailman3 on lists [puppet] - 10https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [06:17:46] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29091/console" [puppet] - 10https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [06:19:15] (03CR) 10Legoktm: [V: 03+1 C: 03+2] lists: Add option to enable mailman3 on lists [puppet] - 10https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [06:19:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2095.codfw.wmnet with reason: REIMAGE [06:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:35] (03CR) 10Elukey: "To keep archives happy - this can be done overriding the hadoop.log.dir var (with -D etc..), probably via the hadoop-env.sh file." 
[puppet] - 10https://gerrit.wikimedia.org/r/661391 (https://phabricator.wikimedia.org/T265126) (owner: 10Razzi) [06:20:55] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1028.eqiad.wmnet [06:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:22] (03PS1) 10Marostegui: db2095: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/680172 (https://phabricator.wikimedia.org/T275112) [06:22:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2095.codfw.wmnet with reason: REIMAGE [06:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:07] (03CR) 10Marostegui: [C: 03+2] db2095: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/680172 (https://phabricator.wikimedia.org/T275112) (owner: 10Marostegui) [06:26:05] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29092/console" [puppet] - 10https://gerrit.wikimedia.org/r/676508 (owner: 10DannyS712) [06:27:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-labs site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:27:46] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1028.eqiad.wmnet [06:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:59] 10SRE, 10SRE-tools, 10IPv6, 10User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (10elukey) All good from the kafka-main2001 side! 
We can enable it everywhere [06:39:32] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1029.eqiad.wmnet [06:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:17] (03PS1) 10Elukey: role::analytics_cluster::hadoop::standby: move hadoop dirs to /srv [puppet] - 10https://gerrit.wikimedia.org/r/680179 (https://phabricator.wikimedia.org/T265126) [06:46:02] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29094/console" [puppet] - 10https://gerrit.wikimedia.org/r/680179 (https://phabricator.wikimedia.org/T265126) (owner: 10Elukey) [06:47:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:48:04] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1029.eqiad.wmnet [06:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:19] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1030.eqiad.wmnet [06:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:49] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1030.eqiad.wmnet [06:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210416T0700) [07:07:17] (03CR) 10Gehel: [C: 04-1] "See comment inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/680095 (https://phabricator.wikimedia.org/T280108) (owner: 10Ryan Kemper) [07:13:33] (03PS3) 10Amire80: Add default import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676930 (https://phabricator.wikimedia.org/T214139) [07:19:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P15372 and previous config saved to /var/cache/conftool/dbconfig/20210416-071936-marostegui.json [07:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:32] (03CR) 10Gehel: postgres: use remote script on replica to resync (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [07:21:45] (03Abandoned) 10Gehel: Migrate wcqs to wcqs-beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/615810 (owner: 10ZPapierski) [07:21:47] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for lexnasser [puppet] - 10https://gerrit.wikimedia.org/r/679854 (owner: 10Muehlenhoff) [07:27:44] (03PS1) 10Ema: cache: enable exp caching policy on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/680197 (https://phabricator.wikimedia.org/T275809) [07:30:46] (03PS2) 10Ema: cache: enable exp caching policy on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/680197 (https://phabricator.wikimedia.org/T275809) [07:32:04] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/680197 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [07:38:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
[07:39:04] (03CR) 10Ema: [C: 03+2] cache: enable exp caching policy on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/680197 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [07:41:09] !log cp-upload_ulsfo: rolling varnish-frontend-restart to apply exp policy settings changes starting from empty caches T275809 [07:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:20] T275809: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 [07:42:53] (03CR) 10Muehlenhoff: [C: 03+1] "https://www.youtube.com/watch?v=aCbfMkh940Q" [puppet] - 10https://gerrit.wikimedia.org/r/680167 (https://phabricator.wikimedia.org/T280262) (owner: 10Elukey) [07:44:19] moritzm: nuke it! :D [07:48:04] (03CR) 10Elukey: [C: 03+2] Remove cloudera-related packages [puppet] - 10https://gerrit.wikimedia.org/r/680167 (https://phabricator.wikimedia.org/T280262) (owner: 10Elukey) [07:49:07] ema: it will be a good moment after https://www.cloudera.com/downloads/paywall-expansion.html [07:52:11] (03CR) 10Ema: [C: 03+1] trafficserver: remove comment about a server that won't exist anymore [puppet] - 10https://gerrit.wikimedia.org/r/679530 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [07:53:16] !log run reprepro --delete clearvanished on apt1001 to clear all cloudera packages [07:53:22] moritzm: it is done! 
\o/ [07:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:48] very nice :-) [08:01:21] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] flaggedrevs.php: Use MediaWikiServices, not an extension function (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 (owner: 10Gergő Tisza) [08:15:44] (03CR) 10Jcrespo: [C: 04-1] "I suggested to do something like this for maintenance, but rlazarus mentioned it shouldn't be needed due to mailman2 being about to be dec" [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm) [08:22:24] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add hmonroy to deployment [puppet] - 10https://gerrit.wikimedia.org/r/679811 (https://phabricator.wikimedia.org/T280177) (owner: 10Filippo Giunchedi) [08:22:30] (03PS2) 10Filippo Giunchedi: admin: add hmonroy to deployment [puppet] - 10https://gerrit.wikimedia.org/r/679811 (https://phabricator.wikimedia.org/T280177) [08:22:36] (03PS1) 10Majavah: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299) [08:23:13] (03PS2) 10Majavah: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299) [08:24:52] 10SRE, 10Release-Engineering-Team, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10fgiunchedi) [08:26:17] (03PS3) 10Majavah: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-18 [puppet] - 10https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299) [08:26:37] 10SRE, 10Release-Engineering-Team, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (10fgiunchedi) 05Open→03Resolved @HMonroy you are now a member of `deployment` group! 
Resolving task, please reopen if something is amiss. [08:28:48] (03CR) 10Gergő Tisza: "Assuming no train holdups, the extension patch lands on group2 on April 29, so this should be deployed May 3-ish." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 (owner: 10Gergő Tisza) [08:31:23] (03CR) 10Daniel Kinzler: flaggedrevs.php: Use MediaWikiServices, not an extension function (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 (owner: 10Gergő Tisza) [08:31:45] (03PS2) 10Filippo Giunchedi: admin: add awight to graphite-admins [puppet] - 10https://gerrit.wikimedia.org/r/679747 (https://phabricator.wikimedia.org/T280242) (owner: 10Awight) [08:33:10] (03PS1) 10David Caro: wmcs: Add link to runbook on puppet alerts. [puppet] - 10https://gerrit.wikimedia.org/r/680254 [08:33:25] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::analytics_cluster::hadoop::standby: move hadoop dirs to /srv [puppet] - 10https://gerrit.wikimedia.org/r/680179 (https://phabricator.wikimedia.org/T265126) (owner: 10Elukey) [08:34:17] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add awight to graphite-admins [puppet] - 10https://gerrit.wikimedia.org/r/679747 (https://phabricator.wikimedia.org/T280242) (owner: 10Awight) [08:34:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15373 and previous config saved to /var/cache/conftool/dbconfig/20210416-083431-root.json [08:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:47] 10SRE, 10SRE-Access-Requests, 10observability, 10Graphite, 10Patch-For-Review: Requesting access to graphite hosts for awight - https://phabricator.wikimedia.org/T280242 (10fgiunchedi) [08:37:21] 10SRE, 10SRE-Access-Requests, 10observability, 10Graphite, 10Patch-For-Review: Requesting access to graphite hosts for awight - https://phabricator.wikimedia.org/T280242 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi 
This is implemented now! @awight I've expanded a little https://wikitech.wikimedi... [08:38:36] (03PS1) 10Jcrespo: backups: move the exclude backups list for mailman to hiera [puppet] - 10https://gerrit.wikimedia.org/r/680255 (https://phabricator.wikimedia.org/T279237) [08:39:10] 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10fgiunchedi) p:05Triage→03Medium [08:40:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:43:52] (03CR) 10Jcrespo: "This is a noop: https://puppet-compiler.wmflabs.org/compiler1001/29095/backup1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/680255 (https://phabricator.wikimedia.org/T279237) (owner: 10Jcrespo) [08:44:12] (03CR) 10Jcrespo: [C: 03+2] backups: move the exclude backups list for mailman to hiera [puppet] - 10https://gerrit.wikimedia.org/r/680255 (https://phabricator.wikimedia.org/T279237) (owner: 10Jcrespo) [08:44:37] 10SRE, 10netops: Allow bast1003 in management routers (and drop bast1002) - https://phabricator.wikimedia.org/T280253 (10fgiunchedi) p:05Triage→03Medium [08:44:58] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10fgiunchedi) p:05Triage→03High [08:45:07] 10SRE, 10Packaging: Package php-ast in {stretch,buster}-wikimedia/component - https://phabricator.wikimedia.org/T280210 (10fgiunchedi) p:05Triage→03Medium [08:45:22] 10SRE, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10fgiunchedi) p:05Triage→03Medium [08:45:46] 10SRE, 10Privacy Engineering, 10Traffic, 10Patch-For-Review, 10Privacy: Visits to Wikimedia 
properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10fgiunchedi) p:05Triage→03Medium [08:47:25] (03CR) 10Jcrespo: "So I've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/680255 which does the move of the list to hiera. You are now ready to " [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm) [08:49:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15374 and previous config saved to /var/cache/conftool/dbconfig/20210416-084935-root.json [08:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:02] (03PS1) 10Filippo Giunchedi: install_server: move thanos-fe to raid0 for /srv [puppet] - 10https://gerrit.wikimedia.org/r/680257 (https://phabricator.wikimedia.org/T280257) [08:52:09] (03PS1) 10Elukey: role::analytics_cluster::hadoop::master: move hadoop dirs under /srv [puppet] - 10https://gerrit.wikimedia.org/r/680259 (https://phabricator.wikimedia.org/T265126) [08:54:34] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff) [08:55:22] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29096/console" [puppet] - 10https://gerrit.wikimedia.org/r/680259 (https://phabricator.wikimedia.org/T265126) (owner: 10Elukey) [08:57:02] godog: Thanks, I'm able to log in! If you have one more moment to bump https://gerrit.wikimedia.org/r/c/679390 , that would unblock me on the next steps of actually deleting the metrics... 
[08:57:19] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Update rewrite rule for https://www.mediawiki.org/FAQ [puppet] - 10https://gerrit.wikimedia.org/r/676508 (owner: 10DannyS712) [08:57:41] (03CR) 10Jcrespo: "BTW, I asked rlazarus if there was a pattern on the url, and he didn't say yes-not sure because unsure or because there is really not a pa" [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm) [08:57:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:57:59] awight: sure, I'll merge that [08:58:10] (03CR) 10Filippo Giunchedi: [C: 03+2] Temporarily disable some reportupdater jobs [puppet] - 10https://gerrit.wikimedia.org/r/679390 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [08:59:46] awight: if you need any help with RU etc.. lemme know! [09:00:39] :-D Thanks godog and elukey for the help, and on a Friday even. [09:01:28] elukey: If it's an easy question to answer, do you know if reportupdater is quiescent at the moment? Then I can trust that my jobs are really disabled... [09:02:39] Ordinarily I would just wait a day, but in this case I'd like to keep momentum, there is a looming data retention deadline of May 1st... 
[09:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15380 and previous config saved to /var/cache/conftool/dbconfig/20210416-090438-root.json [09:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:20] awight: let's move to #analytics :) [09:05:37] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10JMeybohm) a:03JMeybohm etcd cluster is set up now on conf200[4,5,6] although I had some trouble setting it up and I do not yet know why: After the initial puppet runs, the ectd's rejec... [09:06:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10ayounsi) @Cmjohnson asw2-a-eqiad and asw2-b-eqiad have outstanding changes, please make sure to commit them. `lang=diff Changes for 1 device... [09:09:29] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:09:29] PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.193 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:10:57] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:25] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [09:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:39] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10jcrespo) Do you think, with the work done, we could drop support of jessie bacula backups (only etcd cluster was pending with jessie)? [09:12:48] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) [09:13:30] !log imported envoyproxy_1.15.4-1 to buster-wikimedia - T280317 [09:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:01] (03PS2) 10Jcrespo: bacula: Revert TLS 1.0 downgrade on storage servers (including director) [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) [09:14:26] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) [09:15:03] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) [09:16:13] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10JMeybohm) >>! 
In T271573#7008274, @jcrespo wrote: > Do you think, with the work done, we could drop support of jessie bacula backups (only etcd cluster was pending with jessie)? If those... [09:17:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:17:52] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] flaggedrevs.php: Use MediaWikiServices, not an extension function (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 (owner: 10Gergő Tisza) [09:18:19] ACKNOWLEDGEMENT - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Gehel failed data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:19] ACKNOWLEDGEMENT - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.001 second response time Gehel failed data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:18:19] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.181 second response time Gehel failed data transfer - https://phabricator.wikimedia.org/T267927 https://wikit [09:18:20] /wiki/Wikidata_query_service/Runbook [09:18:48] 10SRE, 10SRE-Access-Requests, 10observability, 10Graphite: Requesting access to graphite hosts for awight - https://phabricator.wikimedia.org/T280242 (10awight) @fgiunchedi Thanks, I was able to start making these deletions :-) [09:19:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15383 and previous config saved to /var/cache/conftool/dbconfig/20210416-091942-root.json [09:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:16] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2009.codfw.wmnet [09:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:24:50] (03PS2) 10Filippo Giunchedi: install_server: move thanos-fe to raid0 for /srv [puppet] - 10https://gerrit.wikimedia.org/r/680257 (https://phabricator.wikimedia.org/T280257) [09:25:17] seeking an easy +1 for ^ [09:27:51] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2009.codfw.wmnet [09:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:30:12] godog: having a look in a bit [09:30:54] moritzm: thank you! appreciate it [09:31:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/680257 (https://phabricator.wikimedia.org/T280257) (owner: 10Filippo Giunchedi) [09:33:02] (03PS1) 10Awight: [DNM] Revert "Temporarily disable some reportupdater jobs" [puppet] - 10https://gerrit.wikimedia.org/r/680021 (https://phabricator.wikimedia.org/T279046) [09:33:33] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2010.codfw.wmnet [09:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:41] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: move thanos-fe to raid0 for /srv [puppet] - 10https://gerrit.wikimedia.org/r/680257 (https://phabricator.wikimedia.org/T280257) (owner: 10Filippo Giunchedi) [09:34:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15384 and previous config saved to /var/cache/conftool/dbconfig/20210416-093446-root.json [09:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:04] (03PS1) 10JMeybohm: New envoy upstream version 1.15.4 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/680265 (https://phabricator.wikimedia.org/T280317) [09:37:38] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] New envoy upstream version 1.15.4 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/680265 (https://phabricator.wikimedia.org/T280317) (owner: 10JMeybohm) [09:39:30] (03PS1) 10David Caro: wmsc: add role to the hiera hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) [09:40:04] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: 10David Caro) 
[09:40:56] (03PS2) 10David Caro: wmsc: add role to the hiera hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) [09:40:59] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2010.codfw.wmnet [09:40:59] (03CR) 10Muehlenhoff: [C: 03+2] Remove Python 2 packages on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/679768 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [09:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:47] (03PS3) 10David Caro: wmsc: add role to the hiera hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) [09:43:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=etcd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:47:11] RECOVERY - Thanos compact has not run on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:47:27] RECOVERY - Thanos compact is halted on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:52:50] PROBLEM - Thanos compact has disappeared from Prometheus discovery on alert1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [09:55:15] !log imported envoyproxy_1.15.4-1 to stretch-wikimedia - T280317 [09:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:40] the thanos alerts are expected [09:57:59] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2011.codfw.wmnet [09:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:13] (03CR) 10Jbond: "I think this is a great improvement and would love to see it merged. However I'm cautions as i suspect there will be lots of local projec" [puppet] - 10https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: 10David Caro) [09:59:51] (03CR) 10Jbond: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: 10David Caro) [10:00:18] !log updated envoyproxy to 1.15.4-1 on mwdebug1001.eqiad.wmnet [10:00:21] (03CR) 10Jbond: "> FYI" [puppet] - 10https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: 10David Caro) [10:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:02] 10Puppet, 10SRE-tools, 10Patch-For-Review, 10Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10MoritzMuehlenhoff) @crusnov Could you please take modules/raid/files/check-raid.py with precedence? It's part of a Bullseye b... 
[10:03:11] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: REIMAGE [10:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:53] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2011.codfw.wmnet [10:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:19] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: REIMAGE [10:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:06] (03PS2) 10Arturo Borrero Gonzalez: cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - 10https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) [10:08:13] !log updated envoyproxy to 1.15.4-1 on mw1325.eqiad.wmnet,restbase1026.eqiad.wmnet [10:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:33] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2012.codfw.wmnet [10:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:18] (03PS2) 10Ayounsi: Merge all system.conf templates in one [homer/public] - 10https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345) [10:10:20] (03PS1) 10Ayounsi: Remove dump-on-panic [homer/public] - 10https://gerrit.wikimedia.org/r/680276 (https://phabricator.wikimedia.org/T269345) [10:11:44] (03CR) 10Ayounsi: [C: 04-1] cr/firewall: add kafka-logging servers to labs-in filters (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: 10Arturo Borrero Gonzalez) [10:12:22] (03CR) 10Alexandros Kosiaris: "The change is fine, but it's a noop. The thing changed here is the fixtures for CI. 
What you want to change is https://gerrit.wikimedia.or" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [10:12:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] Envoy: set per_try_timeout for eventgate-main. [deployment-charts] - 10https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [10:13:12] (03PS3) 10Arturo Borrero Gonzalez: cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - 10https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) [10:13:25] (03CR) 10Ayounsi: [C: 03+2] Remove dump-on-panic [homer/public] - 10https://gerrit.wikimedia.org/r/680276 (https://phabricator.wikimedia.org/T269345) (owner: 10Ayounsi) [10:13:39] (03CR) 10Arturo Borrero Gonzalez: cr/firewall: add kafka-logging servers to labs-in filters (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: 10Arturo Borrero Gonzalez) [10:14:07] (03Merged) 10jenkins-bot: Remove dump-on-panic [homer/public] - 10https://gerrit.wikimedia.org/r/680276 (https://phabricator.wikimedia.org/T269345) (owner: 10Ayounsi) [10:15:54] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2012.codfw.wmnet [10:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:33] (03PS1) 10JMeybohm: citoid: Update envoy image to 1.15.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/680280 (https://phabricator.wikimedia.org/T280317) [10:19:48] PROBLEM - Thanos store has high percentage of object storage failures on alert1001 is CRITICAL: job=thanos-compact https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [10:20:47] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2013.codfw.wmnet [10:20:54] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:04] RECOVERY - Thanos store has high percentage of object storage failures on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [10:21:08] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.496e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [10:27:24] (03CR) 10Arturo Borrero Gonzalez: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-18 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299) (owner: 10Majavah) [10:28:11] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2013.codfw.wmnet [10:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:29] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2014.codfw.wmnet [10:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:06] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01438 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [10:36:12] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] citoid: Update envoy image to 1.15.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/680280 (https://phabricator.wikimedia.org/T280317) (owner: 10JMeybohm) [10:36:28] RECOVERY - Thanos compact has disappeared from Prometheus discovery on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview [10:37:30] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] wmcs: Add link to runbook on puppet alerts. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/680254 (owner: 10David Caro) [10:37:54] (03PS1) 10Majavah: aptrepo: Remove unused thirdparty/kubeadm-k8s-1-1[56] [puppet] - 10https://gerrit.wikimedia.org/r/680285 [10:38:25] (03CR) 10Majavah: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-18 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299) (owner: 10Majavah) [10:38:28] (03Merged) 10jenkins-bot: citoid: Update envoy image to 1.15.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/680280 (https://phabricator.wikimedia.org/T280317) (owner: 10JMeybohm) [10:39:06] (03CR) 10Vgutierrez: [C: 03+1] "From https://packages.debian.org/buster/bacula-common it looks like bacula uses OpenSSL 1.1 so I'm wondering if we shouldn't bump the mini" [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo) [10:39:08] (03CR) 10Ayounsi: [C: 03+1] cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - 10https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: 10Arturo Borrero Gonzalez) [10:39:50] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2014.codfw.wmnet [10:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:11] 10SRE: Traceback in icinga-status 'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) 05Open→03Resolved [10:40:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - 10https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: 10Arturo Borrero Gonzalez) [10:40:38] 10Puppet, 10SRE, 10puppet-compiler, 10User-jbond: in puppet 6 some core types have been moved to external modules. 
check and confirm our exposure - https://phabricator.wikimedia.org/T265143 (10jbond) p:05Triage→03Medium [10:41:02] (03Merged) 10jenkins-bot: cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - 10https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: 10Arturo Borrero Gonzalez) [10:42:46] 10Puppet, 10SRE, 10puppet-compiler, 10User-jbond: in puppet 6 some core types have been moved to external modules. check and confirm our exposure - https://phabricator.wikimedia.org/T265143 (10jbond) puppet6 is not going to make it to bullseye so this issues is less urgent for now, that siad it looks like... [10:43:13] 10Puppet, 10SRE, 10User-jbond: Update puppet infrastructure latest 5.5 version - https://phabricator.wikimedia.org/T265139 (10jbond) 05Open→03Resolved [10:43:15] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [10:43:54] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: hiera_lookup failing to preform lookups after hiera5 upgrade - https://phabricator.wikimedia.org/T258931 (10jbond) p:05Triage→03Medium [10:44:48] !log merging homer change to cr-eqiad (T279342) [10:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:56] T279342: Migrate colocated kafka-logging brokers to dedicated kafka-logging hosts - https://phabricator.wikimedia.org/T279342 [10:46:46] PROBLEM - cassandra-a SSL 10.192.32.22:7001 on restbase2015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [10:47:10] 10Puppet, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: puppet populate failing on some nodes - https://phabricator.wikimedia.org/T248169 (10jbond) 05Open→03Resolved This is now possible using the `cumin:` [[ https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Host_variable_override | Host... 
[10:47:18] PROBLEM - cassandra-c CQL 10.192.32.105:9042 on restbase2015 is CRITICAL: connect to address 10.192.32.105 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [10:47:38] PROBLEM - cassandra-c SSL 10.192.32.105:7001 on restbase2015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [10:47:52] PROBLEM - cassandra-a CQL 10.192.32.22:9042 on restbase2015 is CRITICAL: connect to address 10.192.32.22 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [10:48:28] PROBLEM - cassandra-b SSL 10.192.32.25:7001 on restbase2015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [10:48:48] PROBLEM - cassandra-b CQL 10.192.32.25:9042 on restbase2015 is CRITICAL: connect to address 10.192.32.25 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [10:49:20] ^ that's me [10:49:29] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2015.codfw.wmnet [10:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:56] (03PS6) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) [10:55:12] RECOVERY - cassandra-c CQL 10.192.32.105:9042 on restbase2015 is OK: TCP OK - 0.033 second response time on 10.192.32.105 port 9042 https://phabricator.wikimedia.org/T93886 [10:55:12] RECOVERY - cassandra-b CQL 10.192.32.25:9042 on restbase2015 is OK: TCP OK - 0.033 second response time on 10.192.32.25 port 9042 https://phabricator.wikimedia.org/T93886 [10:55:12] RECOVERY - cassandra-a CQL 10.192.32.22:9042 on restbase2015 is OK: TCP OK - 0.033 second response time on 10.192.32.22 port 9042 https://phabricator.wikimedia.org/T93886 [10:55:12] RECOVERY - cassandra-c SSL 
10.192.32.105:7001 on restbase2015 is OK: SSL OK - Certificate restbase2015-c valid until 2022-10-08 10:53:48 +0000 (expires in 539 days) https://phabricator.wikimedia.org/T120662 [10:55:12] RECOVERY - cassandra-b SSL 10.192.32.25:7001 on restbase2015 is OK: SSL OK - Certificate restbase2015-b valid until 2022-10-08 10:53:45 +0000 (expires in 539 days) https://phabricator.wikimedia.org/T120662 [10:55:13] RECOVERY - cassandra-a SSL 10.192.32.22:7001 on restbase2015 is OK: SSL OK - Certificate restbase2015-a valid until 2022-10-08 10:53:43 +0000 (expires in 539 days) https://phabricator.wikimedia.org/T120662 [10:55:25] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2015.codfw.wmnet [10:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:56] 10SRE, 10netops: Allow bast1003 in management routers (and drop bast1002) - https://phabricator.wikimedia.org/T280253 (10ayounsi) 05Open→03Resolved a:03ayounsi Done. [10:58:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: Remove unused thirdparty/kubeadm-k8s-1-1[56] [puppet] - 10https://gerrit.wikimedia.org/r/680285 (owner: 10Majavah) [10:58:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/680285 (owner: 10Majavah) [11:01:17] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2016.codfw.wmnet [11:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:55] (03CR) 10David Caro: wmcs: Add link to runbook on puppet alerts. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/680254 (owner: 10David Caro) [11:02:31] !log imported ferm 2.5.1-1+wmf1 to bullseye-wikimedia/main T275873 [11:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:41] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [11:04:26] (03PS4) 10Majavah: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-18 [puppet] - 10https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299) [11:08:19] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2016.codfw.wmnet [11:08:21] (03CR) 10Jcrespo: "vgutierrez: feel free to upgrade all client hosts to buster first :-) The downgrade to 1.0 was needed to support jessie hosts. We still ne" [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo) [11:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:34] (03CR) 10Jcrespo: "Also please help implement etcdv3/zookeeper backups so they can move away from jessie:" [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo) [11:17:57] (03PS1) 10Muehlenhoff: Update aliases for Hue [puppet] - 10https://gerrit.wikimedia.org/r/680290 [11:31:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "I'll merge the code including the compat chunk for the old domain. 
It should be a NOOP if all nodes and config files are already using the" [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [11:39:41] (03PS1) 10Muehlenhoff: Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) [11:40:35] (03CR) 10jerkins-bot: [V: 04-1] Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: 10Muehlenhoff) [11:56:53] (03PS2) 10Muehlenhoff: Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) [11:57:29] (03CR) 10jerkins-bot: [V: 04-1] Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: 10Muehlenhoff) [12:00:35] (03CR) 10Jcrespo: "One thing, let's add the current test host job names to the list of ignorelist for production monitoring at: https://phabricator.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: 10Muehlenhoff) [12:06:20] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [12:09:18] (03PS1) 10Urbanecm: testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680302 (https://phabricator.wikimedia.org/T279853) [12:09:20] (03PS1) 10Urbanecm: testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680303 (https://phabricator.wikimedia.org/T279853) [12:09:22] (03PS1) 10Urbanecm: 
wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680304 (https://phabricator.wikimedia.org/T279853) [12:09:48] (03PS1) 10Jcrespo: backups: Disable bacula monintoring for sretest hosts [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) [12:10:10] (03PS2) 10Jcrespo: backups: Disable bacula monintoring for sretest hosts [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) [12:10:37] (03PS3) 10Jcrespo: backups: Disable bacula monitoring for sretest hosts [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) [12:11:03] (03CR) 10Urbanecm: [C: 04-2] "not now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680303 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [12:11:07] (03CR) 10Urbanecm: [C: 04-2] "not now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680304 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [12:12:07] (03CR) 10Jcrespo: [C: 04-1] "Only blocked on the specific day of the week in which they end up setup (currently, Friday)- which is deterministic but spread out among t" [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) (owner: 10Jcrespo) [12:12:19] (03CR) 10Urbanecm: [C: 03+1] "this sounds like a good thing to do" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 (owner: 10Gergő Tisza) [12:14:30] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1060-production-search-eqiad on elastic1060 is CRITICAL: 135.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-eqiad&var-instance=elastic1060&panelId=37 [12:15:26] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: 
Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [12:15:26] should recover in a sec ^ [12:21:18] (PS1) Filippo Giunchedi: swift: fix reimage race on /var/log/swift [puppet] - https://gerrit.wikimedia.org/r/680307 [12:21:20] (PS1) Filippo Giunchedi: thanos: extend retention to 5y [puppet] - https://gerrit.wikimedia.org/r/680308 [12:21:22] (PS1) Filippo Giunchedi: swift: hide diffs for files with sensitive data [puppet] - https://gerrit.wikimedia.org/r/680309 (https://phabricator.wikimedia.org/T280257) [12:22:21] !log updated envoyproxy to 1.15.4-1 on 'A:mw-canary or A:restbase-canary' [12:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:50] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [12:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:12] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 24717 bytes in 3.376 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [12:31:28] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 24706 bytes in 0.558 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [12:31:58] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 24705 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [12:37:18] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[12:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:39] (PS1) JMeybohm: Revert "citoid: Update envoy image to 1.15.4-1" [deployment-charts] - https://gerrit.wikimedia.org/r/680024 [12:39:02] (PS2) JMeybohm: Revert "citoid: Update envoy image to 1.15.4-1" [deployment-charts] - https://gerrit.wikimedia.org/r/680024 (https://phabricator.wikimedia.org/T280317) [12:41:02] (CR) Urbanecm: [C: +1] "this should be good to go" [mediawiki-config] - https://gerrit.wikimedia.org/r/676930 (https://phabricator.wikimedia.org/T214139) (owner: Amire80) [12:41:18] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2017.codfw.wmnet [12:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:59] (PS3) Jbond: Enable bacula on sretest1002 [puppet] - https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: Muehlenhoff) [12:46:21] (CR) Jcrespo: "sorry, wrong patch." [puppet] - https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: Muehlenhoff) [12:47:12] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2017.codfw.wmnet [12:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:43] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [12:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:26] SRE, LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (Manuel) @KFrancis Thank you for sending the document! I just reviewed and signed. 
[12:48:37] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2018.codfw.wmnet [12:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:02] Puppet, SRE, User-jbond: wmf-stylguid checks: unable to ignore violations inside roles - https://phabricator.wikimedia.org/T280353 (jbond) p:Triage→Low [12:54:34] (PS2) Elukey: Update aliases for Hue [puppet] - https://gerrit.wikimedia.org/r/680290 (owner: Muehlenhoff) [12:54:51] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2018.codfw.wmnet [12:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:01] (CR) Elukey: [C: +2] "Thank youuu I missed one!" [puppet] - https://gerrit.wikimedia.org/r/680290 (owner: Muehlenhoff) [12:55:27] (CR) Elukey: [V: +2 C: +2] Update aliases for Hue [puppet] - https://gerrit.wikimedia.org/r/680290 (owner: Muehlenhoff) [12:55:33] (CR) Elukey: [C: +2] Update aliases for Hue [puppet] - https://gerrit.wikimedia.org/r/680290 (owner: Muehlenhoff) [12:55:53] (PS4) Muehlenhoff: Enable bacula on sretest1002 [puppet] - https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) [12:59:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2019.codfw.wmnet [12:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:52] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1060-production-search-eqiad on elastic1060 is OK: (C)100 gt (W)80 gt 16.26 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-eqiad&var-instance=elastic1060&panelId=37 [13:07:33] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
restbase2019.codfw.wmnet [13:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:57] jayme: FYI prometheus can't scrape etcd metrics on conf200[456]:4001 since earlier today [13:11:09] godog: oh, thanks. Will take a look [13:13:02] np! [13:20:14] (CR) Amire80: "Thanks for +1!" [mediawiki-config] - https://gerrit.wikimedia.org/r/676930 (https://phabricator.wikimedia.org/T214139) (owner: Amire80) [13:21:33] (CR) Muehlenhoff: [C: +2] Enable bacula on sretest1002 [puppet] - https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: Muehlenhoff) [13:25:02] (PS1) Ladsgroup: exim: Add support for handling mailman3 inside mailman2 conf [puppet] - https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) [13:25:30] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:26:10] (CR) jerkins-bot: [V: -1] exim: Add support for handling mailman3 inside mailman2 conf [puppet] - https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup) [13:26:21] (PS4) Jcrespo: backups: Disable bacula monitoring for sretest hosts [puppet] - https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) [13:26:32] (PS5) Jcrespo: backups: Disable bacula monitoring for sretest hosts [puppet] - https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) [13:27:27] (PS6) Jcrespo: backups: Disable bacula monitoring for sretest1002 [puppet] - https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) [13:27:36] (PS2) Ladsgroup: exim: Add support for handling mailman3 inside mailman2 conf [puppet] - 
https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) [13:29:27] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) (owner: Jcrespo) [13:29:49] (CR) Jcrespo: [C: +2] backups: Disable bacula monitoring for sretest1002 [puppet] - https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) (owner: Jcrespo) [13:31:31] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup) [13:34:35] (PS1) David Caro: WIP wmcs.enc: Add role of the machine [puppet] - https://gerrit.wikimedia.org/r/680329 (https://phabricator.wikimedia.org/T280324) [13:39:58] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:56:31] SRE, serviceops, Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (JMeybohm) The tlsproxy currently serves a certificate not valid for conf200[4,5,6] (Prometheus errors with: `Get https://conf2004:4001/metrics: x509: certificate is valid for conf2001.cod... 
[14:01:21] (CR) Muehlenhoff: systemd::timer::job: update mailing script with additional options (1 comment) [puppet] - https://gerrit.wikimedia.org/r/679292 (owner: Jbond) [14:06:14] (PS1) Ladsgroup: exim: Fix TLS cert for cloud (in mailman) [puppet] - https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) [14:06:26] (CR) jerkins-bot: [V: -1] exim: Fix TLS cert for cloud (in mailman) [puppet] - https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup) [14:07:41] (PS2) Ladsgroup: exim: Fix TLS cert for cloud (in mailman) [puppet] - https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) [14:08:28] (PS1) Herron: Revert "kafka-logging1003: disable notifications during setup" [puppet] - https://gerrit.wikimedia.org/r/680025 [14:08:37] (CR) jerkins-bot: [V: -1] exim: Fix TLS cert for cloud (in mailman) [puppet] - https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup) [14:09:53] (PS3) Ladsgroup: exim: Fix TLS cert for cloud (in mailman) [puppet] - https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) [14:11:03] (CR) jerkins-bot: [V: -1] exim: Fix TLS cert for cloud (in mailman) [puppet] - https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup) [14:11:29] (CR) Herron: [C: +2] Revert "kafka-logging1003: disable notifications during setup" [puppet] - https://gerrit.wikimedia.org/r/680025 (owner: Herron) [14:12:08] (PS1) Daimona Eaytoy: Relax CSP rules for doc.wm.org/mediawiki-tools-phan-SecurityCheckPlugin [puppet] - https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) [14:14:53] (PS4) Ladsgroup: exim: Fix TLS cert for cloud (in mailman) [puppet] - https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) [14:18:18] !log 
hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2020.codfw.wmnet [14:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:57] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2020.codfw.wmnet [14:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:51] PROBLEM - cassandra-a SSL 10.192.16.153:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:28:23] PROBLEM - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:29:15] PROBLEM - cassandra-b SSL 10.192.16.154:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:29:19] PROBLEM - cassandra-c SSL 10.192.16.155:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [14:31:19] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2021.codfw.wmnet [14:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:51] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 76 probes of 620 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:37:09] RECOVERY - cassandra-a SSL 10.192.16.153:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-a valid until 2022-01-15 15:52:56 +0000 (expires in 274 days) https://phabricator.wikimedia.org/T120662 [14:37:09] RECOVERY - cassandra-b SSL 10.192.16.154:7001 on restbase2021 is OK: SSL OK - Certificate 
restbase2021-b valid until 2022-01-15 15:52:58 +0000 (expires in 274 days) https://phabricator.wikimedia.org/T120662 [14:37:15] RECOVERY - cassandra-c SSL 10.192.16.155:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-c valid until 2022-01-15 15:53:01 +0000 (expires in 274 days) https://phabricator.wikimedia.org/T120662 [14:38:53] RECOVERY - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is OK: TCP OK - 0.032 second response time on 10.192.16.153 port 9042 https://phabricator.wikimedia.org/T93886 [14:39:22] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2021.codfw.wmnet [14:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:35] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 58 probes of 620 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:42:35] (PS3) Herron: replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) [14:43:06] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2022.codfw.wmnet [14:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:40] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (Ladsgroup) https://polymorphic.lists.wmcloud.org/pipermail/test-on-two/2021-April/000001.html https://polymorphic.lists.wmcloud.org/mailman3/postor... 
[14:44:42] (CR) Herron: replace mwlog1001 with new mwlog[12]002 hosts (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: Herron) [14:44:50] (CR) jerkins-bot: [V: -1] replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: Herron) [14:49:49] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2022.codfw.wmnet [14:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:15] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2023.codfw.wmnet [14:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:58] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on restbase-dev[1005-1006].eqiad.wmnet with reason: restarting for kernel update [14:51:59] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on restbase-dev[1005-1006].eqiad.wmnet with reason: restarting for kernel update [14:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:16] SRE: creation of raju@wikipedia.org for fundraising team - https://phabricator.wikimedia.org/T280371 (MNoorWMF) [14:56:25] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:45] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:46] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on restbase-dev1006.eqiad.wmnet with reason: restarting for kernel update [14:58:48] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on restbase-dev1006.eqiad.wmnet with reason: restarting for kernel update [14:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:32] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2023.codfw.wmnet [14:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:26] (∩`-´)⊃━☆゚.*・。゚ ~*restbase spam complete*~ [15:09:54] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1016.wikime... [15:14:42] (CR) Jbond: [C: -1] "Looks good just a minor issue, see comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680329 (https://phabricator.wikimedia.org/T280324) (owner: David Caro) [15:15:37] (PS1) Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/680368 [15:15:45] (CR) Kosta Harlan: [C: +2] linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/680368 (owner: Kosta Harlan) [15:19:04] (Merged) jenkins-bot: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/680368 (owner: Kosta Harlan) [15:22:45] !log urbanecm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[15:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:19] (PS8) Jbond: systemd::timer::job: update mailing script with additional options [puppet] - https://gerrit.wikimedia.org/r/679292 [15:27:11] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1016.wikimedia.org'] ` Of which those **FAILED**: ` ['clou... [15:28:14] Puppet, SRE-tools, Patch-For-Review, Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (crusnov) >>! In T247364#7008439, @MoritzMuehlenhoff wrote: > @crusnov Could you please take modules/raid/files/check-raid.py... [15:29:19] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (Ladsgroup) exim fully works. The only thing broken currently is the archiver. The apache config is being weird probably: ` WARNING 2021-04-16 15:25... [15:29:55] (PS9) Jbond: systemd::timer::job: update mailing script with additional options [puppet] - https://gerrit.wikimedia.org/r/679292 [15:29:57] (PS4) Jbond: check_cumin_aliases: ensure script exits 1 on error [puppet] - https://gerrit.wikimedia.org/r/679297 [15:30:18] (PS7) Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - https://gerrit.wikimedia.org/r/679293 [15:31:15] !log urbanecm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [15:31:15] !log urbanecm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[15:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:34] (PS8) Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - https://gerrit.wikimedia.org/r/679293 [15:33:11] (CR) Jbond: "updated thanks" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/679292 (owner: Jbond) [15:34:38] (PS1) Ppchelko: Envoy: set per_try_timeout for eventgate-main. [puppet] - https://gerrit.wikimedia.org/r/680372 (https://phabricator.wikimedia.org/T249745) [15:36:37] (CR) Ppchelko: "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) (owner: Ppchelko) [15:36:47] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1016.wikime... [15:37:47] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (Ladsgroup) ` [Fri Apr 16 15:25:54.109507 2021] [proxy_http:debug] [pid 12009:tid 140593783097088] mod_proxy_http.c(1920): [client 172.16.4.88:59876]... [15:43:21] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [15:43:31] !log urbanecm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . 
[15:43:32] !log urbanecm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [15:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:15] (PS1) Ottomata: Update eventgate-logging-external kafka brokers list [deployment-charts] - https://gerrit.wikimedia.org/r/680375 (https://phabricator.wikimedia.org/T279342) [15:51:37] (PS1) David Caro: icinga: allow clearing a downtime for a host [puppet] - https://gerrit.wikimedia.org/r/680376 [15:52:37] (PS2) Ottomata: Update eventgate-logging-external kafka brokers list in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/680375 (https://phabricator.wikimedia.org/T279342) [15:52:46] (CR) jerkins-bot: [V: -1] icinga: allow clearing a downtime for a host [puppet] - https://gerrit.wikimedia.org/r/680376 (owner: David Caro) [15:53:36] (CR) Herron: [C: +1] Update eventgate-logging-external kafka brokers list in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/680375 (https://phabricator.wikimedia.org/T279342) (owner: Ottomata) [15:54:23] (PS2) David Caro: icinga: allow clearing a downtime for a host [puppet] - https://gerrit.wikimedia.org/r/680376 [15:55:27] (PS1) Jbond: P:debmonitopr::client: add correct owner to ssl cert [puppet] - https://gerrit.wikimedia.org/r/680378 [15:56:12] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29097/console" [puppet] - https://gerrit.wikimedia.org/r/680378 (owner: Jbond) [15:56:38] (CR) Ottomata: [C: +2] Update eventgate-logging-external kafka brokers list in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/680375 (https://phabricator.wikimedia.org/T279342) (owner: Ottomata) [15:56:39] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is 
OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [15:57:00] (CR) Jbond: [V: +1 C: +2] P:debmonitopr::client: add correct owner to ssl cert [puppet] - https://gerrit.wikimedia.org/r/680378 (owner: Jbond) [15:57:28] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [15:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:37] SRE, Analytics, Event-Platform, serviceops, Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (Ottomata) [15:58:31] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:39] SRE, Traffic: Protect our users against Google-driven privacy breach via FLOC - https://phabricator.wikimedia.org/T280377 (Joe) [16:02:35] SRE, Traffic: Protect our users against Google-driven privacy breach via FLOC - https://phabricator.wikimedia.org/T280377 (Joe) Please note that while we surely won't use the js api in our base javascript, this is intended as a defensive measure for all third-party js and software we run, [16:03:44] (CR) Jcrespo: "Looking good, but let's battle test it next week (not on friday afternoon) for the edge cases (e.g. not requiring a message for removing d" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680376 (owner: David Caro) [16:05:16] (PS1) Ottomata: eventgate-logging-external - update networkpolicy with new kafka brokers [deployment-charts] - https://gerrit.wikimedia.org/r/680380 (https://phabricator.wikimedia.org/T279342) [16:05:31] (CR) Ottomata: "Ohp, this is needed too." 
[deployment-charts] - https://gerrit.wikimedia.org/r/680380 (https://phabricator.wikimedia.org/T279342) (owner: Ottomata) [16:05:58] (PS2) Ottomata: eventgate-logging-external - update networkpolicy with new kafka brokers [deployment-charts] - https://gerrit.wikimedia.org/r/680380 (https://phabricator.wikimedia.org/T279342) [16:06:16] SRE, Traffic: Protect our users against Google-driven privacy breach via FLOC - https://phabricator.wikimedia.org/T280377 (Joe) [16:07:23] SRE, Privacy Engineering, Traffic, Patch-For-Review, Privacy: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (Joe) [16:08:31] SRE, Privacy Engineering, Traffic, Patch-For-Review, Privacy: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (Joe) >>! In T279804#7003701, @ori wrote: > Is the header needed at all? > > https://github.com/WICG/floc... [16:08:35] (CR) Ottomata: [C: +2] eventgate-logging-external - update networkpolicy with new kafka brokers [deployment-charts] - https://gerrit.wikimedia.org/r/680380 (https://phabricator.wikimedia.org/T279342) (owner: Ottomata) [16:09:14] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [16:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:46] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [16:13:47] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . 
[16:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:14] (PS1) Elukey: hadoop: improve default log4j config [puppet] - https://gerrit.wikimedia.org/r/680383 (https://phabricator.wikimedia.org/T276906) [16:25:47] SRE, SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (JKatzWMF) Not sure why it didn't work, but thanks for walking through it with me yesterday. Added eyener@wikimedia.org to the domains you requested, @Pco... [16:25:48] (CR) Jbond: [C: +1] "not tested but lgtm" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680376 (owner: David Caro) [16:37:47] SRE, Analytics, Event-Platform, serviceops, Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (Ottomata) Actually, I'm not sure even just doing LVS would help here. The helmfiles networkpolicy explicitly lists IP addresses that the servic... [16:52:59] (CR) Ladsgroup: [C: -1] "It doesn't fix the issue." [puppet] - https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup) [16:55:28] SRE, Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (Schlurcher) Thanks for adding me as a subscriber, as my bot apparently caused this issue. The bot has been performing these actions with an edit rat... 
[16:56:34] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Jdforrester-WMF) [16:59:59] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [17:00:00] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [17:00:02] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [17:00:04] !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [17:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:23] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [17:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:36] !log T267927 Following data transfers complete: `wdqs1004`->`wdqs1007`, `wdqs2001`->`wdqs2003`, `wdqs1003`->`wdqs1008`, `wdqs2008`->`wdqs2004` [17:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:44] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [17:03:46] !log T267927 Pooled `wdqs1007`, `wdqs2003`, `wdqs1008`, `wdqs2004` [17:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:53] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:07:03] PROBLEM - puppet last run on wdqs2004 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:07:03] PROBLEM - puppet last run on wdqs2003 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:07:53] PROBLEM - puppet last run on wdqs1008 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:08:35] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:43] PROBLEM - puppet last run on wdqs2008 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:13:23] RECOVERY - puppet last run on wdqs2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:13:23] RECOVERY - puppet last run on wdqs2004 is OK: OK: 
Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:14:11] RECOVERY - puppet last run on wdqs1008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:15:37] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1016.wikime... [17:22:45] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:19] 10SRE, 10Privacy Engineering, 10Traffic, 10fundraising-tech-ops, and 2 others: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10Dwisehaupt) [17:23:25] RECOVERY - puppet last run on wdqs2008 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:29:43] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:17] (03CR) 10Dzahn: [C: 03+2] "just a comment" [puppet] - 10https://gerrit.wikimedia.org/r/679530 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [17:30:24] (03PS4) 10Dzahn: trafficserver: remove comment about a server that won't exist anymore [puppet] - 10https://gerrit.wikimedia.org/r/679530 
(https://phabricator.wikimedia.org/T280203) [17:31:37] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1016.wikimedia.org with reason: REIMAGE [17:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:52] 10SRE, 10Privacy Engineering, 10Traffic, 10fundraising-tech-ops, and 2 others: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10Dwisehaupt) Header added to fundraising nginx templates and deployed. ` [frack::puppet] 58ed92cf Add... [17:32:06] (03PS1) 10Dzahn: conftool/DCHP: decom mwdebug1003 [puppet] - 10https://gerrit.wikimedia.org/r/680391 [17:32:11] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:33:11] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1017.wikimedia.org with reason: REIMAGE [17:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:47] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1016.wikimedia.org with reason: REIMAGE [17:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:00] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Legoktm) >>! In T280232#7009873, @Schlurcher wrote: > Thanks for adding me as a subscriber, as my bot apparently caused this issue. The bot has been... 
[17:34:21] (PS1) Dzahn: trafficserver: remove mwdebug1003 from x-wikimedia-debug-routing [puppet] - https://gerrit.wikimedia.org/r/680393
[17:34:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mwdebug1003.eqiad.wmnet
[17:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:19] !log depooling mwdebug1003 (stretch VM, will be removed), mwdebug1001/1002 (buster) and unchanged
[17:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:46] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1017.wikimedia.org with reason: REIMAGE
[17:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:10] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (RobH) cloudcephods1018: Broadcom UNDI PXE-2.1 v21.6.0 Copyright (C) 2000-2020 Broadcom Corporation Copyright (C) 1997-2000 Inte...
[17:37:11] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1019.wikimedia.org with reason: REIMAGE
[17:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:44] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (RobH)
[17:39:15] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1019.wikimedia.org with reason: REIMAGE
[17:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:38] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1020.wikimedia.org with reason: REIMAGE
[17:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:43] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1020.wikimedia.org with reason: REIMAGE
[17:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:47] SRE, LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (KFrancis) @Manuel I am confirming receipt of the signed NDA. Please proceed with next steps. Thanks!
[17:47:47] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[17:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:15] !log T267927 Transferring from `wdqs2008`->`wdqs2003` to resolve the data corruption on `wdqs2003`
[17:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:22] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[17:49:19] SRE, LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (Dzahn) Thanks @KFrancis ! @Manuel please register a user on the https://wikitech.wikimedia.org wiki and let us know the username you picked. Then we can add you to the LDAP groups....
[17:49:58] SRE, SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (Pcoombe) @JKatzWMF Thanks. Can you add Erin to the mobile *.m.wikipedia subdomains as well? Sorry to be a pain!
[17:53:43] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (RobH)
[17:55:44] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (RobH) a:RobH→Jclark-ctr >>! In T274945#7010007, @RobH wrote: > cloudcephods1018: > > Broadcom UNDI PXE-2.1 v21.6.0 > Co...
[17:59:47] (PS1) Cwhite: logstash: limit apifeatureusage curator job to jobs_host [puppet] - https://gerrit.wikimedia.org/r/680399 (https://phabricator.wikimedia.org/T274394)
[18:01:40] (CR) Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1003/29098/" [puppet] - https://gerrit.wikimedia.org/r/680399 (https://phabricator.wikimedia.org/T274394) (owner: Cwhite)
[18:05:39] SRE, fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (Jclark-ctr)
[18:05:51] (CR) Cwhite: "I suspect there is an issue with the apifeatureusage forcemerge action. Due to all the instances erroring simultaneously for different re" [puppet] - https://gerrit.wikimedia.org/r/680399 (https://phabricator.wikimedia.org/T274394) (owner: Cwhite)
[18:07:18] (CR) Cwhite: [C: +1] swift: fix reimage race on /var/log/swift [puppet] - https://gerrit.wikimedia.org/r/680307 (owner: Filippo Giunchedi)
[18:07:39] (CR) Cwhite: [C: +1] swift: hide diffs for files with sensitive data [puppet] - https://gerrit.wikimedia.org/r/680309 (https://phabricator.wikimedia.org/T280257) (owner: Filippo Giunchedi)
[18:08:01] (CR) Cwhite: [C: +1] thanos: extend retention to 5y [puppet] - https://gerrit.wikimedia.org/r/680308 (owner: Filippo Giunchedi)
[18:09:47] (Abandoned) Cwhite: logstash: use curator cluster config when possible [puppet] - https://gerrit.wikimedia.org/r/676631 (https://phabricator.wikimedia.org/T274394) (owner: Cwhite)
[18:10:03] (Abandoned) Cwhite: logstash: set cluster name for elasticsearch outputs [puppet] - https://gerrit.wikimedia.org/r/676685 (https://phabricator.wikimedia.org/T274394) (owner: Cwhite)
[18:10:28] (Abandoned) Cwhite: logstash: use logstash output to manage ecs-test indexes [puppet] - https://gerrit.wikimedia.org/r/676687 (https://phabricator.wikimedia.org/T274394) (owner: Cwhite)
[18:10:54] (Abandoned) Cwhite: logstash: add curator config to manage w3creportingapi revision 1 indexes [puppet] - https://gerrit.wikimedia.org/r/676690 (https://phabricator.wikimedia.org/T274394) (owner: Cwhite)
[18:10:55] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (Jclark-ctr) @Cmjohnson my mistake it was C5 for 1003. netbox was correct though
[18:11:27] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[18:13:01] SRE, OTRS, Security, User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (Keegan)
[18:13:08] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[18:13:25] SRE, OTRS, Security, User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (Keegan) Open→Resolved
[18:14:30] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (Jclark-ctr)
[18:16:56] SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (Jclark-ctr)
[18:17:38] (PS21) CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855)
[18:17:39] (CR) CRusnov: dhcp: Add module for manipulating dynamic DHCP entries (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: CRusnov)
[18:21:16] SRE, SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (JKatzWMF) @Pcoombe my bad. done now.
[18:21:43] (CR) CRusnov: "I may have addressed the comments." (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: CRusnov)
[18:24:30] SRE, SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (EYener) Thanks @JKatzWMF - I'm seeing the mobile subdomains now! And thanks @Pcoombe for all your help here.
[18:27:25] (CR) jerkins-bot: [V: -1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: CRusnov)
[18:34:49] (PS22) CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855)
[18:41:05] (PS1) Ahmon Dancy: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/680405
[18:42:13] (CR) jerkins-bot: [V: -1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: CRusnov)
[18:46:54] SRE, Beta-Cluster-Infrastructure, Traffic: Rename deployment-cache-(text|upload)0x to deployment-cp0x - https://phabricator.wikimedia.org/T280393 (Krinkle) p:Triage→Low Yeah, there's no rush definitely. Just something to keep in mind for next time there's something to do around these instances.
[18:53:16] (PS2) Ahmon Dancy: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/680405
[19:00:29] (PS3) Ahmon Dancy: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/680405
[19:06:48] (PS4) Ahmon Dancy: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/680405
[19:16:30] RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[19:16:46] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:42] RECOVERY - WDQS SPARQL on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.211 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:18:42] (PS5) Ahmon Dancy: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/680405
[19:23:46] (PS6) Ahmon Dancy: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/680405
[20:40:33] !log reindexing wikidata on cloudelastic... AGAIN (T274200)
[20:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:42] T274200: Reindex English and Italian wikis to enable homoglyph plugin - https://phabricator.wikimedia.org/T274200
[20:47:04] (CR) BBlack: [C: +2] varnish: add anti-FLoC header to responses [puppet] - https://gerrit.wikimedia.org/r/679866 (https://phabricator.wikimedia.org/T279804) (owner: Dave Pifke)
[20:54:49] (PS23) CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855)
[21:09:26] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 147.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37
[21:09:34] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is CRITICAL: 138.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37
[21:09:53] SRE: Role with quote in description causes bash syntax error - https://phabricator.wikimedia.org/T276868 (razzi) Open→Resolved a:razzi This has been fixed!
[21:40:36] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 114.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37
[21:47:56] SRE, Wikimedia-Mailing-lists: Create a mailing list for ptwikinews - https://phabricator.wikimedia.org/T280408 (Peachey88)
[21:52:46] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[22:03:24] SRE, Privacy Engineering, Traffic, fundraising-tech-ops, and 2 others: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (dpifke) Open→Resolved a:dpifke
[22:04:34] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[22:09:38] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[22:41:54] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (wiki_willy) Hi @Jclark-ctr - there's a Netbox error associated with these serial numbers also. Looks like cloudcephosd1019 and...
[22:50:50] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[23:34:18] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[23:36:43] (PS1) Dzahn: DHCP: switch mw1307 to use buster installer [puppet] - https://gerrit.wikimedia.org/r/680483 (https://phabricator.wikimedia.org/T245757)
[23:36:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw[1402-1403].eqiad.wmnet with reason: reimage
[23:36:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw[1402-1403].eqiad.wmnet with reason: reimage
[23:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:11] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[23:39:15] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[23:39:41] !log reimaging last 3 remaining stretch appservers with buster, mw1307, mw1402, mw1403
[23:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:00] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[23:41:34] (CR) Dzahn: [C: +2] DHCP: switch mw1307 to use buster installer [puppet] - https://gerrit.wikimedia.org/r/680483 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[23:47:28] !log decom'ing mwdebug1003, stretch VM created in T267248
[23:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:37] T267248: create mwdebug1003 - ganeti VM with buster and appserver role - https://phabricator.wikimedia.org/T267248
[23:47:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mwdebug1003.eqiad.wmnet
[23:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:46] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mwdebug1003.eqiad.wmnet
[23:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:53] (CR) Dzahn: [C: +2] conftool/DCHP: decom mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/680391 (owner: Dzahn)
[23:49:59] (PS2) Dzahn: conftool/DCHP: decom mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/680391
[23:52:09] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[23:53:28] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn) Remaining 3 special cases kept on stretch now reimaged to buster as well. Decom'ed mwdebug1...
[23:56:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1402.eqiad.wmnet with reason: REIMAGE
[23:56:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1403.eqiad.wmnet with reason: REIMAGE
[23:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1402.eqiad.wmnet with reason: REIMAGE
[23:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mwdebug1003.eqiad.wmnet
[23:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:16] (PS2) Dzahn: trafficserver: remove mwdebug1003 from x-wikimedia-debug-routing [puppet] - https://gerrit.wikimedia.org/r/680393 (https://phabricator.wikimedia.org/T267248)