[00:06:38] <thcipriani>	 it's funny that jouncebot does the parsing, I made a thing that does the calendaring *from* json to wikitext so I don't have to generate the calendar by hand
[00:07:15] <thcipriani>	 which is to say: deployment calendar as a wikipage means we make weird things to generate it and weird things to read it
[00:09:27] <logmsgbot>	 !log jforrester@deploy1002 Started deploy [integration/docroot@63b6fb6]: Sync with CI updates (no-op)
[00:09:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:35] <logmsgbot>	 !log jforrester@deploy1002 Finished deploy [integration/docroot@63b6fb6]: Sync with CI updates (no-op) (duration: 00m 08s)
[00:09:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:32] <James_F>	 Actual automated conch management/pinging/deployment tooling would be very nice.
[00:13:15] <Krinkle>	 thcipriani: I guess around these parts we measure cyclomatic complexity by the number of times data transforms into and out of wikitext.
[00:13:33] <thcipriani>	 :D
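The JSON-to-wikitext step thcipriani describes above can be sketched roughly as follows. This is only an illustration, not his actual tool: the field names (`start`, `window`, `deployer`) and the table layout are assumptions made up for the example.

```python
import json


def render_calendar(events):
    """Render a list of event dicts as a wikitext table.

    The keys "start", "window" and "deployer" are hypothetical; the real
    calendar data may be structured quite differently.
    """
    lines = ['{| class="wikitable"', "! Start (UTC) !! Window !! Deployer"]
    # Sort by start time so rows appear in calendar order.
    for ev in sorted(events, key=lambda e: e["start"]):
        lines.append("|-")
        lines.append(f'| {ev["start"]} || {ev["window"]} || {ev["deployer"]}')
    lines.append("|}")
    return "\n".join(lines)


if __name__ == "__main__":
    raw = '[{"start": "2021-04-16T07:00", "window": "No deploys all day", "deployer": "n/a"}]'
    print(render_calendar(json.loads(raw)))
```

Reading the calendar back (as jouncebot does) then means parsing this wikitext again, which is the round trip Krinkle is joking about.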
[00:16:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:23:00] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:26:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1018.wikimedia.org', 'cloudcephosd1016.wikimedia.org', 'clo...
[00:26:55] <wikibugs>	 (PS1) BryanDavis: Reparse deploy page before announcing an event [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/680033 (https://phabricator.wikimedia.org/T243394)
[00:30:57] <Krinkle>	 !log Delete old data at doc1001:/srv/doc/cover/PasswordBlacklist (ref T254799)
[00:31:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:06] <stashbot>	 T254799: Rename wikimedia/password-blacklist library - https://phabricator.wikimedia.org/T254799
[00:32:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:33:40] <DannyS712>	 Krinkle thanks
[00:38:02] <DannyS712>	 it's gone from https://doc.wikimedia.org/cover/ but still exists at https://doc.wikimedia.org/cover/mediawiki-libs-PasswordBlacklist/
[00:38:48] <Krinkle>	 DannyS712: cdn cache 
[00:38:54] <Krinkle>	 will expire on its own
[00:38:59] <DannyS712>	 okay
[00:39:03] <Krinkle>	 https://doc.wikimedia.org/cover/mediawiki-libs-PasswordBlacklist/?_not_here
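Krinkle's link above adds a throwaway query parameter, presumably so the request misses the CDN cache and shows the current state of the origin. A minimal sketch of building such a URL, assuming the cache keys on the full URL including the query string; the helper name `with_cache_buster` is made up for this example:

```python
from urllib.parse import urlsplit, urlunsplit


def with_cache_buster(url, token="_not_here"):
    """Append a dummy query token so the URL differs from the cached one."""
    parts = urlsplit(url)
    # Preserve any existing query string and add the token after it.
    query = f"{parts.query}&{token}" if parts.query else token
    return urlunsplit(parts._replace(query=query))


if __name__ == "__main__":
    print(with_cache_buster("https://doc.wikimedia.org/cover/mediawiki-libs-PasswordBlacklist/"))
```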
[00:39:36] <DannyS712>	 the main page also now lists CLDRPluralRuleParser, which I don't think was there earlier - going to go write some tests for that...
[00:41:12] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.109 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[00:41:44] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[00:42:26] <Krinkle>	 "Generated … Fri Mar 26  2021"
[00:42:42] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:43] <DannyS712>	 in that case, I guess I just missed it :)
[00:43:18] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:40] <wikibugs>	 (PS1) Bstorm: gridengine: set additional grid-configurator source files to new domain [puppet] - https://gerrit.wikimedia.org/r/680038 (https://phabricator.wikimedia.org/T277653)
[00:45:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:50:56] <wikibugs>	 (PS2) Bstorm: gridengine: set additional grid-configurator source files to new domain [puppet] - https://gerrit.wikimedia.org/r/680038 (https://phabricator.wikimedia.org/T277653)
[00:53:32] <wikibugs>	 (CR) Bstorm: [C: +2] gridengine: set additional grid-configurator source files to new domain [puppet] - https://gerrit.wikimedia.org/r/680038 (https://phabricator.wikimedia.org/T277653) (owner: Bstorm)
[01:06:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:17:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:40:14] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[01:47:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:52:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:05:00] <icinga-wm>	 PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:42:55] <wikibugs>	 (PS1) Ryan Kemper: wdqs: int can't take in float as string [cookbooks] - https://gerrit.wikimedia.org/r/680095 (https://phabricator.wikimedia.org/T280108)
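The patch title above refers to a Python behaviour: `int()` rejects a string such as "3.0" with a ValueError, even though the value is a whole number. A minimal sketch of one way around it follows; it is not the content of the Gerrit change, and the helper name `parse_int` is hypothetical.

```python
def parse_int(value):
    """Convert a numeric string such as "3.0" to an int.

    int("3.0") raises ValueError, so go through float first and
    refuse values that are not whole numbers.
    """
    number = float(value)
    if not number.is_integer():
        raise ValueError(f"not a whole number: {value!r}")
    return int(number)


if __name__ == "__main__":
    print(parse_int("3.0"))
    print(parse_int("42"))
```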
[02:47:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:53:10] <wikibugs>	 (PS1) Andrew Bogott: Update trove.conf as per Ussuri release notes [puppet] - https://gerrit.wikimedia.org/r/680102 (https://phabricator.wikimedia.org/T212595)
[02:54:16] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Update trove.conf as per Ussuri release notes [puppet] - https://gerrit.wikimedia.org/r/680102 (https://phabricator.wikimedia.org/T212595) (owner: Andrew Bogott)
[02:54:59] <wikibugs>	 (CR) Krinkle: [C: +2] [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677733 (owner: Krinkle)
[02:55:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677733 (owner: Krinkle)
[02:55:50] <wikibugs>	 (PS4) Krinkle: [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677733
[02:56:01] <wikibugs>	 (CR) Krinkle: [C: +2] [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677733 (owner: Krinkle)
[02:56:38] <wikibugs>	 (PS4) Krinkle: mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677734
[02:56:54] <wikibugs>	 (Merged) jenkins-bot: [Beta Cluster] mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - https://gerrit.wikimedia.org/r/677733 (owner: Krinkle)
[03:04:24] <logmsgbot>	 !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[03:04:25] <logmsgbot>	 !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[03:04:29] <logmsgbot>	 !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[03:04:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:21] <ryankemper>	 !log T267927 Last round of `data-transfer`s finished successfully, proceeding to next round
[03:05:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:29] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[03:09:21] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[03:09:22] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[03:09:24] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[03:09:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:42] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[03:09:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:50] <ryankemper>	 !log T267927 kicked off next round of `data-transfer`s: `wdqs1004`->`wdqs1007`, `wdqs2001`->`wdqs2003`, `wdqs1003`->`wdqs1008`, `wdqs2008`->`wdqs2004`
[03:09:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:00] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:17:36] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1013.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[03:18:00] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1013.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[03:18:06] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:20:58] <ryankemper>	 ^ My mistake, forgot the cookbook in its current state is not repooling
[03:22:22] <ryankemper>	 !log T267927 Pooled `wdqs1006` and `wdqs2002`
[03:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:22:31] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[03:22:42] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:24:22] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[03:24:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[03:25:39] <ryankemper>	 Looks like `wdqs1013` was down for a bit (unsure) and the fact that I hadn't yet repooled `wdqs1006` meant pybal did not automatically de-pool it because we already had 2 of the 6 hosts down for the `data-transfer` before accounting for `wdqs1013` briefly dropping offline
[03:25:49] <ryankemper>	 s/(unsure)/(unsure why)/
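ryankemper's explanation above amounts to a depool-threshold check: the load balancer will not depool a failing host if that would leave too few hosts pooled. A rough sketch of that reasoning follows. The threshold value of 0.6 and the function `can_depool` are assumptions for illustration only; this is not PyBal's actual code or configuration.

```python
import math


def can_depool(total_hosts, already_down, depool_threshold):
    """Return True if one more host may be depooled.

    depool_threshold is the fraction of all hosts that must stay pooled
    (hypothetical value; the real setting is not given in the log).
    """
    must_stay_pooled = math.ceil(total_hosts * depool_threshold)
    pooled_after = total_hosts - already_down - 1
    return pooled_after >= must_stay_pooled


if __name__ == "__main__":
    # Six hosts, two already out for the data transfer: a third failing
    # host stays pooled even though it is marked down.
    print(can_depool(6, 2, 0.6))
    # With no hosts out, the same failure could be depooled automatically.
    print(can_depool(6, 0, 0.6))
```

Under these assumed numbers the first call returns False and the second True, which matches the "marked down but pooled" alert for `wdqs1013` earlier in the log.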
[03:26:41] <ryankemper>	 !log T267927 Pooled `wdqs2001`
[03:26:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:30:04] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:31:50] <ryankemper>	 !log [wdqs] `ryankemper@wdqs1013:~$ sudo systemctl restart wdqs-blazegraph`
[03:31:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:34:38] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:05:20] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 19255 bytes in 4.951 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:05:36] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 19254 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:05:40] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 19253 bytes in 0.213 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:33:24] <wikibugs>	 SRE, Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (Ladsgroup) There was one thing I missed in my comment on autoloader and Joe pointed out in IRC that I did the test on mwdebug (where you can do xhgui...
[05:17:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:25:27] <wikibugs>	 (PS1) Legoktm: [WIP] lists: Notify when exclude_backups list is out of sync [puppet] - https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237)
[05:26:36] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WIP] lists: Notify when exclude_backups list is out of sync [puppet] - https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: Legoktm)
[05:27:19] <wikibugs>	 (PS2) Legoktm: [WIP] lists: Notify when exclude_backups list is out of sync [puppet] - https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237)
[05:30:43] <wikibugs>	 (PS1) Marostegui: install_server: Reimage db2094,db2095 to buster [puppet] - https://gerrit.wikimedia.org/r/680162 (https://phabricator.wikimedia.org/T275112)
[05:31:02] <wikibugs>	 ops-eqiad, Analytics: an-worker1100 disk swap required - https://phabricator.wikimedia.org/T280313 (elukey)
[05:32:25] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Reimage db2094,db2095 to buster [puppet] - https://gerrit.wikimedia.org/r/680162 (https://phabricator.wikimedia.org/T275112) (owner: Marostegui)
[05:35:11] <wikibugs>	 ops-eqiad, Analytics: an-worker1100 disk swap required - https://phabricator.wikimedia.org/T280313 (elukey)
[05:35:13] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (elukey)
[05:36:04] <wikibugs>	 (PS3) Legoktm: [WIP] lists: Notify when exclude_backups list is out of sync [puppet] - https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237)
[05:36:47] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29090/console" [puppet] - https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: Legoktm)
[05:37:25] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on an-worker1100 - https://phabricator.wikimedia.org/T280132 (elukey) @razzi   ` elukey@an-worker1100:~$ cat /proc/mounts  | grep /var/lib/hadoop/data /dev/sdx1 /var/lib/hadoop/data/w ext4 rw,relatime 0 0  /dev/sdl1 /var/lib/hadoop/data/k ext4 ro,relatime 0 0   <============...
[05:37:38] <wikibugs>	 (PS4) Legoktm: lists: Notify when exclude_backups list is out of sync [puppet] - https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237)
[05:38:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-labs site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:42:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission for hosts analytics-tool1001.eqiad.wmnet
[05:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:05] <elukey>	 last vm using cloudera CDH packages --^
[05:43:46] <elukey>	 I am going to send another patch to clean up our repos very soon
[05:45:54] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] Decommission analytics-tool1001 [puppet] - https://gerrit.wikimedia.org/r/679892 (https://phabricator.wikimedia.org/T280262) (owner: Elukey)
[05:45:59] <wikibugs>	 (PS3) Elukey: Decommission analytics-tool1001 [puppet] - https://gerrit.wikimedia.org/r/679892 (https://phabricator.wikimedia.org/T280262)
[05:48:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2094.codfw.wmnet with reason: REIMAGE
[05:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:32] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2094.codfw.wmnet with reason: REIMAGE
[05:50:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:55] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:40] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics-tool1001.eqiad.wmnet
[05:54:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:39] <wikibugs>	 (PS1) Marostegui: db2094: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/680166 (https://phabricator.wikimedia.org/T275112)
[06:05:16] <wikibugs>	 (CR) Marostegui: [C: +2] db2094: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/680166 (https://phabricator.wikimedia.org/T275112) (owner: Marostegui)
[06:06:45] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[06:09:10] <wikibugs>	 (PS1) Elukey: Remove cloudera-related packages [puppet] - https://gerrit.wikimedia.org/r/680167 (https://phabricator.wikimedia.org/T280262)
[06:09:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-labs site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:09:20] <elukey>	 moritzm: --^ the day has finally come! :D
[06:10:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] Remove cloudera-related packages [puppet] - https://gerrit.wikimedia.org/r/680167 (https://phabricator.wikimedia.org/T280262) (owner: Elukey)
[06:12:11] <wikibugs>	 (PS2) Elukey: Remove cloudera-related packages [puppet] - https://gerrit.wikimedia.org/r/680167 (https://phabricator.wikimedia.org/T280262)
[06:13:27] <wikibugs>	 (CR) Legoktm: [C: +1] Fix error message if MWScript.php is run without arguments [mediawiki-config] - https://gerrit.wikimedia.org/r/679517 (owner: Ahmon Dancy)
[06:16:27] <wikibugs>	 (CR) Legoktm: lists: Add option to enable mailman3 on lists (4 comments) [puppet] - https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup)
[06:16:36] <wikibugs>	 (PS5) Legoktm: lists: Add option to enable mailman3 on lists [puppet] - https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup)
[06:17:46] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29091/console" [puppet] - https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup)
[06:19:15] <wikibugs>	 (CR) Legoktm: [V: +1 C: +2] lists: Add option to enable mailman3 on lists [puppet] - https://gerrit.wikimedia.org/r/678300 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup)
[06:19:58] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2095.codfw.wmnet with reason: REIMAGE
[06:20:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:35] <wikibugs>	 (CR) Elukey: "To keep archives happy - this can be done overriding the hadoop.log.dir var (with -D etc..), probably via the hadoop-env.sh file." [puppet] - https://gerrit.wikimedia.org/r/661391 (https://phabricator.wikimedia.org/T265126) (owner: Razzi)
[06:20:55] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1028.eqiad.wmnet
[06:21:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:22] <wikibugs>	 (PS1) Marostegui: db2095: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/680172 (https://phabricator.wikimedia.org/T275112)
[06:22:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2095.codfw.wmnet with reason: REIMAGE
[06:23:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:07] <wikibugs>	 (CR) Marostegui: [C: +2] db2095: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/680172 (https://phabricator.wikimedia.org/T275112) (owner: Marostegui)
[06:26:05] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29092/console" [puppet] - https://gerrit.wikimedia.org/r/676508 (owner: DannyS712)
[06:27:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-labs site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:27:46] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1028.eqiad.wmnet
[06:27:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:59] <wikibugs>	 SRE, SRE-tools, IPv6, User-jbond: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136 (elukey) All good from the kafka-main2001 side! We can enable it everywhere
[06:39:32] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1029.eqiad.wmnet
[06:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:17] <wikibugs>	 (PS1) Elukey: role::analytics_cluster::hadoop::standby: move hadoop dirs to /srv [puppet] - https://gerrit.wikimedia.org/r/680179 (https://phabricator.wikimedia.org/T265126)
[06:46:02] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29094/console" [puppet] - https://gerrit.wikimedia.org/r/680179 (https://phabricator.wikimedia.org/T265126) (owner: Elukey)
[06:47:01] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:48:04] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1029.eqiad.wmnet
[06:48:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:19] <logmsgbot>	 !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1030.eqiad.wmnet
[06:52:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:49] <logmsgbot>	 !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1030.eqiad.wmnet
[06:58:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210416T0700)
[07:07:17] <wikibugs>	 (CR) Gehel: [C: -1] "See comment inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/680095 (https://phabricator.wikimedia.org/T280108) (owner: Ryan Kemper)
[07:13:33] <wikibugs>	 (PS3) Amire80: Add default import sources [mediawiki-config] - https://gerrit.wikimedia.org/r/676930 (https://phabricator.wikimedia.org/T214139)
[07:19:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P15372 and previous config saved to /var/cache/conftool/dbconfig/20210416-071936-marostegui.json
[07:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:32] <wikibugs>	 (CR) Gehel: postgres: use remote script on replica to resync (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: Hnowlan)
[07:21:45] <wikibugs>	 (Abandoned) Gehel: Migrate wcqs to wcqs-beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/615810 (owner: ZPapierski)
[07:21:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for lexnasser [puppet] - https://gerrit.wikimedia.org/r/679854 (owner: Muehlenhoff)
[07:27:44] <wikibugs>	 (PS1) Ema: cache: enable exp caching policy on upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/680197 (https://phabricator.wikimedia.org/T275809)
[07:30:46] <wikibugs>	 (PS2) Ema: cache: enable exp caching policy on upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/680197 (https://phabricator.wikimedia.org/T275809)
[07:32:04] <wikibugs>	 (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/680197 (https://phabricator.wikimedia.org/T275809) (owner: Ema)
[07:38:33] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:39:04] <wikibugs>	 (CR) Ema: [C: +2] cache: enable exp caching policy on upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/680197 (https://phabricator.wikimedia.org/T275809) (owner: Ema)
[07:41:09] <ema>	 !log cp-upload_ulsfo: rolling varnish-frontend-restart to apply exp policy settings changes starting from empty caches T275809
[07:41:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:20] <stashbot>	 T275809: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809
[07:42:53] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "https://www.youtube.com/watch?v=aCbfMkh940Q" [puppet] - https://gerrit.wikimedia.org/r/680167 (https://phabricator.wikimedia.org/T280262) (owner: Elukey)
[07:44:19] <ema>	 moritzm: nuke it! :D
[07:48:04] <wikibugs>	 (CR) Elukey: [C: +2] Remove cloudera-related packages [puppet] - https://gerrit.wikimedia.org/r/680167 (https://phabricator.wikimedia.org/T280262) (owner: Elukey)
[07:49:07] <elukey>	 ema: it will be a good moment after https://www.cloudera.com/downloads/paywall-expansion.html
[07:52:11] <wikibugs>	 (CR) Ema: [C: +1] trafficserver: remove comment about a server that won't exist anymore [puppet] - https://gerrit.wikimedia.org/r/679530 (https://phabricator.wikimedia.org/T280203) (owner: Dzahn)
[07:53:16] <elukey>	 !log run reprepro --delete clearvanished on apt1001 to clear all cloudera packages 
[07:53:22] <elukey>	 moritzm: it is done! \o/
[07:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:48] <moritzm>	 very nice :-)
[08:01:21] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] flaggedrevs.php: Use MediaWikiServices, not an extension function (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 (owner: 10Gergő Tisza)
[08:15:44] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "I suggested to do something like this for maintenance, but rlazarus mentioned it shouldn't be needed due to mailman2 being about to be dec" [puppet] - 10https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: 10Legoktm)
[08:22:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add hmonroy to deployment [puppet] - 10https://gerrit.wikimedia.org/r/679811 (https://phabricator.wikimedia.org/T280177) (owner: 10Filippo Giunchedi)
[08:22:30] <wikibugs>	 (03PS2) 10Filippo Giunchedi: admin: add hmonroy to deployment [puppet] - 10https://gerrit.wikimedia.org/r/679811 (https://phabricator.wikimedia.org/T280177)
[08:22:36] <wikibugs>	 (03PS1) 10Majavah: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299)
[08:23:13] <wikibugs>	 (03PS2) 10Majavah: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-17 [puppet] - 10https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299)
[08:24:52] <wikibugs>	 SRE, Release-Engineering-Team, SRE-Access-Requests, Patch-For-Review: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (fgiunchedi)
[08:26:17] <wikibugs>	 (PS3) Majavah: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-18 [puppet] - https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299)
[08:26:37] <wikibugs>	 SRE, Release-Engineering-Team, SRE-Access-Requests, Patch-For-Review: Requesting deployment access for HMonroy - https://phabricator.wikimedia.org/T280177 (fgiunchedi) Open→Resolved @HMonroy you are now a member of `deployment` group! Resolving task, please reopen if something is amiss.
[08:28:48] <wikibugs>	 (CR) Gergő Tisza: "Assuming no train holdups, the extension patch lands on group2 on April 29, so this should be deployed May 3-ish." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/679938 (owner: Gergő Tisza)
[08:31:23] <wikibugs>	 (CR) Daniel Kinzler: flaggedrevs.php: Use MediaWikiServices, not an extension function (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/679938 (owner: Gergő Tisza)
[08:31:45] <wikibugs>	 (PS2) Filippo Giunchedi: admin: add awight to graphite-admins [puppet] - https://gerrit.wikimedia.org/r/679747 (https://phabricator.wikimedia.org/T280242) (owner: Awight)
[08:33:10] <wikibugs>	 (PS1) David Caro: wmcs: Add link to runbook on puppet alerts. [puppet] - https://gerrit.wikimedia.org/r/680254
[08:33:25] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] role::analytics_cluster::hadoop::standby: move hadoop dirs to /srv [puppet] - https://gerrit.wikimedia.org/r/680179 (https://phabricator.wikimedia.org/T265126) (owner: Elukey)
[08:34:17] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] admin: add awight to graphite-admins [puppet] - https://gerrit.wikimedia.org/r/679747 (https://phabricator.wikimedia.org/T280242) (owner: Awight)
[08:34:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15373 and previous config saved to /var/cache/conftool/dbconfig/20210416-083431-root.json
[08:34:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:47] <wikibugs>	 SRE, SRE-Access-Requests, observability, Graphite, Patch-For-Review: Requesting access to graphite hosts for awight - https://phabricator.wikimedia.org/T280242 (fgiunchedi)
[08:37:21] <wikibugs>	 SRE, SRE-Access-Requests, observability, Graphite, Patch-For-Review: Requesting access to graphite hosts for awight - https://phabricator.wikimedia.org/T280242 (fgiunchedi) Open→Resolved a:fgiunchedi This is implemented now! @awight I've expanded a little https://wikitech.wikimedi...
[08:38:36] <wikibugs>	 (PS1) Jcrespo: backups: move the exclude backups list for mailman to hiera [puppet] - https://gerrit.wikimedia.org/r/680255 (https://phabricator.wikimedia.org/T279237)
[08:39:10] <wikibugs>	 SRE, LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (fgiunchedi) p:Triage→Medium
[08:40:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:43:52] <wikibugs>	 (CR) Jcrespo: "This is a noop: https://puppet-compiler.wmflabs.org/compiler1001/29095/backup1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/680255 (https://phabricator.wikimedia.org/T279237) (owner: Jcrespo)
[08:44:12] <wikibugs>	 (CR) Jcrespo: [C: +2] backups: move the exclude backups list for mailman to hiera [puppet] - https://gerrit.wikimedia.org/r/680255 (https://phabricator.wikimedia.org/T279237) (owner: Jcrespo)
[08:44:37] <wikibugs>	 SRE, netops: Allow bast1003 in management routers (and drop bast1002) - https://phabricator.wikimedia.org/T280253 (fgiunchedi) p:Triage→Medium
[08:44:58] <wikibugs>	 SRE, Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (fgiunchedi) p:Triage→High
[08:45:07] <wikibugs>	 SRE, Packaging: Package php-ast in {stretch,buster}-wikimedia/component - https://phabricator.wikimedia.org/T280210 (fgiunchedi) p:Triage→Medium
[08:45:22] <wikibugs>	 SRE, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13  (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (fgiunchedi) p:Triage→Medium
[08:45:46] <wikibugs>	 SRE, Privacy Engineering, Traffic, Patch-For-Review, Privacy: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (fgiunchedi) p:Triage→Medium
[08:47:25] <wikibugs>	 (CR) Jcrespo: "So I've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/680255 which does the move of the list to hiera. You are now ready to " [puppet] - https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: Legoktm)
[08:49:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15374 and previous config saved to /var/cache/conftool/dbconfig/20210416-084935-root.json
[08:49:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:02] <wikibugs>	 (PS1) Filippo Giunchedi: install_server: move thanos-fe to raid0 for /srv [puppet] - https://gerrit.wikimedia.org/r/680257 (https://phabricator.wikimedia.org/T280257)
[08:52:09] <wikibugs>	 (PS1) Elukey: role::analytics_cluster::hadoop::master: move hadoop dirs under /srv [puppet] - https://gerrit.wikimedia.org/r/680259 (https://phabricator.wikimedia.org/T265126)
[08:54:34] <wikibugs>	 SRE, Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (MoritzMuehlenhoff)
[08:55:22] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29096/console" [puppet] - https://gerrit.wikimedia.org/r/680259 (https://phabricator.wikimedia.org/T265126) (owner: Elukey)
[08:57:02] <awight>	 godog: Thanks, I'm able to log in!  If you have one more moment to bump https://gerrit.wikimedia.org/r/c/679390 , that would unblock me on the next steps of actually deleting the metrics...
[08:57:19] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Update rewrite rule for https://www.mediawiki.org/FAQ [puppet] - https://gerrit.wikimedia.org/r/676508 (owner: DannyS712)
[08:57:41] <wikibugs>	 (CR) Jcrespo: "BTW, I asked rlazarus if there was a pattern on the url, and he didn't say yes-not sure because unsure or because there is really not a pa" [puppet] - https://gerrit.wikimedia.org/r/680158 (https://phabricator.wikimedia.org/T279237) (owner: Legoktm)
[08:57:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:57:59] <godog>	 awight: sure, I'll merge that
[08:58:10] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Temporarily disable some reportupdater jobs [puppet] - https://gerrit.wikimedia.org/r/679390 (https://phabricator.wikimedia.org/T279046) (owner: Awight)
[08:59:46] <elukey>	 awight: if you need any help with RU etc.. lemme know!
[09:00:39] <awight>	 :-D Thanks godog and elukey for the help, and on a Friday even.
[09:01:28] <awight>	 elukey: If it's an easy question to answer, do you know if reportupdater is quiescent at the moment?  Then I can trust that my jobs are really disabled...
[09:02:39] <awight>	 Ordinarily I would just wait a day, but in this case I'd like to keep momentum, there is a looming data retention deadline of May 1st...
[09:04:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15380 and previous config saved to /var/cache/conftool/dbconfig/20210416-090438-root.json
[09:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:20] <elukey>	 awight: let's move to #analytics :)
[09:05:37] <wikibugs>	 SRE, serviceops, Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (JMeybohm) a:JMeybohm etcd cluster is set up now on conf200[4,5,6] although I had some trouble setting it up and I do not yet know why:  After the initial puppet runs, the etcd's rejec...
[09:06:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (ayounsi) @Cmjohnson asw2-a-eqiad and asw2-b-eqiad have outstanding changes, please make sure to commit them.  `lang=diff Changes for 1 device...
[09:09:29] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:09:29] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.193 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:10:57] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:25] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[09:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:39] <wikibugs>	 SRE, serviceops, Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (jcrespo) Do you think, with the work done, we could drop support of jessie bacula backups (only etcd cluster was pending with jessie)?
[09:12:48] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup)
[09:13:30] <jayme>	 !log imported envoyproxy_1.15.4-1 to buster-wikimedia - T280317
[09:13:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:01] <wikibugs>	 (PS2) Jcrespo: bacula: Revert TLS 1.0 downgrade on storage servers (including director) [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182)
[09:14:26] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup)
[09:15:03] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup)
[09:16:13] <wikibugs>	 SRE, serviceops, Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (JMeybohm) >>! In T271573#7008274, @jcrespo wrote: > Do you think, with the work done, we could drop support of jessie bacula backups (only etcd cluster was pending with jessie)?  If those...
[09:17:25] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:17:52] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] flaggedrevs.php: Use MediaWikiServices, not an extension function (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/679938 (owner: Gergő Tisza)
[09:18:19] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service Gehel failed data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:19] <icinga-wm>	 ACKNOWLEDGEMENT - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.001 second response time Gehel failed data transfer - https://phabricator.wikimedia.org/T267927 https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:18:19] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS SPARQL on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 683 bytes in 1.181 second response time Gehel failed data transfer - https://phabricator.wikimedia.org/T267927 https://wikit
[09:18:20] <icinga-wm>	 /wiki/Wikidata_query_service/Runbook
[09:18:48] <wikibugs>	 SRE, SRE-Access-Requests, observability, Graphite: Requesting access to graphite hosts for awight - https://phabricator.wikimedia.org/T280242 (awight) @fgiunchedi Thanks, I was able to start making these deletions :-)
[09:19:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15383 and previous config saved to /var/cache/conftool/dbconfig/20210416-091942-root.json
[09:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:16] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2009.codfw.wmnet
[09:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:24:50] <wikibugs>	 (PS2) Filippo Giunchedi: install_server: move thanos-fe to raid0 for /srv [puppet] - https://gerrit.wikimedia.org/r/680257 (https://phabricator.wikimedia.org/T280257)
[09:25:17] <godog>	 seeking an easy +1 for ^
[09:27:51] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2009.codfw.wmnet
[09:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:37] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:30:12] <moritzm>	 godog: having a look in a bit
[09:30:54] <godog>	 moritzm: thank you! appreciate it
[09:31:58] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/680257 (https://phabricator.wikimedia.org/T280257) (owner: Filippo Giunchedi)
[09:33:02] <wikibugs>	 (PS1) Awight: [DNM] Revert "Temporarily disable some reportupdater jobs" [puppet] - https://gerrit.wikimedia.org/r/680021 (https://phabricator.wikimedia.org/T279046)
[09:33:33] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2010.codfw.wmnet
[09:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:41] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] install_server: move thanos-fe to raid0 for /srv [puppet] - https://gerrit.wikimedia.org/r/680257 (https://phabricator.wikimedia.org/T280257) (owner: Filippo Giunchedi)
[09:34:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15384 and previous config saved to /var/cache/conftool/dbconfig/20210416-093446-root.json
[09:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:04] <wikibugs>	 (PS1) JMeybohm: New envoy upstream version 1.15.4 [docker-images/production-images] - https://gerrit.wikimedia.org/r/680265 (https://phabricator.wikimedia.org/T280317)
[09:37:38] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] New envoy upstream version 1.15.4 [docker-images/production-images] - https://gerrit.wikimedia.org/r/680265 (https://phabricator.wikimedia.org/T280317) (owner: JMeybohm)
[09:39:30] <wikibugs>	 (PS1) David Caro: wmsc: add role to the hiera hierarchy [puppet] - https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324)
[09:40:04] <wikibugs>	 (CR) David Caro: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: David Caro)
[09:40:56] <wikibugs>	 (PS2) David Caro: wmsc: add role to the hiera hierarchy [puppet] - https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324)
[09:40:59] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2010.codfw.wmnet
[09:40:59] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove Python 2 packages on Bullseye [puppet] - https://gerrit.wikimedia.org/r/679768 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff)
[09:41:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:47] <wikibugs>	 (PS3) David Caro: wmsc: add role to the hiera hierarchy [puppet] - https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324)
[09:43:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=etcd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:47:11] <icinga-wm>	 RECOVERY - Thanos compact has not run on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:47:27] <icinga-wm>	 RECOVERY - Thanos compact is halted on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[09:52:50] <icinga-wm>	 PROBLEM - Thanos compact has disappeared from Prometheus discovery on alert1001 is CRITICAL: 1 ge 1 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview
[09:55:15] <jayme>	 !log imported envoyproxy_1.15.4-1 to stretch-wikimedia - T280317
[09:55:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:40] <godog>	 the thanos alerts are expected
[09:57:59] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2011.codfw.wmnet
[09:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:13] <wikibugs>	 (CR) Jbond: "I think this is a great improvement and would love to see it merged.  However I'm cautious as I suspect there will be lots of local projec" [puppet] - https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: David Caro)
[09:59:51] <wikibugs>	 (CR) Jbond: "> Patch Set 3:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: David Caro)
[10:00:18] <jayme>	 !log updated envoyproxy to 1.15.4-1 on mwdebug1001.eqiad.wmnet
[10:00:21] <wikibugs>	 (CR) Jbond: "> FYI" [puppet] - https://gerrit.wikimedia.org/r/680266 (https://phabricator.wikimedia.org/T280324) (owner: David Caro)
[10:00:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:02] <wikibugs>	 Puppet, SRE-tools, Patch-For-Review, Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (MoritzMuehlenhoff) @crusnov Could you please take modules/raid/files/check-raid.py with precedence? It's part of a Bullseye b...
[10:03:11] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: REIMAGE
[10:03:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:53] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2011.codfw.wmnet
[10:05:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:19] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2001.codfw.wmnet with reason: REIMAGE
[10:05:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:06] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342)
[10:08:13] <jayme>	 !log updated envoyproxy to 1.15.4-1 on mw1325.eqiad.wmnet,restbase1026.eqiad.wmnet
[10:08:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:33] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2012.codfw.wmnet
[10:08:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:18] <wikibugs>	 (PS2) Ayounsi: Merge all system.conf templates in one [homer/public] - https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345)
[10:10:20] <wikibugs>	 (PS1) Ayounsi: Remove dump-on-panic [homer/public] - https://gerrit.wikimedia.org/r/680276 (https://phabricator.wikimedia.org/T269345)
[10:11:44] <wikibugs>	 (CR) Ayounsi: [C: -1] cr/firewall: add kafka-logging servers to labs-in filters (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: Arturo Borrero Gonzalez)
[10:12:22] <wikibugs>	 (CR) Alexandros Kosiaris: "The change is fine, but it's a noop. The thing changed here is the fixtures for CI. What you want to change is https://gerrit.wikimedia.or" [deployment-charts] - https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) (owner: Ppchelko)
[10:12:46] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Envoy: set per_try_timeout for eventgate-main. [deployment-charts] - https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) (owner: Ppchelko)
[10:13:12] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342)
[10:13:25] <wikibugs>	 (CR) Ayounsi: [C: +2] Remove dump-on-panic [homer/public] - https://gerrit.wikimedia.org/r/680276 (https://phabricator.wikimedia.org/T269345) (owner: Ayounsi)
[10:13:39] <wikibugs>	 (CR) Arturo Borrero Gonzalez: cr/firewall: add kafka-logging servers to labs-in filters (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: Arturo Borrero Gonzalez)
[10:14:07] <wikibugs>	 (Merged) jenkins-bot: Remove dump-on-panic [homer/public] - https://gerrit.wikimedia.org/r/680276 (https://phabricator.wikimedia.org/T269345) (owner: Ayounsi)
[10:15:54] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2012.codfw.wmnet
[10:16:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:33] <wikibugs>	 (PS1) JMeybohm: citoid: Update envoy image to 1.15.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/680280 (https://phabricator.wikimedia.org/T280317)
[10:19:48] <icinga-wm>	 PROBLEM - Thanos store has high percentage of object storage failures on alert1001 is CRITICAL: job=thanos-compact https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store
[10:20:47] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2013.codfw.wmnet
[10:20:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:04] <icinga-wm>	 RECOVERY - Thanos store has high percentage of object storage failures on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store
[10:21:08] <icinga-wm>	 PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.496e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[10:27:24] <wikibugs>	 (CR) Arturo Borrero Gonzalez: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-18 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299) (owner: Majavah)
[10:28:11] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2013.codfw.wmnet
[10:28:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:29] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2014.codfw.wmnet
[10:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:06] <icinga-wm>	 RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01438 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact
[10:36:12] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] citoid: Update envoy image to 1.15.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/680280 (https://phabricator.wikimedia.org/T280317) (owner: JMeybohm)
[10:36:28] <icinga-wm>	 RECOVERY - Thanos compact has disappeared from Prometheus discovery on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview
[10:37:30] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] wmcs: Add link to runbook on puppet alerts. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680254 (owner: David Caro)
[10:37:54] <wikibugs>	 (PS1) Majavah: aptrepo: Remove unused thirdparty/kubeadm-k8s-1-1[56] [puppet] - https://gerrit.wikimedia.org/r/680285
[10:38:25] <wikibugs>	 (CR) Majavah: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-18 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299) (owner: Majavah)
[10:38:28] <wikibugs>	 (Merged) jenkins-bot: citoid: Update envoy image to 1.15.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/680280 (https://phabricator.wikimedia.org/T280317) (owner: JMeybohm)
[10:39:06] <wikibugs>	 (CR) Vgutierrez: [C: +1] "From https://packages.debian.org/buster/bacula-common it looks like bacula uses OpenSSL 1.1 so I'm wondering if we shouldn't bump the mini" [puppet] - https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: Jcrespo)
[10:39:08] <wikibugs>	 (CR) Ayounsi: [C: +1] cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: Arturo Borrero Gonzalez)
[10:39:50] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2014.codfw.wmnet
[10:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:11] <wikibugs>	 SRE: Traceback in icinga-status  'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (jbond) Open→Resolved
[10:40:15] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: Arturo Borrero Gonzalez)
[10:40:38] <wikibugs>	 Puppet, SRE, puppet-compiler, User-jbond: in puppet 6 some core types have been moved to external modules.  check and confirm our exposure - https://phabricator.wikimedia.org/T265143 (jbond) p:Triage→Medium
[10:41:02] <wikibugs>	 (Merged) jenkins-bot: cr/firewall: add kafka-logging servers to labs-in filters [homer/public] - https://gerrit.wikimedia.org/r/679862 (https://phabricator.wikimedia.org/T279342) (owner: Arturo Borrero Gonzalez)
[10:42:46] <wikibugs>	 Puppet, SRE, puppet-compiler, User-jbond: in puppet 6 some core types have been moved to external modules.  check and confirm our exposure - https://phabricator.wikimedia.org/T265143 (jbond) puppet6 is not going to make it to bullseye so this issue is less urgent for now, that said it looks like...
[10:43:13] <wikibugs>	 Puppet, SRE, User-jbond: Update puppet infrastructure latest 5.5 version - https://phabricator.wikimedia.org/T265139 (jbond) Open→Resolved
[10:43:15] <wikibugs>	 Puppet, SRE, puppet-compiler, Patch-For-Review, User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (jbond)
[10:43:54] <wikibugs>	 Puppet, SRE, Patch-For-Review, User-jbond: hiera_lookup failing to preform lookups after hiera5 upgrade - https://phabricator.wikimedia.org/T258931 (jbond) p:Triage→Medium
[10:44:48] <arturo>	 !log merging homer change to cr-eqiad (T279342)
[10:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:56] <stashbot>	 T279342: Migrate colocated kafka-logging brokers to dedicated kafka-logging hosts - https://phabricator.wikimedia.org/T279342
[10:46:46] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.32.22:7001 on restbase2015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[10:47:10] <wikibugs>	 Puppet, puppet-compiler, Patch-For-Review, User-jbond: puppet populate failing on some nodes - https://phabricator.wikimedia.org/T248169 (jbond) Open→Resolved This is now possible using the `cumin:`  [[ https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Host_variable_override | Host...
[10:47:18] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.32.105:9042 on restbase2015 is CRITICAL: connect to address 10.192.32.105 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[10:47:38] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.32.105:7001 on restbase2015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[10:47:52] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.32.22:9042 on restbase2015 is CRITICAL: connect to address 10.192.32.22 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[10:48:28] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.32.25:7001 on restbase2015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[10:48:48] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.32.25:9042 on restbase2015 is CRITICAL: connect to address 10.192.32.25 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[10:49:20] <hnowlan>	 ^ that's me
[10:49:29] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2015.codfw.wmnet
[10:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:56] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: sonofgridengine: grid-configurator: introduce support for the new domain [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653)
[10:55:12] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.32.105:9042 on restbase2015 is OK: TCP OK - 0.033 second response time on 10.192.32.105 port 9042 https://phabricator.wikimedia.org/T93886
[10:55:12] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.32.25:9042 on restbase2015 is OK: TCP OK - 0.033 second response time on 10.192.32.25 port 9042 https://phabricator.wikimedia.org/T93886
[10:55:12] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.32.22:9042 on restbase2015 is OK: TCP OK - 0.033 second response time on 10.192.32.22 port 9042 https://phabricator.wikimedia.org/T93886
[10:55:12] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.32.105:7001 on restbase2015 is OK: SSL OK - Certificate restbase2015-c valid until 2022-10-08 10:53:48 +0000 (expires in 539 days) https://phabricator.wikimedia.org/T120662
[10:55:12] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.192.32.25:7001 on restbase2015 is OK: SSL OK - Certificate restbase2015-b valid until 2022-10-08 10:53:45 +0000 (expires in 539 days) https://phabricator.wikimedia.org/T120662
[10:55:13] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.32.22:7001 on restbase2015 is OK: SSL OK - Certificate restbase2015-a valid until 2022-10-08 10:53:43 +0000 (expires in 539 days) https://phabricator.wikimedia.org/T120662
[10:55:25] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2015.codfw.wmnet
[10:55:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:56] <wikibugs>	 SRE, netops: Allow bast1003 in management routers (and drop bast1002) - https://phabricator.wikimedia.org/T280253 (ayounsi) Open→Resolved a:ayounsi Done.
[10:58:13] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: Remove unused thirdparty/kubeadm-k8s-1-1[56] [puppet] - 10https://gerrit.wikimedia.org/r/680285 (owner: 10Majavah)
[10:58:19] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/680285 (owner: 10Majavah)
[11:01:17] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2016.codfw.wmnet
[11:01:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:55] <wikibugs>	 (CR) David Caro: wmcs: Add link to runbook on puppet alerts. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680254 (owner: David Caro)
[11:02:31] <moritzm>	 !log imported ferm 2.5.1-1+wmf1 to bullseye-wikimedia/main T275873
[11:02:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:41] <stashbot>	 T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873
[11:04:26] <wikibugs>	 (03PS4) 10Majavah: aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-18 [puppet] - 10https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299)
[11:08:19] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2016.codfw.wmnet
[11:08:21] <wikibugs>	 (03CR) 10Jcrespo: "vgutierrez: feel free to upgrade all client hosts to buster first :-) The downgrade to 1.0 was needed to support jessie hosts. We still ne" [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo)
[11:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:34] <wikibugs>	 (03CR) 10Jcrespo: "Also please help implement etcdv3/zookeeper backups so they can move away from jessie:" [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo)
[11:17:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Update aliases for Hue [puppet] - 10https://gerrit.wikimedia.org/r/680290
[11:31:11] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "I'll merge the code including the compat chunk for the old domain. It should be a NOOP if all nodes and config files are already using the" [puppet] - 10https://gerrit.wikimedia.org/r/677873 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez)
[11:39:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873)
[11:40:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: 10Muehlenhoff)
[11:56:53] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873)
[11:57:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: 10Muehlenhoff)
[12:00:35] <wikibugs>	 (03CR) 10Jcrespo: "One thing, let's add the current test host job names to the list of ignorelist for production monitoring at: https://phabricator.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: 10Muehlenhoff)
[12:06:20] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[12:09:18] <wikibugs>	 (03PS1) 10Urbanecm: testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680302 (https://phabricator.wikimedia.org/T279853)
[12:09:20] <wikibugs>	 (03PS1) 10Urbanecm: testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680303 (https://phabricator.wikimedia.org/T279853)
[12:09:22] <wikibugs>	 (03PS1) 10Urbanecm: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680304 (https://phabricator.wikimedia.org/T279853)
[12:09:48] <wikibugs>	 (03PS1) 10Jcrespo: backups: Disable bacula monintoring for sretest hosts [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873)
[12:10:10] <wikibugs>	 (03PS2) 10Jcrespo: backups: Disable bacula monintoring for sretest hosts [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873)
[12:10:37] <wikibugs>	 (03PS3) 10Jcrespo: backups: Disable bacula monitoring for sretest hosts [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873)
[12:11:03] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "not now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680303 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm)
[12:11:07] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-2] "not now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680304 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm)
[12:12:07] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "Only blocked on the specific day of the week in which they end up setup (currently, Friday)- which is deterministic but spread out among t" [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) (owner: 10Jcrespo)
[12:12:19] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "this sounds like a good thing to do" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679938 (owner: 10Gergő Tisza)
[12:14:30] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic1060-production-search-eqiad on elastic1060 is CRITICAL: 135.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-eqiad&var-instance=elastic1060&panelId=37
[12:15:26] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[12:15:26] <dcausse>	 should recover in a sec ^
[12:21:18] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: fix reimage race on /var/log/swift [puppet] - 10https://gerrit.wikimedia.org/r/680307
[12:21:20] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: extend retention to 5y [puppet] - 10https://gerrit.wikimedia.org/r/680308
[12:21:22] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: hide diffs for files with sensitive data [puppet] - 10https://gerrit.wikimedia.org/r/680309 (https://phabricator.wikimedia.org/T280257)
[12:22:21] <jayme>	 !log updated envoyproxy to 1.15.4-1 on 'A:mw-canary or A:restbase-canary'
[12:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' .
[12:25:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:12] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 24717 bytes in 3.376 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:31:28] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 24706 bytes in 0.558 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:31:58] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 24705 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:37:18] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' .
[12:37:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:39] <wikibugs>	 (03PS1) 10JMeybohm: Revert "citoid: Update envoy image to 1.15.4-1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/680024
[12:39:02] <wikibugs>	 (03PS2) 10JMeybohm: Revert "citoid: Update envoy image to 1.15.4-1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/680024 (https://phabricator.wikimedia.org/T280317)
[12:41:02] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "this should be good to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676930 (https://phabricator.wikimedia.org/T214139) (owner: 10Amire80)
[12:41:18] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2017.codfw.wmnet
[12:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:59] <wikibugs>	 (03PS3) 10Jbond: Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: 10Muehlenhoff)
[12:46:21] <wikibugs>	 (03CR) 10Jcrespo: "sorry, wrong patch." [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: 10Muehlenhoff)
[12:47:12] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2017.codfw.wmnet
[12:47:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:43] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
[12:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:26] <wikibugs>	 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10Manuel) @KFrancis Thank you for sending the document! I just reviewed and signed.
[12:48:37] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2018.codfw.wmnet
[12:48:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:02] <wikibugs>	 Puppet, SRE, User-jbond: wmf-stylguid checks: unable to ignore violations inside roles - https://phabricator.wikimedia.org/T280353 (jbond) p:Triage→Low
[12:54:34] <wikibugs>	 (03PS2) 10Elukey: Update aliases for Hue [puppet] - 10https://gerrit.wikimedia.org/r/680290 (owner: 10Muehlenhoff)
[12:54:51] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2018.codfw.wmnet
[12:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:01] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Thank youuu I missed one!" [puppet] - 10https://gerrit.wikimedia.org/r/680290 (owner: 10Muehlenhoff)
[12:55:27] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Update aliases for Hue [puppet] - 10https://gerrit.wikimedia.org/r/680290 (owner: 10Muehlenhoff)
[12:55:33] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Update aliases for Hue [puppet] - 10https://gerrit.wikimedia.org/r/680290 (owner: 10Muehlenhoff)
[12:55:53] <wikibugs>	 (03PS4) 10Muehlenhoff: Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873)
[12:59:12] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2019.codfw.wmnet
[12:59:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:52] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic1060-production-search-eqiad on elastic1060 is OK: (C)100 gt (W)80 gt 16.26 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-eqiad&var-instance=elastic1060&panelId=37
[13:07:33] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2019.codfw.wmnet
[13:07:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:57] <godog>	 jayme: FYI prometheus can't scrape etcd metrics on conf200[456]:4001 since earlier today
[13:11:09] <jayme>	 godog: oh, thanks. Will take a look
[13:13:02] <godog>	 np!
[13:20:14] <wikibugs>	 (03CR) 10Amire80: "Thanks for +1!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676930 (https://phabricator.wikimedia.org/T214139) (owner: 10Amire80)
[13:21:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable bacula on sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680298 (https://phabricator.wikimedia.org/T257873) (owner: 10Muehlenhoff)
[13:25:02] <wikibugs>	 (03PS1) 10Ladsgroup: exim: Add support for handling mailman3 inside mailman2 conf [puppet] - 10https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612)
[13:25:30] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:26:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] exim: Add support for handling mailman3 inside mailman2 conf [puppet] - 10https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[13:26:21] <wikibugs>	 (03PS4) 10Jcrespo: backups: Disable bacula monitoring for sretest hosts [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873)
[13:26:32] <wikibugs>	 (03PS5) 10Jcrespo: backups: Disable bacula monitoring for sretest hosts [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873)
[13:27:27] <wikibugs>	 (03PS6) 10Jcrespo: backups: Disable bacula monitoring for sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873)
[13:27:36] <wikibugs>	 (03PS2) 10Ladsgroup: exim: Add support for handling mailman3 inside mailman2 conf [puppet] - 10https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612)
[13:29:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) (owner: 10Jcrespo)
[13:29:49] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] backups: Disable bacula monitoring for sretest1002 [puppet] - 10https://gerrit.wikimedia.org/r/680305 (https://phabricator.wikimedia.org/T257873) (owner: 10Jcrespo)
[13:31:31] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[13:34:35] <wikibugs>	 (03PS1) 10David Caro: WIP wmcs.enc: Add role of the machine [puppet] - 10https://gerrit.wikimedia.org/r/680329 (https://phabricator.wikimedia.org/T280324)
[13:39:58] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:56:31] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10JMeybohm) The tlsproxy currently serves a certificate not valid for conf200[4,5,6] (Prometheus errors with: `Get https://conf2004:4001/metrics: x509: certificate is valid for conf2001.cod...
[14:01:21] <wikibugs>	 (CR) Muehlenhoff: systemd::timer::job: update mailing script with additional options (1 comment) [puppet] - https://gerrit.wikimedia.org/r/679292 (owner: Jbond)
[14:06:14] <wikibugs>	 (03PS1) 10Ladsgroup: exim: Fix TLS cert for cloud (in mailman) [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612)
[14:06:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] exim: Fix TLS cert for cloud (in mailman) [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[14:07:41] <wikibugs>	 (03PS2) 10Ladsgroup: exim: Fix TLS cert for cloud (in mailman) [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612)
[14:08:28] <wikibugs>	 (03PS1) 10Herron: Revert "kafka-logging1003: disable notifications during setup" [puppet] - 10https://gerrit.wikimedia.org/r/680025
[14:08:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] exim: Fix TLS cert for cloud (in mailman) [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[14:09:53] <wikibugs>	 (03PS3) 10Ladsgroup: exim: Fix TLS cert for cloud (in mailman) [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612)
[14:11:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] exim: Fix TLS cert for cloud (in mailman) [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[14:11:29] <wikibugs>	 (03CR) 10Herron: [C: 03+2] Revert "kafka-logging1003: disable notifications during setup" [puppet] - 10https://gerrit.wikimedia.org/r/680025 (owner: 10Herron)
[14:12:08] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Relax CSP rules for doc.wm.org/mediawiki-tools-phan-SecurityCheckPlugin [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301)
[14:14:53] <wikibugs>	 (03PS4) 10Ladsgroup: exim: Fix TLS cert for cloud (in mailman) [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612)
[14:18:18] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2020.codfw.wmnet
[14:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:57] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2020.codfw.wmnet
[14:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:51] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.153:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[14:28:23] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[14:29:15] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.16.154:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[14:29:19] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.16.155:7001 on restbase2021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[14:31:19] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2021.codfw.wmnet
[14:31:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:51] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 76 probes of 620 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:37:09] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.16.153:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-a valid until 2022-01-15 15:52:56 +0000 (expires in 274 days) https://phabricator.wikimedia.org/T120662
[14:37:09] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.192.16.154:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-b valid until 2022-01-15 15:52:58 +0000 (expires in 274 days) https://phabricator.wikimedia.org/T120662
[14:37:15] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.16.155:7001 on restbase2021 is OK: SSL OK - Certificate restbase2021-c valid until 2022-01-15 15:53:01 +0000 (expires in 274 days) https://phabricator.wikimedia.org/T120662
[14:38:53] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is OK: TCP OK - 0.032 second response time on 10.192.16.153 port 9042 https://phabricator.wikimedia.org/T93886
[14:39:22] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2021.codfw.wmnet
[14:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:35] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 58 probes of 620 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:42:35] <wikibugs>	 (03PS3) 10Herron: replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565)
[14:43:06] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2022.codfw.wmnet
[14:43:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:40] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Ladsgroup) https://polymorphic.lists.wmcloud.org/pipermail/test-on-two/2021-April/000001.html  https://polymorphic.lists.wmcloud.org/mailman3/postor...
[14:44:42] <wikibugs>	 (CR) Herron: replace mwlog1001 with new mwlog[12]002 hosts (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: Herron)
[14:44:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron)
[14:49:49] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2022.codfw.wmnet
[14:49:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:15] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2023.codfw.wmnet
[14:50:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:58] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on restbase-dev[1005-1006].eqiad.wmnet with reason: restarting for kernel update
[14:51:59] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on restbase-dev[1005-1006].eqiad.wmnet with reason: restarting for kernel update
[14:52:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:16] <wikibugs>	 10SRE: creation of raju@wikipedia.org for fundraising team - https://phabricator.wikimedia.org/T280371 (10MNoorWMF)
[14:56:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:46] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on restbase-dev1006.eqiad.wmnet with reason: restarting for kernel update
[14:58:48] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on restbase-dev1006.eqiad.wmnet with reason: restarting for kernel update
[14:58:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:32] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2023.codfw.wmnet
[14:59:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:26] <hnowlan>	 (∩｀-´)⊃━☆ﾟ.*･｡ﾟ ~*restbase spam complete*~
[15:09:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1016.wikime...
[15:14:42] <wikibugs>	 (CR) Jbond: [C: -1] "Looks good just a minor issue, see comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680329 (https://phabricator.wikimedia.org/T280324) (owner: David Caro)
[15:15:37] <wikibugs>	 (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/680368
[15:15:45] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/680368 (owner: 10Kosta Harlan)
[15:19:04] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/680368 (owner: 10Kosta Harlan)
[15:22:45] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[15:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:19] <wikibugs>	 (03PS8) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292
[15:27:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1016.wikimedia.org'] `  Of which those **FAILED**: ` ['clou...
[15:28:14] <wikibugs>	 10Puppet, 10SRE-tools, 10Patch-For-Review, 10Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) >>! In T247364#7008439, @MoritzMuehlenhoff wrote: > @crusnov Could you please take modules/raid/files/check-raid.py...
[15:29:19] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Ladsgroup) exim fully works.  The only thing broken currently is the archiver. The apache config is being weird probably: ` WARNING 2021-04-16 15:25...
[15:29:55] <wikibugs>	 (03PS9) 10Jbond: systemd::timer::job: update mailing script with additional options [puppet] - 10https://gerrit.wikimedia.org/r/679292
[15:29:57] <wikibugs>	 (03PS4) 10Jbond: check_cumin_aliases: ensure script exits 1 on error [puppet] - 10https://gerrit.wikimedia.org/r/679297
[15:30:18] <wikibugs>	 (03PS7) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293
[15:31:15] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' .
[15:31:15] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[15:31:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:34] <wikibugs>	 (03PS8) 10Jbond: P:debmonitor::client: migrate timer::job to use send_mail [puppet] - 10https://gerrit.wikimedia.org/r/679293
[15:33:11] <wikibugs>	 (CR) Jbond: "updated thanks" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/679292 (owner: Jbond)
[15:34:38] <wikibugs>	 (03PS1) 10Ppchelko: Envoy: set per_try_timeout for eventgate-main. [puppet] - 10https://gerrit.wikimedia.org/r/680372 (https://phabricator.wikimedia.org/T249745)
[15:36:37] <wikibugs>	 (03CR) 10Ppchelko: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko)
[15:36:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1016.wikime...
[15:37:47] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (10Ladsgroup) ` [Fri Apr 16 15:25:54.109507 2021] [proxy_http:debug] [pid 12009:tid 140593783097088] mod_proxy_http.c(1920): [client 172.16.4.88:59876]...
[15:43:21] <icinga-wm>	 PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[15:43:31] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' .
[15:43:32] <logmsgbot>	 !log urbanecm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[15:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:15] <wikibugs>	 (03PS1) 10Ottomata: Update eventgate-logging-external kafka brokers list [deployment-charts] - 10https://gerrit.wikimedia.org/r/680375 (https://phabricator.wikimedia.org/T279342)
[15:51:37] <wikibugs>	 (03PS1) 10David Caro: icinga: allow clearing a downtime for a host [puppet] - 10https://gerrit.wikimedia.org/r/680376
[15:52:37] <wikibugs>	 (03PS2) 10Ottomata: Update eventgate-logging-external kafka brokers list in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/680375 (https://phabricator.wikimedia.org/T279342)
[15:52:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga: allow clearing a downtime for a host [puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro)
[15:53:36] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Update eventgate-logging-external kafka brokers list in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/680375 (https://phabricator.wikimedia.org/T279342) (owner: 10Ottomata)
[15:54:23] <wikibugs>	 (03PS2) 10David Caro: icinga: allow clearing a downtime for a host [puppet] - 10https://gerrit.wikimedia.org/r/680376
[15:55:27] <wikibugs>	 (03PS1) 10Jbond: P:debmonitopr::client: add correct owner to ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/680378
[15:56:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29097/console" [puppet] - 10https://gerrit.wikimedia.org/r/680378 (owner: 10Jbond)
[15:56:38] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Update eventgate-logging-external kafka brokers list in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/680375 (https://phabricator.wikimedia.org/T279342) (owner: 10Ottomata)
[15:56:39] <icinga-wm>	 RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[15:57:00] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitopr::client: add correct owner to ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/680378 (owner: 10Jbond)
[15:57:28] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[15:57:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:37] <wikibugs>	 10SRE, 10Analytics, 10Event-Platform, 10serviceops, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10Ottomata)
[15:58:31] <icinga-wm>	 RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:39] <wikibugs>	 10SRE, 10Traffic: Protect our users against Google-driven privacy breach via FLOC - https://phabricator.wikimedia.org/T280377 (10Joe)
[16:02:35] <wikibugs>	 10SRE, 10Traffic: Protect our users against Google-driven privacy breach via FLOC - https://phabricator.wikimedia.org/T280377 (10Joe) Please note that while we surely won't use the js api in our base javascript, this is intended as a defensive measure for all third-party js and software we run,
[16:03:44] <wikibugs>	 (03CR) 10Jcrespo: "Looking good, but let's battle test it next week (not on friday afternoon) for the edge cases (e.g. not requiring a message for removing d" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro)
[16:05:16] <wikibugs>	 (03PS1) 10Ottomata: eventgate-logging-external - update networkpolicy with new kafka brokers [deployment-charts] - 10https://gerrit.wikimedia.org/r/680380 (https://phabricator.wikimedia.org/T279342)
[16:05:31] <wikibugs>	 (03CR) 10Ottomata: "Ohp, this is needed too." [deployment-charts] - 10https://gerrit.wikimedia.org/r/680380 (https://phabricator.wikimedia.org/T279342) (owner: 10Ottomata)
[16:05:58] <wikibugs>	 (03PS2) 10Ottomata: eventgate-logging-external - update networkpolicy with new kafka brokers [deployment-charts] - 10https://gerrit.wikimedia.org/r/680380 (https://phabricator.wikimedia.org/T279342)
[16:06:16] <wikibugs>	 10SRE, 10Traffic: Protect our users against Google-driven privacy breach via FLOC - https://phabricator.wikimedia.org/T280377 (10Joe)
[16:07:23] <wikibugs>	 10SRE, 10Privacy Engineering, 10Traffic, 10Patch-For-Review, 10Privacy: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10Joe)
[16:08:31] <wikibugs>	 10SRE, 10Privacy Engineering, 10Traffic, 10Patch-For-Review, 10Privacy: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10Joe) >>! In T279804#7003701, @ori wrote: > Is the header needed at all? >  > https://github.com/WICG/floc...
[16:08:35] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external - update networkpolicy with new kafka brokers [deployment-charts] - 10https://gerrit.wikimedia.org/r/680380 (https://phabricator.wikimedia.org/T279342) (owner: 10Ottomata)
[16:09:14] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[16:09:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:46] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[16:13:47] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[16:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:14] <wikibugs>	 (03PS1) 10Elukey: hadoop: improve default log4j config [puppet] - 10https://gerrit.wikimedia.org/r/680383 (https://phabricator.wikimedia.org/T276906)
[16:25:47] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (10JKatzWMF) Not sure why it didn't work, but thanks for walking through it with me yesterday.  Added eyener@wikimedia.org to the domains you requested, @Pco...
[16:25:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "not tested but lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro)
[16:37:47] <wikibugs>	 10SRE, 10Analytics, 10Event-Platform, 10serviceops, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10Ottomata) Actually, I'm not sure even just doing LVS would help here.  The helmfiles networkpolicy explicitly lists IP addresses that the servic...
[16:52:59] <wikibugs>	 (03CR) 10Ladsgroup: [C: 04-1] "It doesn't fix the issue." [puppet] - 10https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup)
[16:55:28] <wikibugs>	 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Schlurcher) Thanks for adding me as a subscriber, as my bot apparently caused this issue. The bot has been performing these actions with an edit rat...
[16:56:34] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Jdforrester-WMF)
[16:59:59] <logmsgbot>	 !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[17:00:00] <logmsgbot>	 !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[17:00:02] <logmsgbot>	 !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[17:00:04] <logmsgbot>	 !log ryankemper@cumin2001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[17:00:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:23] <icinga-wm>	 PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[17:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:36] <ryankemper>	 !log T267927 Following data transfers complete: `wdqs1004`->`wdqs1007`, `wdqs2001`->`wdqs2003`, `wdqs1003`->`wdqs1008`, `wdqs2008`->`wdqs2004`
[17:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:44] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[17:03:46] <ryankemper>	 !log T267927 Pooled `wdqs1007`, `wdqs2003`, `wdqs1008`, `wdqs2004`
[17:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:53] <icinga-wm>	 PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:07:03] <icinga-wm>	 PROBLEM - puppet last run on wdqs2004 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:07:03] <icinga-wm>	 PROBLEM - puppet last run on wdqs2003 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:07:53] <icinga-wm>	 PROBLEM - puppet last run on wdqs1008 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:08:35] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:10:43] <icinga-wm>	 PROBLEM - puppet last run on wdqs2008 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:13:23] <icinga-wm>	 RECOVERY - puppet last run on wdqs2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:13:23] <icinga-wm>	 RECOVERY - puppet last run on wdqs2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:14:11] <icinga-wm>	 RECOVERY - puppet last run on wdqs1008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:15:37] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:17:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloudcephosd1016.wikime...
[17:22:45] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:23:19] <wikibugs>	 10SRE, 10Privacy Engineering, 10Traffic, 10fundraising-tech-ops, and 2 others: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10Dwisehaupt)
[17:23:25] <icinga-wm>	 RECOVERY - puppet last run on wdqs2008 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:29:43] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "just a comment" [puppet] - 10https://gerrit.wikimedia.org/r/679530 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn)
[17:30:24] <wikibugs>	 (03PS4) 10Dzahn: trafficserver: remove comment about a server that won't exist anymore [puppet] - 10https://gerrit.wikimedia.org/r/679530 (https://phabricator.wikimedia.org/T280203)
[17:31:37] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1016.wikimedia.org with reason: REIMAGE
[17:31:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:52] <wikibugs>	 10SRE, 10Privacy Engineering, 10Traffic, 10fundraising-tech-ops, and 2 others: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10Dwisehaupt) Header added to fundraising nginx templates and deployed. ` [frack::puppet] 58ed92cf Add...
[17:32:06] <wikibugs>	 (03PS1) 10Dzahn: conftool/DCHP: decom mwdebug1003 [puppet] - 10https://gerrit.wikimedia.org/r/680391
[17:32:11] <icinga-wm>	 RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:33:11] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1017.wikimedia.org with reason: REIMAGE
[17:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:47] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1016.wikimedia.org with reason: REIMAGE
[17:33:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:00] <wikibugs>	 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Legoktm) >>! In T280232#7009873, @Schlurcher wrote: > Thanks for adding me as a subscriber, as my bot apparently caused this issue. The bot has been...
[17:34:21] <wikibugs>	 (03PS1) 10Dzahn: trafficserver: remove mwdebug1003 from x-wikimedia-debug-routing [puppet] - 10https://gerrit.wikimedia.org/r/680393
[17:34:57] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mwdebug1003.eqiad.wmnet
[17:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:19] <mutante>	 !log depooling mwdebug1003 (stretch VM, will be removed), mwdebug1001/1002 (buster) are unchanged
[17:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:46] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1017.wikimedia.org with reason: REIMAGE
[17:35:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) cloudcephosd1018:  Broadcom UNDI PXE-2.1 v21.6.0 Copyright (C) 2000-2020 Broadcom Corporation Copyright (C) 1997-2000 Inte...
[17:37:11] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1019.wikimedia.org with reason: REIMAGE
[17:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH)
[17:39:15] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1019.wikimedia.org with reason: REIMAGE
[17:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:38] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1020.wikimedia.org with reason: REIMAGE
[17:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:43] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1020.wikimedia.org with reason: REIMAGE
[17:41:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:47] <wikibugs>	 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10KFrancis) @Manuel I am confirming receipt of the signed NDA.  Please proceed with next steps.  Thanks!
[17:47:47] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[17:47:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:15] <ryankemper>	 !log T267927 Transferring from `wdqs2008`->`wdqs2003` to resolve the data corruption on `wdqs2003`
[17:48:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:22] <stashbot>	 T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927
[17:49:19] <wikibugs>	 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10Dzahn) Thanks @KFrancis !  @Manuel please register a user on the https://wikitech.wikimedia.org wiki and let us know the username you picked.  Then we can add you to the LDAP groups....
[17:49:58] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (10Pcoombe) @JKatzWMF Thanks. Can you add Erin to the mobile *.m.wikipedia subdomains as well? Sorry to be a pain!
[17:53:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH)
[17:55:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) a:05RobH→03Jclark-ctr >>! In T274945#7010007, @RobH wrote: > cloudcephosd1018: >  > Broadcom UNDI PXE-2.1 v21.6.0 > Co...
[17:59:47] <wikibugs>	 (03PS1) 10Cwhite: logstash: limit apifeatureusage curator job to jobs_host [puppet] - 10https://gerrit.wikimedia.org/r/680399 (https://phabricator.wikimedia.org/T274394)
[18:01:40] <wikibugs>	 (03CR) 10Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1003/29098/" [puppet] - 10https://gerrit.wikimedia.org/r/680399 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite)
[18:05:39] <wikibugs>	 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jclark-ctr)
[18:05:51] <wikibugs>	 (03CR) 10Cwhite: "I suspect there is an issue with the apifeatureusage forcemerge action.  Due to all the instances erroring simultaneously for different re" [puppet] - 10https://gerrit.wikimedia.org/r/680399 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite)
[18:07:18] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] swift: fix reimage race on /var/log/swift [puppet] - 10https://gerrit.wikimedia.org/r/680307 (owner: 10Filippo Giunchedi)
[18:07:39] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] swift: hide diffs for files with sensitive data [puppet] - 10https://gerrit.wikimedia.org/r/680309 (https://phabricator.wikimedia.org/T280257) (owner: 10Filippo Giunchedi)
[18:08:01] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] thanos: extend retention to 5y [puppet] - 10https://gerrit.wikimedia.org/r/680308 (owner: 10Filippo Giunchedi)
[18:09:47] <wikibugs>	 (03Abandoned) 10Cwhite: logstash: use curator cluster config when possible [puppet] - 10https://gerrit.wikimedia.org/r/676631 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite)
[18:10:03] <wikibugs>	 (03Abandoned) 10Cwhite: logstash: set cluster name for elasticsearch outputs [puppet] - 10https://gerrit.wikimedia.org/r/676685 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite)
[18:10:28] <wikibugs>	 (03Abandoned) 10Cwhite: logstash: use logstash output to manage ecs-test indexes [puppet] - 10https://gerrit.wikimedia.org/r/676687 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite)
[18:10:54] <wikibugs>	 (03Abandoned) 10Cwhite: logstash: add curator config to manage w3creportingapi revision 1 indexes [puppet] - 10https://gerrit.wikimedia.org/r/676690 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite)
[18:10:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10Jclark-ctr) @Cmjohnson  my mistake it was C5 for 1003. netbox was correct though
[18:11:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[18:13:01] <wikibugs>	 10SRE, 10OTRS, 10Security, 10User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (10Keegan)
[18:13:08] <icinga-wm>	 RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[18:13:25] <wikibugs>	 10SRE, 10OTRS, 10Security, 10User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (10Keegan) 05Open→03Resolved
[18:14:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10Jclark-ctr)
[18:16:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10Jclark-ctr)
[18:17:38] <wikibugs>	 (03PS21) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855)
[18:17:39] <wikibugs>	 (03CR) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov)
[18:21:16] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (10JKatzWMF) @Pcoombe my bad.  done now.
[18:21:43] <wikibugs>	 (03CR) 10CRusnov: "I may have addressed the comments." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov)
[18:24:30] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for pcoombe@wikimedia.org - https://phabricator.wikimedia.org/T277065 (10EYener) Thanks @JKatzWMF - I'm seeing the mobile subdomains now! And thanks @Pcoombe for all your help here.
[18:27:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov)
[18:34:49] <wikibugs>	 (03PS22) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855)
[18:41:05] <wikibugs>	 (03PS1) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405
[18:42:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855) (owner: 10CRusnov)
[18:46:54] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: Rename deployment-cache-(text|upload)0x to deployment-cp0x - https://phabricator.wikimedia.org/T280393 (10Krinkle) p:05Triage→03Low Yeah, there's no rush definitely. Just something to keep in mind for next time there's something to do around these instances.
[18:53:16] <wikibugs>	 (03PS2) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405
[19:00:29] <wikibugs>	 (03PS3) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405
[19:06:48] <wikibugs>	 (03PS4) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405
[19:16:30] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[19:16:46] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:42] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.211 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[19:18:42] <wikibugs>	 (03PS5) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405
[19:23:46] <wikibugs>	 (03PS6) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405
[20:40:33] <Trey314159>	 !log reindexing wikidata on cloudelastic... AGAIN (T274200)
[20:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:42] <stashbot>	 T274200: Reindex English and Italian wikis to enable homoglyph plugin - https://phabricator.wikimedia.org/T274200
[20:47:04] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] varnish: add anti-FLoC header to responses [puppet] - 10https://gerrit.wikimedia.org/r/679866 (https://phabricator.wikimedia.org/T279804) (owner: 10Dave Pifke)
[20:54:49] <wikibugs>	 (03PS23) 10CRusnov: dhcp: Add module for manipulating dynamic DHCP entries [software/spicerack] - 10https://gerrit.wikimedia.org/r/675932 (https://phabricator.wikimedia.org/T269855)
[21:09:26] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 147.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37
[21:09:34] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-chi-eqiad on cloudelastic1006 is CRITICAL: 138.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1006&panelId=37
[21:09:53] <wikibugs>	 10SRE: Role with quote in description causes bash syntax error - https://phabricator.wikimedia.org/T276868 (10razzi) 05Open→03Resolved a:03razzi This has been fixed!
[21:40:36] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 114.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37
[21:47:56] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: Create a mailing list for ptwikinews - https://phabricator.wikimedia.org/T280408 (10Peachey88)
[21:52:46] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[22:03:24] <wikibugs>	 10SRE, 10Privacy Engineering, 10Traffic, 10fundraising-tech-ops, and 2 others: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC) - https://phabricator.wikimedia.org/T279804 (10dpifke) 05Open→03Resolved a:03dpifke
[22:04:34] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[22:09:38] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[22:41:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10wiki_willy) Hi @Jclark-ctr - there's a Netbox error associated with these serial numbers also.  Looks like cloudcephosd1019 and...
[22:50:50] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[23:34:18] <wikibugs>	 SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[23:36:43] <wikibugs>	 (PS1) Dzahn: DHCP: switch mw1307 to use buster installer [puppet] - https://gerrit.wikimedia.org/r/680483 (https://phabricator.wikimedia.org/T245757)
[23:36:55] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw[1402-1403].eqiad.wmnet with reason: reimage
[23:36:55] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw[1402-1403].eqiad.wmnet with reason: reimage
[23:37:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:11] <wikibugs>	 SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[23:39:15] <wikibugs>	 SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[23:39:41] <mutante>	 !log reimaging last 3 remaining stretch appservers with buster, mw1307, mw1402, mw1403
[23:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:00] <wikibugs>	 SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[23:41:34] <wikibugs>	 (CR) Dzahn: [C: +2] DHCP: switch mw1307 to use buster installer [puppet] - https://gerrit.wikimedia.org/r/680483 (https://phabricator.wikimedia.org/T245757) (owner: Dzahn)
[23:47:28] <mutante>	 !log decom'ing mwdebug1003, stretch VM created in T267248
[23:47:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:37] <stashbot>	 T267248: create mwdebug1003 - ganeti VM with buster and appserver role - https://phabricator.wikimedia.org/T267248
[23:47:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mwdebug1003.eqiad.wmnet
[23:48:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:46] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mwdebug1003.eqiad.wmnet
[23:48:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:53] <wikibugs>	 (CR) Dzahn: [C: +2] conftool/DCHP: decom mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/680391 (owner: Dzahn)
[23:49:59] <wikibugs>	 (PS2) Dzahn: conftool/DCHP: decom mwdebug1003 [puppet] - https://gerrit.wikimedia.org/r/680391
[23:52:09] <wikibugs>	 SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[23:53:28] <wikibugs>	 SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn) Remaining 3 special cases kept on stretch now reimaged to buster as well.  Decom'ed mwdebug1...
[23:56:07] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1402.eqiad.wmnet with reason: REIMAGE
[23:56:12] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1403.eqiad.wmnet with reason: REIMAGE
[23:56:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:09] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1402.eqiad.wmnet with reason: REIMAGE
[23:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:16] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mwdebug1003.eqiad.wmnet
[23:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:16] <wikibugs>	 (PS2) Dzahn: trafficserver: remove mwdebug1003 from x-wikimedia-debug-routing [puppet] - https://gerrit.wikimedia.org/r/680393 (https://phabricator.wikimedia.org/T267248)