[00:05:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:07:55] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1158932
[00:07:55] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1158932 (owner: TrainBranchBot)
[00:10:12] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:22:45] <jinxer-wm>	 FIRING: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:27:37] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1158932 (owner: TrainBranchBot)
[00:52:45] <jinxer-wm>	 FIRING: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:57:45] <jinxer-wm>	 RESOLVED: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:33:29] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on aqs1012 is CRITICAL: CRITICAL: State: degraded, Active: 8, Working: 8, Failed: 4, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T396970 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:33:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970 (ops-monitoring-bot) NEW
[02:01:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[02:06:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[02:09:12] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:13:03] <wikibugs>	 SRE, SRE Observability: monitoring ACKs should be delivered via SMS - https://phabricator.wikimedia.org/T396894#10916279 (lmata)
[02:22:18] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:25:18] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:31:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:57:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:14:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:17:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:05:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) (owner: Anzx)
[04:14:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:17:28] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:52:02] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
[04:57:31] <wikibugs>	 (PS1) Marostegui: db2204: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1159067 (https://phabricator.wikimedia.org/T396549)
[04:57:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2204 T396549', diff saved to https://phabricator.wikimedia.org/P77957 and previous config saved to /var/cache/conftool/dbconfig/20250616-045738-marostegui.json
[04:57:43] <stashbot>	 T396549: Migrate s2 to MariaDB 10.11 - https://phabricator.wikimedia.org/T396549
[04:58:07] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2204.codfw.wmnet with reason: Maintenance
[04:58:13] <wikibugs>	 (CR) Marostegui: [C:+2] db2204: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1159067 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui)
[05:01:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77958 and previous config saved to /var/cache/conftool/dbconfig/20250616-050139-root.json
[05:02:52] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1159072 (https://phabricator.wikimedia.org/T396976)
[05:06:12] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[05:06:30] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:06:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T396130)', diff saved to https://phabricator.wikimedia.org/P77959 and previous config saved to /var/cache/conftool/dbconfig/20250616-050637-marostegui.json
[05:06:41] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77960 and previous config saved to /var/cache/conftool/dbconfig/20250616-051644-root.json
[05:20:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:25:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T396130)', diff saved to https://phabricator.wikimedia.org/P77961 and previous config saved to /var/cache/conftool/dbconfig/20250616-052530-marostegui.json
[05:25:34] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[05:29:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10916392 (Stevemunene) Did the raid config with ` stevemunene@an-worker1157:~$  sudo perccli64 /c0 add vd each r0 w...
[05:31:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77962 and previous config saved to /var/cache/conftool/dbconfig/20250616-053150-root.json
[05:33:37] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1160.eqiad.wmnet
[05:35:08] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1160.eqiad.wmnet
[05:35:43] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1162.eqiad.wmnet
[05:37:10] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1156741 (https://phabricator.wikimedia.org/T395871) (owner: Svantje Lilienthal)
[05:37:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10916401 (Stevemunene)
[05:37:23] <logmsgbot>	 !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1162.eqiad.wmnet
[05:38:47] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1161.eqiad.wmnet
[05:40:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P77963 and previous config saved to /var/cache/conftool/dbconfig/20250616-054037-marostegui.json
[05:41:42] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1161.eqiad.wmnet
[05:42:19] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1161.eqiad.wmnet
[05:42:51] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1161.eqiad.wmnet
[05:43:31] <wikibugs>	 (CR) Marostegui: [C:+1] Remove unused role [puppet] - https://gerrit.wikimedia.org/r/1156191 (owner: Muehlenhoff)
[05:43:50] <wikibugs>	 (CR) Marostegui: [C:+1] conftool: Allow es6 and es7 being set to read only via dbctl [puppet] - https://gerrit.wikimedia.org/r/1153723 (https://phabricator.wikimedia.org/T395696) (owner: Ladsgroup)
[05:44:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10916407 (Stevemunene) Did the raid config with ` stevemunene@an-worker1160:~$ sudo perccli64 /c0 add vd each r0 wb r...
[05:46:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77964 and previous config saved to /var/cache/conftool/dbconfig/20250616-054656-root.json
[05:48:58] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 151326
[05:49:12] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 151326
[05:55:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P77965 and previous config saved to /var/cache/conftool/dbconfig/20250616-055545-marostegui.json
[05:56:58] <wikibugs>	 (PS1) Stevemunene: hdfs: Add group 7_8 remove group 9_10 hosts from cluster [puppet] - https://gerrit.wikimedia.org/r/1159102 (https://phabricator.wikimedia.org/T390174)
[05:57:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10916421 (Stevemunene)
[05:58:27] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10916422 (Stevemunene) a:Stevemunene
[05:58:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10916424 (Stevemunene) a:Stevemunene
[06:02:02] <wikibugs>	 ops-codfw, SRE, Data-Platform-SRE, DC-Ops: cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#10916427 (ayounsi)
[06:02:49] <wikibugs>	 ops-codfw, SRE, Data-Platform-SRE, DC-Ops: cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#10916429 (ayounsi) I added the #data-platform-sre tag to the task, I think @bking was recently working on those hosts.
[06:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:10:10] <wikibugs>	 (PS4) Stevemunene: zookeeper: onboard an-conf1006 to the cluster [puppet] - https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922)
[06:10:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T396130)', diff saved to https://phabricator.wikimedia.org/P77966 and previous config saved to /var/cache/conftool/dbconfig/20250616-061053-marostegui.json
[06:10:57] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:11:08] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[06:25:29] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[06:25:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T396130)', diff saved to https://phabricator.wikimedia.org/P77967 and previous config saved to /var/cache/conftool/dbconfig/20250616-062536-marostegui.json
[06:25:40] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:27:05] <wikibugs>	 SRE-tools, Infrastructure-Foundations: decommission cookbook: add support for decom spreadsheet - https://phabricator.wikimedia.org/T244315#10916607 (Volans) Open→Resolved a:Volans Sounds good! Resolving this, happy to discuss further improvements whenever you want.
[06:31:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T396130)', diff saved to https://phabricator.wikimedia.org/P77968 and previous config saved to /var/cache/conftool/dbconfig/20250616-064117-marostegui.json
[06:41:22] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:46:08] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove unused role [puppet] - https://gerrit.wikimedia.org/r/1156191 (owner: Muehlenhoff)
[06:47:14] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts install7001.wikimedia.org
[06:47:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:50:20] <logmsgbot>	 jmm@cumin1003 decommission (PID 1791650) is awaiting input
[06:52:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:53:37] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete keytab [labs/private] - https://gerrit.wikimedia.org/r/1159125
[06:54:21] <wikibugs>	 (CR) Brouberol: zookeeper: onboard an-conf1006 to the cluster [puppet] - https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922) (owner: Stevemunene)
[06:54:54] <wikibugs>	 (CR) Brouberol: [C:+1] hdfs: Add group 7_8 remove group 9_10 hosts from cluster [puppet] - https://gerrit.wikimedia.org/r/1159102 (https://phabricator.wikimedia.org/T390174) (owner: Stevemunene)
[06:55:41] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[06:56:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77969 and previous config saved to /var/cache/conftool/dbconfig/20250616-065625-marostegui.json
[06:56:26] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete keytab [labs/private] - https://gerrit.wikimedia.org/r/1159125 (owner: Muehlenhoff)
[06:57:05] <wikibugs>	 (CR) Brouberol: [C:+1] Remove obsolete analytics_cluster::postgresql role and profile [puppet] - https://gerrit.wikimedia.org/r/1155720 (https://phabricator.wikimedia.org/T395557) (owner: Btullis)
[06:57:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job squid in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:57:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:59:35] <wikibugs>	 (PS9) Btullis: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378)
[06:59:50] <wikibugs>	 (CR) Brouberol: "This now requires a chart version bump" [deployment-charts] - https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: Btullis)
[07:00:00] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T0700).
[07:00:05] <jouncebot>	 anzx and WMDE-Fisch: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:22] <WMDE-Fisch>	 o/
[07:00:24] <wikibugs>	 (PS4) Brouberol: mediawiki: define a dumps suspended CronJob [deployment-charts] - https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786)
[07:00:27] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[07:00:28] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:00:28] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install7001.wikimedia.org
[07:00:40] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10916711 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `install7001.wikimedia.org` - install7001.wikimedia.org (**PA...
[07:01:36] <wikibugs>	 (PS3) Brouberol: airflow: set a low priority on all airflow task pods [deployment-charts] - https://gerrit.wikimedia.org/r/1156812 (https://phabricator.wikimedia.org/T395107)
[07:02:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job squid in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:02:42] <wikibugs>	 (PS4) Brouberol: airflow: set a low priority on all airflow task pods [deployment-charts] - https://gerrit.wikimedia.org/r/1156812 (https://phabricator.wikimedia.org/T395107)
[07:05:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2204 with weight 0 T396976', diff saved to https://phabricator.wikimedia.org/P77970 and previous config saved to /var/cache/conftool/dbconfig/20250616-070524-root.json
[07:05:29] <stashbot>	 T396976: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T396976
[07:05:42] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 T396976
[07:05:42] <wikibugs>	 (CR) Brouberol: [C:+1] monitoring services: add migration task T384214 to instances [puppet] - https://gerrit.wikimedia.org/r/1155619 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli)
[07:06:17] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2204 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1159072 (https://phabricator.wikimedia.org/T396976) (owner: Gerrit maintenance bot)
[07:06:29] <marostegui>	 moritzm: ok to merge?
[07:07:36] <WMDE-Fisch>	 Anyone here that can deploy? :-)
[07:08:59] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: set a low priority on all airflow task pods [deployment-charts] - https://gerrit.wikimedia.org/r/1156812 (https://phabricator.wikimedia.org/T395107) (owner: Brouberol)
[07:09:11] <WMDE-Fisch>	 Seems I'm not allowed access to that new tool :-/
[07:09:52] <moritzm>	 marostegui: sorry, yes please
[07:10:51] <marostegui>	 Doing it moritzm 
[07:11:07] <moritzm>	 thx
[07:11:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77971 and previous config saved to /var/cache/conftool/dbconfig/20250616-071132-marostegui.json
[07:13:03] <Amir1>	 WMDE-Fisch: you have someone to deploy the patch?
[07:13:20] <Amir1>	 I'll deploy it now
[07:13:33] <WMDE-Fisch>	 Nope, just wanted to poke adam but if you got a sec that would be nice
[07:13:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1156741 (https://phabricator.wikimedia.org/T395871) (owner: Svantje Lilienthal)
[07:14:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:14:29] <wikibugs>	 (Merged) jenkins-bot: Enable sub-referencing on test wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1156741 (https://phabricator.wikimedia.org/T395871) (owner: Svantje Lilienthal)
[07:14:52] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1156741|Enable sub-referencing on test wiki (T395871)]]
[07:14:56] <stashbot>	 T395871: Enable sub-referencing on test wiki - https://phabricator.wikimedia.org/T395871
[07:15:03] <WMDE-Fisch>	 Amir1: Thx! Any idea why I'm not granted access to that deployment interface with my account although I've got deployment rights? 🤔
[07:15:26] <Amir1>	 I have no idea, I'd say poke Tyler
[07:15:41] <anzx>	 o/
[07:17:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:19:05] <wikibugs>	 (PS1) Muehlenhoff: Update Cumin alias for Docker registry [puppet] - https://gerrit.wikimedia.org/r/1159291 (https://phabricator.wikimedia.org/T390251)
[07:20:43] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1159293 (https://phabricator.wikimedia.org/T395557)
[07:20:52] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:20:54] <moritzm>	 WMDE-Fisch: you can request access to the Spiderpig access group at https://idm.wikimedia.org/permissions/
[07:21:20] <anzx>	 Amir1: I have one patch to add, will add it to the calendar before you finish syncing the one above
[07:21:23] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:21:40] <Amir1>	 sure, this is going to be slow I think 
[07:22:07] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990#10916761 (MatthewVernon) Open→Resolved I've had a look, and this system looks good to me now (right number of filesystems of the right size, puppet happy, `swift-reco...
[07:23:39] <wikibugs>	 (CR) Stevemunene: [C:+2] hdfs: Add group 7_8 remove group 9_10 hosts from cluster [puppet] - https://gerrit.wikimedia.org/r/1159102 (https://phabricator.wikimedia.org/T390174) (owner: Stevemunene)
[07:24:01] <wikibugs>	 (PS2) Anzx: IP cap lift for event [mediawiki-config] - https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980)
[07:24:33] <wikibugs>	 (PS1) Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - https://gerrit.wikimedia.org/r/1159297 (https://phabricator.wikimedia.org/T396875)
[07:25:10] <wikibugs>	 (PS3) Anzx: IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980)
[07:25:23] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980) (owner: Anzx)
[07:25:38] <anzx>	 Amir1: added patch 
[07:26:36] <wikibugs>	 SRE-swift-storage, Data-Persistence: ms-fe2015 is suffering intermittent errors on port 80 - https://phabricator.wikimedia.org/T396573#10916777 (MatthewVernon) Thanks @Ladsgroup :)
[07:26:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T396130)', diff saved to https://phabricator.wikimedia.org/P77973 and previous config saved to /var/cache/conftool/dbconfig/20250616-072640-marostegui.json
[07:26:44] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[07:26:55] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[07:27:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T396130)', diff saved to https://phabricator.wikimedia.org/P77974 and previous config saved to /var/cache/conftool/dbconfig/20250616-072702-marostegui.json
[07:27:13] <WMDE-Fisch>	 Thx moritzm just requested access now :-)
[07:28:21] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[07:28:41] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1160.eqiad.wmnet
[07:28:59] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[07:29:05] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10916787 (ops-monitoring-bot) Host an-worker1160.eqiad.wmnet rebooted by stevemunene@cumin1002 w...
[07:29:11] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[07:29:22] <Amir1>	 the image is still being built
[07:29:28] <marostegui>	 !log Starting s2 codfw failover from db2207 to db2204 - T396976
[07:29:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:31] <stashbot>	 T396976: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T396976
[07:29:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2204 to s2 primary T396976', diff saved to https://phabricator.wikimedia.org/P77975 and previous config saved to /var/cache/conftool/dbconfig/20250616-072955-root.json
[07:30:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2207 T396976', diff saved to https://phabricator.wikimedia.org/P77976 and previous config saved to /var/cache/conftool/dbconfig/20250616-073045-marostegui.json
[07:31:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:31:38] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2207.codfw.wmnet with reason: Maintenance
[07:33:51] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/1159291 (https://phabricator.wikimedia.org/T390251) (owner: 10Muehlenhoff)
[07:34:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:35:38] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) (owner: 10KartikMistry)
[07:35:49] <logmsgbot>	 !log ladsgroup@deploy1003 lilients, ladsgroup: Backport for [[gerrit:1156741|Enable sub-referencing on test wiki (T395871)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:35:53] <stashbot>	 T395871: Enable sub-referencing on test wiki - https://phabricator.wikimedia.org/T395871
[07:36:11] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet)
[07:36:21] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Although not assigned to any host I see the role is still there. Is it obsolete and to be removed or there is some maintenance and will re" [puppet] - 10https://gerrit.wikimedia.org/r/1159293 (https://phabricator.wikimedia.org/T395557) (owner: 10Muehlenhoff)
[07:36:25] <wikibugs>	 (03PS1) 10Marostegui: db2207: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159302 (https://phabricator.wikimedia.org/T396976)
[07:37:20] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:37:47] <Amir1>	 WMDE-Fisch: it's on test servers
[07:37:50] <Amir1>	 please test
[07:37:58] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2207: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159302 (https://phabricator.wikimedia.org/T396976) (owner: 10Marostegui)
[07:38:03] <wikibugs>	 (03PS9) 10Brouberol: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[07:38:05] <WMDE-Fisch>	 k
[07:38:32] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Switch lvs7001 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1159303 (https://phabricator.wikimedia.org/T396561)
[07:40:37] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:40:52] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159303 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[07:40:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:40:59] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:41:03] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:41:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:41:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:41:15] <wikibugs>	 (03PS1) 10Muehlenhoff: offboard-user: Run a shallow git clone [puppet] - 10https://gerrit.wikimedia.org/r/1159328
[07:41:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove SSH key for aarora [puppet] - 10https://gerrit.wikimedia.org/r/1159329
[07:41:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1175 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:41:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:42:08] <WMDE-Fisch>	 Amir1: Hmm I don't get it on the test server. But it should be working. Might be caching involved...
[07:42:18] <WMDE-Fisch>	 Please go on.
[07:42:25] <logmsgbot>	 !log ladsgroup@deploy1003 lilients, ladsgroup: Continuing with sync
[07:42:29] <Amir1>	 okay
[07:43:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:43:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T396130)', diff saved to https://phabricator.wikimedia.org/P77977 and previous config saved to /var/cache/conftool/dbconfig/20250616-074346-marostegui.json
[07:43:52] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[07:44:07] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1160.eqiad.wmnet
[07:44:34] <wikibugs>	 (03PS10) 10Brouberol: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[07:45:21] <WMDE-Fisch>	 Ah now it's working on the test servers so all good. :-)
[07:45:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:45:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:45:38] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1161.eqiad.wmnet
[07:46:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10916837 (10ops-monitoring-bot) Host an-worker1161.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting...
[07:47:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] thanos: add memcached-based index caching to store [puppet] - 10https://gerrit.wikimedia.org/r/1156341 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi)
[07:47:25] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1159328 (owner: 10Muehlenhoff)
[07:47:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] thanos: trial store memcache on titan[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/1156342 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi)
[07:47:36] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10916847 (10Anton.Kokh) @KFrancis thank you, I just signed it!
[07:48:59] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:49:12] <wikibugs>	 (03PS1) 10Vgutierrez: Revert^3 "acmechief: Add pywikipedia.org to the cert list" [puppet] - 10https://gerrit.wikimedia.org/r/1159351
[07:49:40] <wikibugs>	 (03PS2) 10Muehlenhoff: offboard-user: Run a shallow git clone [puppet] - 10https://gerrit.wikimedia.org/r/1159328
[07:50:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:50:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] offboard-user: Run a shallow git clone [puppet] - 10https://gerrit.wikimedia.org/r/1159328 (owner: 10Muehlenhoff)
[07:51:11] <wikibugs>	 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10916852 (10Vgutierrez) 05Resolved→03Open acme-chief is still unable to issue certificates for this domain: `lang=json {   "identifier": {     "type": "dns",     "value": "pywikipe...
[07:51:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:51:19] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:51:31] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] Revert^3 "acmechief: Add pywikipedia.org to the cert list" [puppet] - 10https://gerrit.wikimedia.org/r/1159351 (owner: 10Vgutierrez)
[07:51:59] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:53:20] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1161.eqiad.wmnet
[07:53:54] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[07:54:01] <wikibugs>	 (03CR) 10Ladsgroup: "Yeah, I can run the script on all wikis to clean them up." [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[07:54:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:54:29] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[07:55:04] <logmsgbot>	 !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1149-1153].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 9 and 10
[07:55:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10916871 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1e2de4df-1e1e-43b0-ba8...
[07:55:17] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[07:55:44] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156741|Enable sub-referencing on test wiki (T395871)]] (duration: 40m 51s)
[07:55:47] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply
[07:55:47] <stashbot>	 T395871: Enable sub-referencing on test wiki - https://phabricator.wikimedia.org/T395871
[07:55:51] <logmsgbot>	 !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1175-1176].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 9 and 10
[07:55:53] <Amir1>	 WMDE-Fisch: deployed
[07:55:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10916875 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=95b11ba7-0512-4582-810...
[07:56:17] <godog>	 jouncebot: now and next
[07:56:17] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T0700)
[07:56:25] <anzx>	 Amir1: both of mine can sync at once
[07:56:29] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1162.eqiad.wmnet
[07:56:39] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[07:56:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10916880 (10ops-monitoring-bot) Host an-worker1162.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting...
[07:57:21] <Amir1>	 anzx: the "to" is wrong, it's in the past https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1159292/3/wmf-config/throttle.php
[07:57:25] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[07:57:26] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] mrwiki: add मसूदा (draft) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) (owner: 10Anzx)
[07:57:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) (owner: 10Anzx)
[07:58:02] <wikibugs>	 (03PS4) 10Anzx: IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980)
[07:58:03] <Amir1>	 I'm deploying the mrwiki patch, it should be much faster now
[07:58:31] <anzx>	 Amir1: thanks, fixed the date
[07:58:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77978 and previous config saved to /var/cache/conftool/dbconfig/20250616-075855-marostegui.json
[07:59:01] <wikibugs>	 (03Merged) 10jenkins-bot: mrwiki: add मसूदा (draft) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) (owner: 10Anzx)
[07:59:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263)
[07:59:16] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1156092|mrwiki: add मसूदा (draft) namespace (T396551)]]
[07:59:20] <stashbot>	 T396551: Add new namespace मसूदा on mrwiki (with specific edit/move group restrictions) - https://phabricator.wikimedia.org/T396551
[07:59:39] <wikibugs>	 (03PS2) 10Muehlenhoff: Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263)
[07:59:46] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[08:00:06] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:00:20] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[08:01:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:02:06] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:02:38] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:03:10] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[08:03:28] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "good job and godspeed!" [puppet] - 10https://gerrit.wikimedia.org/r/1159303 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[08:03:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: write SPDX header to stack config on save [puppet] - 10https://gerrit.wikimedia.org/r/1156781 (owner: 10Filippo Giunchedi)
[08:03:47] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup, anzx: Backport for [[gerrit:1156092|mrwiki: add मसूदा (draft) namespace (T396551)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:03:52] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[08:04:01] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1162.eqiad.wmnet
[08:04:26] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[08:04:48] <anzx>	 checking 
[08:04:58] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[08:04:58] <wikibugs>	 (03CR) 10Jelto: miscweb: add os-reports update mechanism (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[08:05:11] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:13] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1157.eqiad.wmnet
[08:05:33] <anzx>	 Amir1: namespace appears, ok to continue 
[08:05:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10916910 (10ops-monitoring-bot) Host an-worker1157.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting...
[08:05:38] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup, anzx: Continuing with sync
[08:06:18] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[08:06:48] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[08:08:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1175 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:09:53] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff)
[08:10:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:08] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1157.eqiad.wmnet
[08:13:26] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1158.eqiad.wmnet
[08:13:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10916918 (10ops-monitoring-bot) Host an-worker1158.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting...
[08:14:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77979 and previous config saved to /var/cache/conftool/dbconfig/20250616-081402-marostegui.json
[08:14:07] <wikibugs>	 (03PS3) 10Filippo Giunchedi: thanos: enforce series limit for sidecar [puppet] - 10https://gerrit.wikimedia.org/r/1155190 (https://phabricator.wikimedia.org/T394318)
[08:14:07] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thanos: enable tracing for rule [puppet] - 10https://gerrit.wikimedia.org/r/1159355 (https://phabricator.wikimedia.org/T394318)
[08:14:28] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156092|mrwiki: add मसूदा (draft) namespace (T396551)]] (duration: 15m 11s)
[08:14:32] <stashbot>	 T396551: Add new namespace मसूदा on mrwiki (with specific edit/move group restrictions) - https://phabricator.wikimedia.org/T396551
[08:14:41] <anzx>	 Amir1: please run namespaceDupes.php for mrwiki
[08:15:04] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: Switch canaries to 0.1% OpenTelemetry sampling [puppet] - 10https://gerrit.wikimedia.org/r/984814 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris)
[08:15:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Remove old docker_registry_ha hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/1156762 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris)
[08:16:26] <Amir1>	 I will
[08:16:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] "Sigh, missed that in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1154302, I only removed the old profile and didn't rename this o" [puppet] - 10https://gerrit.wikimedia.org/r/1159291 (https://phabricator.wikimedia.org/T390251) (owner: 10Muehlenhoff)
[08:17:09] <wikibugs>	 (03CR) 10Silvan Heintze: [C:03+1] "Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159297 (https://phabricator.wikimedia.org/T396875) (owner: 10Jakob)
[08:17:41] <wikibugs>	 (03PS3) 10Muehlenhoff: Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263)
[08:18:05] <wikibugs>	 (03PS4) 10Muehlenhoff: Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263)
[08:18:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980) (owner: 10Anzx)
[08:18:52] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:18:55] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff)
[08:19:07] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin depooling P{lvs7001.magru.wmnet} and A:liberica (T396561)
[08:19:11] <stashbot>	 T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561
[08:19:15] <wikibugs>	 (03Merged) 10jenkins-bot: IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980) (owner: 10Anzx)
[08:19:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77980 and previous config saved to /var/cache/conftool/dbconfig/20250616-081922-root.json
[08:19:30] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1159292|IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 (T396980)]]
[08:19:34] <stashbot>	 T396980: Lift IP cap on 2025-06-19 for Wikipedia workshop - cs.wikipedia - https://phabricator.wikimedia.org/T396980
[08:19:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff)
[08:20:30] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7001.magru.wmnet} and A:liberica (T396561)
[08:20:44] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1158.eqiad.wmnet
[08:21:20] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs7001 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1159303 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[08:21:21] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1159.eqiad.wmnet
[08:21:24] <logmsgbot>	 !log ladsgroup@deploy1003 anzx, ladsgroup: Backport for [[gerrit:1159292|IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 (T396980)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:21:43] <anzx>	 Amir1: nothing to test, ok to sync
[08:21:52] <Amir1>	 yup
[08:21:52] <vgutierrez>	 moritzm: ok to merge Reimage ganeti7003 with insetup role (368d9a4b17)?
[08:21:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10916951 (10ops-monitoring-bot) Host an-worker1159.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting...
[08:22:42] <logmsgbot>	 !log ladsgroup@deploy1003 anzx, ladsgroup: Continuing with sync
[08:23:24] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: Rename docker_registry_ha's occurrences to docker_registry [labs/private] - 10https://gerrit.wikimedia.org/r/1155601 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey)
[08:27:52] <wikibugs>	 (03CR) 10Jakob: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159297 (https://phabricator.wikimedia.org/T396875) (owner: 10Jakob)
[08:28:05] <wikibugs>	 (03PS1) 10Marostegui: db1254: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159357 (https://phabricator.wikimedia.org/T396549)
[08:28:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1254 T396549', diff saved to https://phabricator.wikimedia.org/P77981 and previous config saved to /var/cache/conftool/dbconfig/20250616-082841-marostegui.json
[08:28:46] <stashbot>	 marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[08:28:46] <stashbot>	 T396549: Migrate s2 to MariaDB 10.11 - https://phabricator.wikimedia.org/T396549
[08:28:47] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1254.eqiad.wmnet with reason: Maintenance
[08:28:48] <wikibugs>	 (03CR) 10Btullis: [C:03+1] mediawiki: define a dumps suspended CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[08:29:02] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1159.eqiad.wmnet
[08:29:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T396130)', diff saved to https://phabricator.wikimedia.org/P77982 and previous config saved to /var/cache/conftool/dbconfig/20250616-082910-marostegui.json
[08:29:14] <wikibugs>	 (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: allow the airflow service account to query CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156830 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[08:29:14] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1254: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159357 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui)
[08:29:14] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[08:29:25] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[08:29:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T396130)', diff saved to https://phabricator.wikimedia.org/P77983 and previous config saved to /var/cache/conftool/dbconfig/20250616-082933-marostegui.json
[08:29:44] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159292|IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 (T396980)]] (duration: 10m 13s)
[08:29:48] <stashbot>	 T396980: Lift IP cap on 2025-06-19 for Wikipedia workshop - cs.wikipedia - https://phabricator.wikimedia.org/T396980
[08:29:52] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159297 (https://phabricator.wikimedia.org/T396875) (owner: 10Jakob)
[08:30:00] <anzx>	 Amir1: Thanks for deploying & please run `mwscript-k8s --comment='T396980' --follow resetAuthenticationThrottle.php --wiki=cswiki --signup --ip 78.128.191.240` https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold
[08:30:20] <logmsgbot>	 !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[08:30:30] <Amir1>	 sure
[08:30:35] <logmsgbot>	 !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[08:31:06] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:31:10] <Amir1>	 done
[08:31:22] <logmsgbot>	 !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[08:31:25] <anzx>	 thanks
[08:31:27] <logmsgbot>	 !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs7001.magru.wmnet with reason: switching to katran
[08:31:38] <logmsgbot>	 !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[08:32:06] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:32:17] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10917019 (10MoritzMuehlenhoff)
[08:32:23] <Amir1>	 https://www.irccloud.com/pastebin/8FQ8lWSz/
[08:32:28] <Amir1>	 mrwiki
[08:32:34] <Amir1>	 anzx: ^
[08:33:01] <anzx>	 thanks
[08:33:04] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:34:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1254 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77984 and previous config saved to /var/cache/conftool/dbconfig/20250616-083419-root.json
[08:34:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77985 and previous config saved to /var/cache/conftool/dbconfig/20250616-083428-root.json
[08:35:22] <logmsgbot>	 !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[08:35:38] <logmsgbot>	 !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[08:36:05] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti7003.magru.wmnet with OS bookworm
[08:37:04] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:38:22] <wikibugs>	 (03PS1) 10Majavah: policies: Rename cr-labs -> cr-cloud-hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1159360
[08:39:45] <wikibugs>	 (03PS4) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824)
[08:40:52] <wikibugs>	 (03Abandoned) 10Ayounsi: Rename labs and cloud filters [homer/public] - 10https://gerrit.wikimedia.org/r/767476 (owner: 10Ayounsi)
[08:42:14] <wikibugs>	 (03PS5) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824)
[08:43:48] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/1159360 (owner: 10Majavah)
[08:43:50] <icinga-wm>	 RECOVERY - Disk space on stat1011 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1011&var-datasource=eqiad+prometheus/ops
[08:44:06] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner1003.eqiad.wmnet with OS bookworm
[08:44:49] <wikibugs>	 (03CR) 10Majavah: [C:03+2] policies: Rename cr-labs -> cr-cloud-hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1159360 (owner: 10Majavah)
[08:45:22] <wikibugs>	 (03Merged) 10jenkins-bot: policies: Rename cr-labs -> cr-cloud-hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1159360 (owner: 10Majavah)
[08:46:43] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli)
[08:47:24] <wikibugs>	 (03CR) 10Aqu: [C:03+1] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[08:48:37] <taavi>	 !log cr policy: rename cr-labs to cr-cloud-hosts (https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1159360)
[08:48:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T396130)', diff saved to https://phabricator.wikimedia.org/P77986 and previous config saved to /var/cache/conftool/dbconfig/20250616-084907-marostegui.json
[08:49:11] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[08:49:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1254 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77987 and previous config saved to /var/cache/conftool/dbconfig/20250616-084925-root.json
[08:49:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77988 and previous config saved to /var/cache/conftool/dbconfig/20250616-084934-root.json
[08:50:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:51:37] <zabe>	 !log zabe@deploy1003:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php wikidatawiki --delete /home/zabe/text_table_cleanup/wikidatawiki --sleep 0.5 # T183490
[08:51:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:42] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[08:54:33] <vgutierrez>	 !log depooling ncredir7003
[08:54:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:40] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7003.magru.wmnet with reason: host reimage
[08:58:57] <vgutierrez>	 !log repool ncredir7003
[08:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:27] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage
[09:00:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Add db1252', diff saved to https://phabricator.wikimedia.org/P77989 and previous config saved to /var/cache/conftool/dbconfig/20250616-090058-fceratto.json
[09:01:38] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7003.magru.wmnet with reason: host reimage
[09:02:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10917315 (10Stevemunene) 05Open→03Resolved Hosts are back online rejoining the cluster {F62348242}
[09:03:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10917321 (10Stevemunene)
[09:04:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10917325 (10Stevemunene) 05Open→03Resolved Hosts are back online and rejoining the cluster {F62348266}
[09:04:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77990 and previous config saved to /var/cache/conftool/dbconfig/20250616-090414-marostegui.json
[09:04:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10917332 (10Stevemunene)
[09:04:26] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:04:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1254 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77991 and previous config saved to /var/cache/conftool/dbconfig/20250616-090431-root.json
[09:04:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage
[09:04:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77992 and previous config saved to /var/cache/conftool/dbconfig/20250616-090439-root.json
[09:06:32] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004 (10WMDE-leszek) 03NEW
[09:07:24] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10917359 (10WMDE-leszek) I figure @AndyRussG_volunteer also needs to be added to `nda` LDAP group. I believe their account has been there, so maybe there's still a trace of N...
[09:10:20] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10917362 (10WMDE-leszek) Me having opened this request does indicate that I approve this request on WMDE's end.
[09:10:47] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1252* slowly with 10 steps - Pooling in
[09:11:36] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10917364 (10WMDE-leszek)
[09:12:33] <logmsgbot>	 !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1252* slowly with 10 steps - Pooling in
[09:14:59] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1252* slowly with 10 steps - Pooling in
[09:18:24] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Repool lvs7001 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1159374 (https://phabricator.wikimedia.org/T396561)
[09:18:46] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159374 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[09:19:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77995 and previous config saved to /var/cache/conftool/dbconfig/20250616-091921-marostegui.json
[09:19:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1254 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77996 and previous config saved to /var/cache/conftool/dbconfig/20250616-091936-root.json
[09:20:54] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[09:23:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1003.eqiad.wmnet with OS bookworm
[09:23:27] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7003.magru.wmnet with OS bookworm
[09:26:11] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs7001 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1159374 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez)
[09:26:51] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs7001.magru.wmnet
[09:26:52] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs7001.magru.wmnet
[09:27:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hadoop: remove check_procs based alerts in favor of SystemdUnitFailed [puppet] - 10https://gerrit.wikimedia.org/r/1159385 (https://phabricator.wikimedia.org/T357099)
[09:30:59] <vgutierrez>	 !log repool lvs7001 using katran as forwarding plane - T396561
[09:31:03] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs7001.magru.wmnet} and A:liberica
[09:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:03] <stashbot>	 T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561
[09:31:21] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs7001.magru.wmnet} and A:liberica
[09:31:52] <zabe>	 !log zabe@deploy1003:~$ mwscript extensions/AbuseFilter/maintenance/MigrateESRefToAflTable.php wikidatawiki --deletedump /home/zabe/afl_text_table_deletedump/wikidatawiki --dump /home/zabe/afl_text_table_dump/wikidatawiki --sleep 0.5 # T381599
[09:31:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:56] <stashbot>	 T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599
[09:34:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T396130)', diff saved to https://phabricator.wikimedia.org/P77998 and previous config saved to /var/cache/conftool/dbconfig/20250616-093429-marostegui.json
[09:34:33] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[09:34:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1254 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77999 and previous config saved to /var/cache/conftool/dbconfig/20250616-093442-root.json
[09:34:44] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[09:34:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T396130)', diff saved to https://phabricator.wikimedia.org/P78000 and previous config saved to /var/cache/conftool/dbconfig/20250616-093451-marostegui.json
[09:36:24] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: move blackbox-exporter to log prober errors [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022)
[09:37:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> Toolforge seems to be using 0.26, but the metricsinfra servers are still on bullseye / 0.18.0+ds-3+b2." [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi)
[09:37:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: move blackbox-exporter to log prober errors [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi)
[09:39:03] <wikibugs>	 (03PS1) 10Zabe: wikidatawiki: Increase revision-slots cache back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159388 (https://phabricator.wikimedia.org/T183490)
[09:40:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: keep netbox-hiera updated [puppet] - 10https://gerrit.wikimedia.org/r/1159389
[09:41:53] <wikibugs>	 (03PS1) 10Ayounsi: Routed Ganeti: disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263)
[09:42:08] <zabe>	 jouncebot: nowandnext
[09:42:08] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 17 minute(s)
[09:42:08] <jouncebot>	 In 0 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1000)
[09:42:46] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Stop setting $wgPageLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1158804 (https://phabricator.wikimedia.org/T299947) (owner: 10Zabe)
[09:42:54] <wikibugs>	 (03CR) 10Zabe: [C:03+2] wikidatawiki: Increase revision-slots cache back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159388 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[09:44:09] <moritzm>	 !log remove magru01 in Netbox (all Ganeti nodes have been removed from it) T394263
[09:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:13] <stashbot>	 T394263: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263
[09:44:23] <wikibugs>	 (03Merged) 10jenkins-bot: Stop setting $wgPageLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1158804 (https://phabricator.wikimedia.org/T299947) (owner: 10Zabe)
[09:44:26] <wikibugs>	 (03Merged) 10jenkins-bot: wikidatawiki: Increase revision-slots cache back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159388 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe)
[09:45:07] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1159388|wikidatawiki: Increase revision-slots cache back to default (T183490)]], [[gerrit:1158804|Stop setting $wgPageLinksSchemaMigrationStage (T299947)]]
[09:45:12] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[09:45:12] <stashbot>	 T299947: Normalize pagelinks table - https://phabricator.wikimedia.org/T299947
[09:45:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove ganeti7003 - jmm@cumin2002"
[09:45:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove ganeti7003 - jmm@cumin2002"
[09:46:23] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: reimage: check for Monitoring::Host in puppetdb [cookbooks] - 10https://gerrit.wikimedia.org/r/1156264 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi)
[09:46:28] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: monitoring: add note about reimage cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1156265 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi)
[09:46:31] <wikibugs>	 (03CR) 10Vgutierrez: Routed Ganeti: disable rp_filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi)
[09:47:00] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1159388|wikidatawiki: Increase revision-slots cache back to default (T183490)]], [[gerrit:1158804|Stop setting $wgPageLinksSchemaMigrationStage (T299947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:47:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:49:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:51:00] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[09:51:00] <wikibugs>	 (03PS2) 10Ayounsi: Routed Ganeti: disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263)
[09:51:06] <wikibugs>	 (03CR) 10Ayounsi: Routed Ganeti: disable rp_filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi)
[09:51:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:51:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T396130)', diff saved to https://phabricator.wikimedia.org/P78002 and previous config saved to /var/cache/conftool/dbconfig/20250616-095135-marostegui.json
[09:51:39] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[09:53:12] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi)
[09:54:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:57:53] <wikibugs>	 (03PS3) 10Ayounsi: Routed Ganeti: disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263)
[09:57:54] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159388|wikidatawiki: Increase revision-slots cache back to default (T183490)]], [[gerrit:1158804|Stop setting $wgPageLinksSchemaMigrationStage (T299947)]] (duration: 12m 46s)
[09:57:58] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[09:57:59] <stashbot>	 T299947: Normalize pagelinks table - https://phabricator.wikimedia.org/T299947
[09:58:35] <wikibugs>	 (03PS4) 10Ayounsi: Routed Ganeti: disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263)
[09:58:47] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi)
[09:59:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove magru01 cluster - jmm@cumin2002"
[09:59:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove magru01 cluster - jmm@cumin2002"
[09:59:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1000)
[10:02:47] <wikibugs>	 (03PS1) 10Marostegui: db1246: Install MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159397 (https://phabricator.wikimedia.org/T396549)
[10:03:15] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1246: Install MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159397 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui)
[10:04:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi)
[10:04:50] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: read-only plugin orchestration in failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440)
[10:05:14] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance
[10:05:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T395241)', diff saved to https://phabricator.wikimedia.org/P78005 and previous config saved to /var/cache/conftool/dbconfig/20250616-100521-fceratto.json
[10:06:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P78006 and previous config saved to /var/cache/conftool/dbconfig/20250616-100642-marostegui.json
[10:07:42] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Routed Ganeti: disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi)
[10:08:34] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] rest-gateway: route html<->wikitext transforms to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156811 (https://phabricator.wikimedia.org/T396856) (owner: 10Hnowlan)
[10:10:37] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: route html<->wikitext transforms to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156811 (https://phabricator.wikimedia.org/T396856) (owner: 10Hnowlan)
[10:11:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Add ganeti7003 to the routed Ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159398 (https://phabricator.wikimedia.org/T394263)
[10:12:00] <wikibugs>	 (03PS5) 10Jgiannelos: changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418)
[10:12:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T395241)', diff saved to https://phabricator.wikimedia.org/P78008 and previous config saved to /var/cache/conftool/dbconfig/20250616-101244-fceratto.json
[10:15:17] <wikibugs>	 (03PS1) 10Filippo Giunchedi: bird: remove check_anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842)
[10:15:55] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:16:04] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:16:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Might not be needed after all, see Iff23cb1941ca3b0" [puppet] - 10https://gerrit.wikimedia.org/r/1155142 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli)
[10:18:06] <wikibugs>	 (03CR) 10Jgiannelos: changeprop: Remove rules related to parsoid (RB sunset) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos)
[10:20:06] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos)
[10:21:25] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::cloudgw: Install natlog on cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734)
[10:21:31] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos)
[10:21:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P78010 and previous config saved to /var/cache/conftool/dbconfig/20250616-102150-marostegui.json
[10:22:52] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[10:23:27] <wikibugs>	 (03CR) 10Gmodena: [C:03+2] dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156821 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena)
[10:23:29] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos)
[10:24:33] <wikibugs>	 (03PS2) 10Majavah: P:wmcs::cloudgw: Install natlog on cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734)
[10:25:12] <wikibugs>	 (03Merged) 10jenkins-bot: dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156821 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena)
[10:25:12] <moritzm>	 !log installing qemu security updates
[10:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:01] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[10:26:34] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[10:27:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P78011 and previous config saved to /var/cache/conftool/dbconfig/20250616-102752-fceratto.json
[10:28:34] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[10:28:37] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[10:28:39] <wikibugs>	 (03PS3) 10Majavah: P:wmcs::cloudgw: Install natlog on cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734)
[10:28:43] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[10:28:46] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[10:29:04] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[10:29:16] <claime>	 !log Manual run job.batch/update-special-pages-s8-manual-202506161028 started - T396977
[10:29:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:20] <stashbot>	 T396977: MediaWiki periodic job update-special-pages-s8 failed - https://phabricator.wikimedia.org/T396977
[10:29:34] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[10:29:36] <wikibugs>	 (03PS1) 10Marostegui: db1229: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159403 (https://phabricator.wikimedia.org/T396549)
[10:29:46] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[10:29:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1229 T396549', diff saved to https://phabricator.wikimedia.org/P78012 and previous config saved to /var/cache/conftool/dbconfig/20250616-102949-marostegui.json
[10:29:52] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[10:29:54] <stashbot>	 T396549: Migrate s2 to MariaDB 10.11 - https://phabricator.wikimedia.org/T396549
[10:29:57] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[10:30:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove SSH key for aarora [puppet] - 10https://gerrit.wikimedia.org/r/1159329 (owner: 10Muehlenhoff)
[10:30:15] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[10:30:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[10:31:37] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1229: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159403 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui)
[10:31:39] <wikibugs>	 (03PS1) 10Muehlenhoff: cross-validate-accounts: Run a shallow git clone [puppet] - 10https://gerrit.wikimedia.org/r/1159404
[10:31:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:32:14] <wikibugs>	 (03CR) 10David Caro: "This broke cloud instances:" [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli)
[10:34:26] <wikibugs>	 (03PS1) 10David Caro: cloud.yaml: add missing profile::puppetdb::pdb_resource_exporter_config [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442)
[10:34:50] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[10:34:53] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[10:35:06] <wikibugs>	 (03CR) 10David Caro: "Fix here https://gerrit.wikimedia.org/r/c/operations/puppet/+/1159406" [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli)
[10:36:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T396130)', diff saved to https://phabricator.wikimedia.org/P78014 and previous config saved to /var/cache/conftool/dbconfig/20250616-103657-marostegui.json
[10:37:02] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[10:37:13] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[10:37:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T396130)', diff saved to https://phabricator.wikimedia.org/P78015 and previous config saved to /var/cache/conftool/dbconfig/20250616-103720-marostegui.json
[10:37:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] cloud.yaml: add missing profile::puppetdb::pdb_resource_exporter_config [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) (owner: 10David Caro)
[10:43:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P78016 and previous config saved to /var/cache/conftool/dbconfig/20250616-104259-fceratto.json
[10:43:23] <wikibugs>	 (03PS2) 10David Caro: cloud.yaml: add missing profile::puppetdb::pdb_resource_exporter_config [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442)
[10:44:16] <wikibugs>	 (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) (owner: 10David Caro)
[10:44:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) (owner: 10David Caro)
[10:47:13] <wikibugs>	 (03PS3) 10David Caro: puppetdb: allow making the exporter config null [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442)
[10:48:54] <wikibugs>	 (03PS4) 10David Caro: puppetdb: allow making the exporter config null [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442)
[10:49:02] <wikibugs>	 (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) (owner: 10David Caro)
[10:51:53] <wikibugs>	 (03CR) 10David Caro: [C:03+2] puppetdb: allow making the exporter config null [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) (owner: 10David Caro)
[10:53:50] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10917731 (10Clement_Goubert)
[10:53:50] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:53:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T396130)', diff saved to https://phabricator.wikimedia.org/P78018 and previous config saved to /var/cache/conftool/dbconfig/20250616-105353-marostegui.json
[10:53:56] <hnowlan>	 jouncebot: nowandnext
[10:53:56] <jouncebot>	 For the next 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1000)
[10:53:56] <jouncebot>	 In 2 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1300)
[10:53:58] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[10:54:04] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:55:04] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:55:12] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:56:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78019 and previous config saved to /var/cache/conftool/dbconfig/20250616-105621-root.json
[10:57:42] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:58:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T395241)', diff saved to https://phabricator.wikimedia.org/P78020 and previous config saved to /var/cache/conftool/dbconfig/20250616-105806-fceratto.json
[11:01:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add ganeti7003 to the routed Ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159398 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff)
[11:08:07] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet
[11:09:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P78022 and previous config saved to /var/cache/conftool/dbconfig/20250616-110901-marostegui.json
[11:11:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78023 and previous config saved to /var/cache/conftool/dbconfig/20250616-111127-root.json
[11:14:56] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-backup2002.codfw.wmnet with reason: Maintenance and reboot
[11:15:05] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1159404 (owner: 10Muehlenhoff)
[11:15:54] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1252* slowly with 10 steps - Pooling in
[11:19:55] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet
[11:21:56] <wikibugs>	 (03CR) 10Brouberol: "Adding a core SRE to the patch as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[11:24:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P78026 and previous config saved to /var/cache/conftool/dbconfig/20250616-112408-marostegui.json
[11:26:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78027 and previous config saved to /var/cache/conftool/dbconfig/20250616-112633-root.json
[11:30:50] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] trafficserver: migrate html<->wikitext transforms out of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1156813 (https://phabricator.wikimedia.org/T396856) (owner: 10Hnowlan)
[11:34:01] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[11:34:21] <wikibugs>	 (03PS1) 10Cathal Mooney: Rename cloud-in to cloud-vrf-in [homer/public] - 10https://gerrit.wikimedia.org/r/1159415
[11:37:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner1004.eqiad.wmnet with OS bookworm
[11:38:45] <wikibugs>	 10ops-magru: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T390258#10917858 (10cmooney) >>! In T390258#10911945, @ayounsi wrote: > Looking at Mar 28 2025, there seems like there was some small events, but nothing worth investigating, we can close that for now.  Yep agreed.  >>! In T390258#10910...
[11:38:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Apply ncredir role to ncredir7004 [puppet] - 10https://gerrit.wikimedia.org/r/1156814 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff)
[11:39:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T396130)', diff saved to https://phabricator.wikimedia.org/P78028 and previous config saved to /var/cache/conftool/dbconfig/20250616-113915-marostegui.json
[11:39:20] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[11:39:31] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[11:39:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T396130)', diff saved to https://phabricator.wikimedia.org/P78029 and previous config saved to /var/cache/conftool/dbconfig/20250616-113938-marostegui.json
[11:40:34] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:41:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78030 and previous config saved to /var/cache/conftool/dbconfig/20250616-114138-root.json
[11:43:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:43:59] <wikibugs>	 (03CR) 10Majavah: [C:04-1] Rename cloud-in to cloud-vrf-in (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 (owner: 10Cathal Mooney)
[11:44:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] cross-validate-accounts: Run a shallow git clone [puppet] - 10https://gerrit.wikimedia.org/r/1159404 (owner: 10Muehlenhoff)
[11:45:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mediawiki: define a dumps suspended CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[11:45:42] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: allow the airflow service account to query CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156830 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[11:50:07] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[11:50:13] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[11:50:14] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti7003.magru.wmnet to cluster magru03 and group B
[11:51:21] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7003.magru.wmnet to cluster magru03 and group B
[11:54:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage
[11:54:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T396130)', diff saved to https://phabricator.wikimedia.org/P78031 and previous config saved to /var/cache/conftool/dbconfig/20250616-115417-marostegui.json
[11:54:22] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[11:56:46] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7003.magru.wmnet to drbd
[11:57:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage
[11:57:19] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10917964 (10ops-monitoring-bot) VM durum7003.magru.wmnet switching disk type to drbd
[12:02:30] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10917977 (10Stevemunene)
[12:03:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10917978 (10Stevemunene)
[12:03:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10917979 (10Stevemunene)
[12:06:57] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7003.magru.wmnet to drbd
[12:07:13] <icinga-wm>	 PROBLEM - Host durum7003 is DOWN: PING CRITICAL - Packet loss = 100%
[12:08:13] <icinga-wm>	 RECOVERY - Host durum7003 is UP: PING OK - Packet loss = 0%, RTA = 115.64 ms
[12:09:13] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7003 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[12:09:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P78033 and previous config saved to /var/cache/conftool/dbconfig/20250616-120924-marostegui.json
[12:09:49] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum7003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[12:10:49] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum7003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[12:11:13] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7003 is OK: OK: UP (pid=2398) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[12:11:42] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Very slow data transfers during migrations affecting ganeti1047/ganeti1048 - https://phabricator.wikimedia.org/T397025 (10MoritzMuehlenhoff) 03NEW
[12:11:49] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Very slow data transfers during migrations affecting ganeti1047/ganeti1048 - https://phabricator.wikimedia.org/T397025#10917999 (10MoritzMuehlenhoff) p:05Triage→03Medium
[12:15:27] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1004.eqiad.wmnet with OS bookworm
[12:17:41] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7003.wikimedia.org to drbd
[12:18:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2080.codfw.wmnet with OS bullseye
[12:18:14] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#10918024 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2080.codfw.wm...
[12:18:19] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye
[12:18:32] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2080
[12:18:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[12:19:59] <wikibugs>	 (03PS1) 10Andrew Bogott: codfw1dev ceph: cloudcephmons -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1159425 (https://phabricator.wikimedia.org/T309789)
[12:20:41] <logmsgbot>	 jmm@cumin1003 changedisk (PID 1825196) is awaiting input
[12:22:49] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] pontoon: keep netbox-hiera updated [puppet] - 10https://gerrit.wikimedia.org/r/1159389 (owner: 10Filippo Giunchedi)
[12:24:21] <mszabo>	 jouncebot: nowandnext
[12:24:21] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 35 minute(s)
[12:24:21] <jouncebot>	 In 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1300)
[12:24:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P78034 and previous config saved to /var/cache/conftool/dbconfig/20250616-122432-marostegui.json
[12:24:47] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10918046 (10ops-monitoring-bot) VM doh7003.wikimedia.org switching disk type to drbd
[12:24:52] <wikibugs>	 (03PS1) 10Máté Szabó: Add missing labels for email confirmation reminder preferences [extensions/Echo] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159438 (https://phabricator.wikimedia.org/T58074)
[12:24:58] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2080 - mvernon@cumin2002"
[12:25:03] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2080 - mvernon@cumin2002"
[12:25:03] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:25:04] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2080.codfw.wmnet 245.48.192.10.in-addr.arpa 5.4.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:25:07] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2080.codfw.wmnet 245.48.192.10.in-addr.arpa 5.4.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:25:08] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2080
[12:25:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Echo] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159438 (https://phabricator.wikimedia.org/T58074) (owner: 10Máté Szabó)
[12:25:21] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2080
[12:25:21] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2080
[12:25:47] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] zookeeper: onboard an-conf1006 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene)
[12:27:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: keep netbox-hiera updated [puppet] - 10https://gerrit.wikimedia.org/r/1159389 (owner: 10Filippo Giunchedi)
[12:30:13] <godog>	 jouncebot: now and next
[12:30:13] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 29 minute(s)
[12:30:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:34:12] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[12:34:14] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7003.wikimedia.org to drbd
[12:34:18] <icinga-wm>	 PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100%
[12:34:40] <icinga-wm>	 RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 115.64 ms
[12:34:50] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7003 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[12:35:18] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh7003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[12:35:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:37:45] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[12:37:50] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7003 is OK: OK: UP (pid=2336) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[12:38:18] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh7003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[12:39:37] <wikibugs>	 (03CR) 10Majavah: "Hmm, the rules in `alerts.git:team-traffic/anycast_healthchecker.yaml` are for traffic roles only so this is effectively removing alerting" [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi)
[12:39:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T396130)', diff saved to https://phabricator.wikimedia.org/P78035 and previous config saved to /var/cache/conftool/dbconfig/20250616-123939-marostegui.json
[12:39:45] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Advance snapshot dbbackups start time by 4 hours [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380)
[12:39:46] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[12:39:56] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[12:40:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T396130)', diff saved to https://phabricator.wikimedia.org/P78036 and previous config saved to /var/cache/conftool/dbconfig/20250616-124002-marostegui.json
[12:40:32] <wikibugs>	 (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380) (owner: 10Jcrespo)
[12:41:07] <wikibugs>	 (03CR) 10Elukey: [C:03+1] phabricator: expand support for Phabricator tasks (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154786 (owner: 10Volans)
[12:41:41] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Advance snapshot dbbackups start time by 4 hours [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380)
[12:42:57] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2080.codfw.wmnet with reason: host reimage
[12:45:15] <wikibugs>	 (03CR) 10Jcrespo: "Snapshots take ~11 hours to complete ATM." [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380) (owner: 10Jcrespo)
[12:46:42] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2080.codfw.wmnet with reason: host reimage
[12:48:00] <hnowlan>	 jouncebot: nowandnext
[12:48:00] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 11 minute(s)
[12:48:00] <jouncebot>	 In 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1300)
[12:50:11] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:50:55] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:51:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Good point, we can certainly extend/duplicate the alert to other ac-healthchecker users." [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi)
[12:51:43] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[12:52:13] <icinga-wm>	 PROBLEM - Zookeeper Server on an-conf1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[12:52:27] <jinxer-wm>	 FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:54:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T396130)', diff saved to https://phabricator.wikimedia.org/P78037 and previous config saved to /var/cache/conftool/dbconfig/20250616-125442-marostegui.json
[12:54:48] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[12:54:48] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Try subresource JS autologin on SUL3 domain first if configured [extensions/CentralAuth] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159444 (https://phabricator.wikimedia.org/T391284)
[12:54:53] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:54:55] <wikibugs>	 (03CR) 10Majavah: [C:03+1] "Mostly I want to be alerted when a service is unhealthy causing the announcement to be withdrawn. On a closer look the old monitoring didn" [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi)
[12:55:04] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Fix adding warnings to ParserOutput [extensions/TemplateStyles] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159446 (https://phabricator.wikimedia.org/T396768)
[12:55:14] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159444 (https://phabricator.wikimedia.org/T391284) (owner: 10Bartosz Dziewoński)
[12:55:23] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] Fix adding warnings to ParserOutput [extensions/TemplateStyles] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159446 (https://phabricator.wikimedia.org/T396768) (owner: 10Bartosz Dziewoński)
[12:55:23] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/TemplateStyles] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159446 (https://phabricator.wikimedia.org/T396768) (owner: 10Bartosz Dziewoński)
[12:55:27] <jynus>	 was there a brief network outage on eqiad C3? several hosts complained at the same time
[12:55:33] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[12:57:36] <wikibugs>	 (03CR) 10Btullis: [C:03+1] mediawiki: convert the dumps Job into a CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156831 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[12:57:49] <wikibugs>	 (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: drop the batch.Job.get rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156832 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[12:57:57] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:58:37] <wikibugs>	 (03PS2) 10Brouberol: mediawiki: convert the dumps Job into a CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156831 (https://phabricator.wikimedia.org/T389786)
[12:58:37] <wikibugs>	 (03PS2) 10Brouberol: mediawiki-dumps-legacy: drop the batch.Job.get rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156832 (https://phabricator.wikimedia.org/T389786)
[12:58:43] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[12:58:45] <jynus>	 webproxy is timing me out
[12:58:47] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:58:58] <jynus>	 things are weird right now, network-wise
[12:59:12] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of bast7002.wikimedia.org to drbd
[12:59:27] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10918166 (10ops-monitoring-bot) VM bast7002.wikimedia.org switching disk type to drbd
[12:59:33] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1300).
[13:00:04] <jouncebot>	 phuedx, Tchanders, Mvolz, mszabo, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] <phuedx>	 o/
[13:00:13] <jynus>	 topranks: I think (potentially on C3, but not 100% sure) something is causing network downtimes
[13:00:14] <Lucas_WMDE>	 I can’t deploy at the moment, I’m in a meeting
[13:00:19] <Lucas_WMDE>	 might be able to deploy in 30 minutes if nobody else is around
[13:00:20] <jynus>	 C3 on eqiad
[13:00:29] <topranks>	 jynus: ok 
[13:00:48] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] temp accounts: Enable temp account creation on three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155683 (https://phabricator.wikimedia.org/T396464) (owner: 10Tchanders)
[13:00:52] <MatmaRex>	 hi
[13:01:36] <Tchanders>	 o/
[13:02:12] <topranks>	 jynus: what do you suspect is happening?
[13:02:27] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:02:28] <jynus>	 communication within our cluster seems flaky at times
[13:02:32] <topranks>	 "several hosts complained at the same time"
[13:02:48] <topranks>	 only in that rack?
[13:02:49] <jynus>	 multiple times I got "Could not connect to webproxy.eqiad.wmnet"
[13:02:53] <jynus>	 topranks: mostly
[13:03:03] <jynus>	 that's why I am not 100% sure
[13:03:04] <mszabo>	 o/
[13:03:06] <Tchanders>	 I can start deploying some of these patches, might not have time for all of them
[13:03:19] <phuedx>	 I can do the config changes and backports as two separate deploys using SpiderPig?
[13:03:19] <mszabo>	 I can self-service if I don't fit into the window
[13:03:34] <jynus>	 topranks: let's say I observed errors only on C3, but I cannot say it wasn't something else too
[13:03:38] <phuedx>	 Tchanders beat me to it :)
[13:03:48] <Tchanders>	 phuedx: Go for it!
[13:04:09] <XioNoX>	 !log disable puppet on all hosts using the bird puppet module for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052109
[13:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:25] <phuedx>	 Mvolz: yt?
[13:04:49] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Bird: use the "interface" config option for v6 peers [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi)
[13:04:51] <topranks>	 jynus: ok, to be specific wikikube-workers is it?
[13:04:54] <wikibugs>	 (03PS1) 10Stevemunene: add an-conf1006 to the list of analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/1159451 (https://phabricator.wikimedia.org/T374922)
[13:04:56] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2080.codfw.wmnet with OS bullseye
[13:05:05] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#10918197 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2080.codfw.wmnet...
[13:05:12] <Mvolz>	 phuedx: yup
[13:05:20] <jynus>	 "connection timed out" from es1032
[13:06:27] <jynus>	 my guess is the other hosts that complained (an-conf1006 or where I saw packets being lost: db1150) had the same network issue
[13:06:33] <Mvolz>	 I don't have permissions to +2 the config repo, but I can do the deploy itself myself (with hand holding).
[13:06:38] <phuedx>	 Tchanders, Mvolz: I _think_ I can bundle our config changes into one deploy to reduce time. They're all completely unrelated
[13:07:06] <Mvolz>	 I'm okay with that but if it goes wrong you'll have to roll back the whole thing
[13:07:16] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] add an-conf1006 to the list of analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/1159451 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene)
[13:07:22] <Tchanders>	 phuedx: That sounds good
[13:07:25] <wikibugs>	 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10918209 (10Vgutierrez) >>! In T388809#10893993, @siebrand wrote: > DNS NS records updated, and now pointing to Wikimedia.  we need DNSSEC disabled on the registrar to be able to handl...
[13:07:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156872 (https://phabricator.wikimedia.org/T395692) (owner: 10Phuedx)
[13:07:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders)
[13:07:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153307 (https://phabricator.wikimedia.org/T395933) (owner: 10Dreamy Jazz)
[13:07:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz)
[13:07:44] <topranks>	 jynus: ok running some tests now from those hosts to see if I can find anything
[13:08:18] <wikibugs>	 (03Merged) 10jenkins-bot: ext-EventStreamConfig: Update product_metrics.web_base stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156872 (https://phabricator.wikimedia.org/T395692) (owner: 10Phuedx)
[13:08:24] <wikibugs>	 (03Merged) 10jenkins-bot: Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders)
[13:08:26] <wikibugs>	 (03Merged) 10jenkins-bot: Enable temporary accounts onboarding dialog on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153307 (https://phabricator.wikimedia.org/T395933) (owner: 10Dreamy Jazz)
[13:08:26] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] add an-conf1006 to the list of analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/1159451 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene)
[13:08:28] <wikibugs>	 (03Merged) 10jenkins-bot: Change citoid config for test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz)
[13:08:43] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org [reason: bird testing CR 1052109]
[13:08:43] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1156872|ext-EventStreamConfig: Update product_metrics.web_base stream (T395692)]], [[gerrit:1127960|Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group (T376315)]], [[gerrit:1153307|Enable temporary accounts onboarding dialog on WMF wikis (T395933)]], [[gerrit:1139808|Change citoid config for test wiki (T361576)]]
[13:08:54] <stashbot>	 T395692: Add performer_pageview_id contextual attribute to web base stream - https://phabricator.wikimedia.org/T395692
[13:08:54] <stashbot>	 T376315: Control access to global-temporary-account-viewer group on WMF wikis automatically - https://phabricator.wikimedia.org/T376315
[13:08:54] <stashbot>	 T395933: Enable the temporary accounts onboarding dialog on WMF wikis - https://phabricator.wikimedia.org/T395933
[13:08:54] <stashbot>	 T361576: Switch from restbase to rest-gateway for Citoid - https://phabricator.wikimedia.org/T361576
[13:09:47] <jynus>	 topranks: it seems not to be ongoing, so maybe someone just started an overly fast data transfer from that rack
[13:09:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P78039 and previous config saved to /var/cache/conftool/dbconfig/20250616-130950-marostegui.json
[13:09:56] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mediawiki: convert the dumps Job into a CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156831 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[13:09:57] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[13:09:58] <Mvolz>	 which server do i pick for testing? (when the time comes) I forget :)
[13:10:02] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: drop the batch.Job.get rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156832 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[13:10:22] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloudgw: Install natlog on cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[13:10:30] <topranks>	 jynus: yeah I didn't spot anything in graphs yet, I'll start looking at logs now shortly once I verify things look ok right now 
[13:10:38] <topranks>	 do you have an approximate timestamp you noticed the problems?
[13:10:40] <logmsgbot>	 !log phuedx@deploy1003 phuedx, mvolz, dreamyjazz, tchanders: Backport for [[gerrit:1156872|ext-EventStreamConfig: Update product_metrics.web_base stream (T395692)]], [[gerrit:1127960|Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group (T376315)]], [[gerrit:1153307|Enable temporary accounts onboarding dialog on WMF wikis (T395933)]], [[gerrit:1139808|Change citoid config for test wiki (T361576)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:11:08] <phuedx>	 Tchanders, Mvolz: Please test your changes and report back
[13:11:11] <wikibugs>	 (03CR) 10Effie Mouzeli: "That is all correct!" [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[13:11:20] <jynus>	 topranks: I saw a small increase in TCP retransmits, but nothing out of the ordinary: https://grafana.wikimedia.org/goto/-EQm46YNg?orgId=1
[13:11:20] <wikibugs>	 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10918247 (10MatthewVernon) Quick question: I'm concerned about the rather vague timeline for deleting `tegola-swift-eqiad-v...
[13:11:26] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org [reason: tested CR 1052109]
[13:11:56] <Tchanders>	 phuedx: Looks good to me
[13:12:27] <MatmaRex>	 fyi, i can't deploy my own changes, i would appreciate if someone could click the necessary buttons for me. they can go out together to save time.
[13:12:36] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: convert the dumps Job into a CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156831 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[13:12:52] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: drop the batch.Job.get rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156832 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol)
[13:14:00] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[13:14:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T395241)', diff saved to https://phabricator.wikimedia.org/P78040 and previous config saved to /var/cache/conftool/dbconfig/20250616-131410-fceratto.json
[13:14:21] <wikibugs>	 (03CR) 10Bking: [C:03+2] elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1138400 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking)
[13:15:01] <phuedx>	 I've tested my change. I'm seeing the correct contextual attributes coming through for enwiki and metawiki
[13:15:06] <phuedx>	 Mvolz?
[13:15:11] <Mvolz>	 phuedx: mine broke test wiki
[13:15:13] <Mvolz>	 no go
[13:15:13] <icinga-wm>	 RECOVERY - Zookeeper Server on an-conf1006 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[13:15:15] <Mvolz>	 sorry
[13:15:32] <Mvolz>	 at least it didn't break en wiki
[13:15:42] <wikibugs>	 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10918268 (10MatthewVernon) To put a little more context on that: ` root@thanos-fe1004:/home/mvernon# for b in $(swift list)...
[13:15:45] <phuedx>	 Noted. I'll stop this deployment
[13:15:59] <phuedx>	 Mvolz: Could you submit a revert?
[13:16:06] <logmsgbot>	 !log phuedx@deploy1003 Sync cancelled.
[13:16:32] <wikibugs>	 (03PS1) 10Mvolz: Revert "Change citoid config for test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159455
[13:16:39] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[13:16:44] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[13:16:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1026 T395241', diff saved to https://phabricator.wikimedia.org/P78041 and previous config saved to /var/cache/conftool/dbconfig/20250616-131646-marostegui.json
[13:17:10] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1026.eqiad.wmnet with reason: Maintenance
[13:17:11] <wikibugs>	 (03PS1) 10Majavah: natlog: Set required START=yes on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1159456 (https://phabricator.wikimedia.org/T273734)
[13:17:19] <phuedx>	 Tchanders, Mvolz: I'll sync that revert and that should get us to where we need to be
[13:17:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159455 (owner: 10Mvolz)
[13:18:00] <wikibugs>	 (03CR) 10Majavah: [C:03+2] natlog: Set required START=yes on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1159456 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[13:18:22] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast7002.wikimedia.org to drbd
[13:18:37] <icinga-wm>	 PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:18:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] codfw1dev ceph: cloudcephmons -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1159425 (https://phabricator.wikimedia.org/T309789) (owner: 10Andrew Bogott)
[13:18:53] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] swift: restore ms-be2080 to the rings post-reimage [puppet] - 10https://gerrit.wikimedia.org/r/1138832 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon)
[13:19:15] <icinga-wm>	 RECOVERY - Restbase root url on restbase1043 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[13:19:17] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Change citoid config for test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159455 (owner: 10Mvolz)
[13:19:23] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye
[13:19:29] <icinga-wm>	 RECOVERY - Restbase root url on restbase1041 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[13:19:30] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1159455|Revert "Change citoid config for test wiki"]]
[13:19:39] <icinga-wm>	 RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 115.43 ms
[13:19:59] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye
[13:20:44] <wikibugs>	 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10918314 (10elukey) @Jgiannelos @MSantos Hi! My understanding is that Tegola is now using `tegola-swift-codfw-v002` and `te...
[13:20:49] <Tchanders>	 phuedx: Thank you!
[13:21:16] <wikibugs>	 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10918325 (10MatthewVernon) ...so ideally, delete all the old data and then you can just go ahead (and maybe let's make a ro...
[13:21:24] <logmsgbot>	 !log phuedx@deploy1003 mvolz, phuedx: Backport for [[gerrit:1159455|Revert "Change citoid config for test wiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:21:29] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10918326 (10Jelto) >>! In T378922#10848027, @jcrespo wrote: > I am working on setting up the dedicated gitlab/gerrit storage host,...
[13:21:42] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10918327 (10herron) 05Open→03Resolved
[13:21:50] <wikibugs>	 (03CR) 10MVernon: [C:03+2] swift: restore ms-be2080 to the rings post-reimage [puppet] - 10https://gerrit.wikimedia.org/r/1138832 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon)
[13:21:55] <phuedx>	 Mvolz: Could you check that testwiki is OK now?
[13:22:04] <wikibugs>	 (03CR) 10Mvolz: "When we tried to deploy this it literally put "false" in the test wiki config... i.e. requests were made to https://test.wikipedia.org/w/f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz)
[13:22:04] <phuedx>	 Tchanders: Would you mind re-checking your changes? I'll do the same
[13:22:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78042 and previous config saved to /var/cache/conftool/dbconfig/20250616-132250-root.json
[13:23:26] <Tchanders>	 phuedx: Still looks good
[13:23:34] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-backup2002.codfw.wmnet: Renew puppet certificate - root@cumin1002
[13:23:37] <wikibugs>	 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10918333 (10MoritzMuehlenhoff) >>! In T396584#10918325, @MatthewVernon wrote: > ...so ideally, delete all the old data and...
[13:24:07] <phuedx>	 I've re-confirmed that the correct contextual attributes are appearing on enwiki and metawiki
[13:24:24] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#10918346 (10MatthewVernon)
[13:24:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner2002.codfw.wmnet with OS bookworm
[13:24:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T395241)', diff saved to https://phabricator.wikimedia.org/P78043 and previous config saved to /var/cache/conftool/dbconfig/20250616-132452-fceratto.json
[13:25:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P78044 and previous config saved to /var/cache/conftool/dbconfig/20250616-132504-marostegui.json
[13:25:19] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7003.magru.wmnet to drbd
[13:25:44] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10918362 (10ops-monitoring-bot) VM ncredir7003.magru.wmnet switching disk type to drbd
[13:26:13] <wikibugs>	 (03PS1) 10Majavah: natlog: Fix line matching [puppet] - 10https://gerrit.wikimedia.org/r/1159457 (https://phabricator.wikimedia.org/T273734)
[13:26:40] <Mvolz>	 phuedx: yeah it's okay now
[13:26:55] <phuedx>	 Thanks. Continuing
[13:27:07] <logmsgbot>	 !log phuedx@deploy1003 mvolz, phuedx: Continuing with sync
[13:27:37] <wikibugs>	 (03CR) 10Majavah: [C:03+2] natlog: Fix line matching [puppet] - 10https://gerrit.wikimedia.org/r/1159457 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah)
[13:30:44] <wikibugs>	 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10918393 (10herron) 05Open→03Stalled
[13:31:57] <sukhe>	 !log T362392
[13:31:59] <sukhe>	 ha
[13:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:02] <stashbot>	 T362392: Routed Ganeti: Add support for VM BGP - https://phabricator.wikimedia.org/T362392
[13:32:10] <sukhe>	 !log sudo cumin -b1 -s30 'A:dnsbox' "run-puppet-agent --enable 'CR1052109'": T362392
[13:32:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:58] <sukhe>	 !log sudo cumin -b1 -s30 'A:wikidough' "run-puppet-agent --enable 'CR1052109'": T362392
[13:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:53] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159455|Revert "Change citoid config for test wiki"]] (duration: 14m 22s)
[13:34:24] <phuedx>	 Tchanders, Mvolz: Done
[13:34:39] <wikibugs>	 (03PS1) 10Bking: elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1159459 (https://phabricator.wikimedia.org/T387569)
[13:35:05] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159459 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking)
[13:35:16] <wikibugs>	 (03PS2) 10NMW03: Set category collation to "uca-az" for Azerbaijani projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896)
[13:35:20] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7003.magru.wmnet to drbd
[13:35:22] <icinga-wm>	 PROBLEM - Host ncredir7003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:35:40] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[13:35:42] <icinga-wm>	 RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 115.82 ms
[13:36:47] <sukhe>	 XioNoX: ^ what was this flap about? I don't see how ncredir could be related to the bird change but perhaps the ganeti?
[13:36:48] <phuedx>	 mszabo: Do you want to self-service deploy after I've deployed MatmaRex's as a pair?
[13:37:00] <mszabo>	 sounds good
[13:37:27] <XioNoX>	 sukhe: moritzm switching the VMs back to drbd
[13:37:40] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) (owner: 10NMW03)
[13:37:50] <XioNoX>	 sukhe: see a few lines above "jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7003.magru.wmnet to drbd"
[13:37:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78045 and previous config saved to /var/cache/conftool/dbconfig/20250616-133755-root.json
[13:38:03] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus7002.magru.wmnet to drbd
[13:38:47] <moritzm>	 sukhe: these are inactive nodes, I'm switching the VMs to DRBD disk storage now that the routed Ganeti cluster has grown to three nodes
[13:38:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159444 (https://phabricator.wikimedia.org/T391284) (owner: 10Bartosz Dziewoński)
[13:38:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/TemplateStyles] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159446 (https://phabricator.wikimedia.org/T396768) (owner: 10Bartosz Dziewoński)
[13:38:51] <sukhe>	 ah thanks, sorry!
[13:38:52] <sukhe>	 missed that
[13:39:05] <phuedx>	 MatmaRex: I'll ping you when the changes are ready to test
[13:39:14] <MatmaRex>	 thanks
[13:39:19] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[13:39:24] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10918548 (10ops-monitoring-bot) VM prometheus7002.magru.wmnet switching disk type to drbd
[13:39:46] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10918550 (10jcrespo) >>! In T378922#10918326, @Jelto wrote: > Thank you for the work on dedicated hardware. In T378922#10804784 I t...
[13:39:56] <Nemoralis>	 I guess there will be no time left for my patch, right?
[13:40:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P78046 and previous config saved to /var/cache/conftool/dbconfig/20250616-134000-fceratto.json
[13:40:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T396130)', diff saved to https://phabricator.wikimedia.org/P78047 and previous config saved to /var/cache/conftool/dbconfig/20250616-134012-marostegui.json
[13:40:18] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[13:40:29] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[13:40:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T396130)', diff saved to https://phabricator.wikimedia.org/P78048 and previous config saved to /var/cache/conftool/dbconfig/20250616-134036-marostegui.json
[13:40:40] <wikibugs>	 (03Merged) 10jenkins-bot: Try subresource JS autologin on SUL3 domain first if configured [extensions/CentralAuth] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159444 (https://phabricator.wikimedia.org/T391284) (owner: 10Bartosz Dziewoński)
[13:41:01] <wikibugs>	 (03Merged) 10jenkins-bot: Fix adding warnings to ParserOutput [extensions/TemplateStyles] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159446 (https://phabricator.wikimedia.org/T396768) (owner: 10Bartosz Dziewoński)
[13:41:20] <logmsgbot>	 !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1159444|Try subresource JS autologin on SUL3 domain first if configured (T391284)]], [[gerrit:1159446|Fix adding warnings to ParserOutput (T396768)]]
[13:41:26] <stashbot>	 T391284: Swap order of central autologin lookup for loginwiki and shared domain - https://phabricator.wikimedia.org/T391284
[13:41:26] <stashbot>	 T396768: InvalidArgumentException: Scalar parameter must be a string, number, Stringable, or MessageSpecifier; got array - https://phabricator.wikimedia.org/T396768
[13:41:50] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10918563 (10herron)
[13:42:13] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2002.codfw.wmnet with reason: host reimage
[13:42:51] <wikibugs>	 (03PS1) 10Fabfur: cache,haproxy: remove old ipblock map files [puppet] - 10https://gerrit.wikimedia.org/r/1159461 (https://phabricator.wikimedia.org/T396621)
[13:43:13] <logmsgbot>	 !log phuedx@deploy1003 phuedx, matmarex: Backport for [[gerrit:1159444|Try subresource JS autologin on SUL3 domain first if configured (T391284)]], [[gerrit:1159446|Fix adding warnings to ParserOutput (T396768)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:43:33] <MatmaRex>	 testing
[13:43:45] <phuedx>	 Turns out you get pinged automatically :)
[13:45:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:45:45] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1159459 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking)
[13:45:51] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[13:45:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2002.codfw.wmnet with reason: host reimage
[13:45:56] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159461 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur)
[13:46:28] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] cirrussearch: return traffic to all DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155738 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[13:47:10] <sukhe>	 !log enable puppet and run agent on cephosd1001
[13:47:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:13] <MatmaRex>	 phuedx: both look good
[13:47:17] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10918595 (10herron) Hi @Anton.Kokh could you please add a unique SSH key he...
[13:47:23] <phuedx>	 MatmaRex: ACK. Continuing
[13:47:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:47:30] <logmsgbot>	 !log phuedx@deploy1003 phuedx, matmarex: Continuing with sync
[13:48:13] <wikibugs>	 (03PS4) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly determine latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767)
[13:48:24] <wikibugs>	 (03CR) 10Tchanders: "We have the go-ahead from product and comms." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155683 (https://phabricator.wikimedia.org/T396464) (owner: 10Tchanders)
[13:49:17] <wikibugs>	 (03CR) 10Volans: [C:03+2] phabricator: expand support for Phabricator tasks [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154786 (owner: 10Volans)
[13:49:29] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10918610 (10herron)
[13:51:07] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Switch esams to unified cert issued by GTS [puppet] - 10https://gerrit.wikimedia.org/r/1159462 (https://phabricator.wikimedia.org/T395131)
[13:51:20] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159462 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez)
[13:51:25] <wikibugs>	 (03PS5) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly determine latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767)
[13:51:45] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-availability magru - https://slo.wikimedia.org/?search=wdqs-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:52:25] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "!" [puppet] - 10https://gerrit.wikimedia.org/r/1159462 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez)
[13:52:53] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-backup2001.codfw.wmnet with reason: Maintenance and reboot
[13:53:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78049 and previous config saved to /var/cache/conftool/dbconfig/20250616-135301-root.json
[13:54:29] <logmsgbot>	 !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159444|Try subresource JS autologin on SUL3 domain first if configured (T391284)]], [[gerrit:1159446|Fix adding warnings to ParserOutput (T396768)]] (duration: 13m 09s)
[13:54:35] <stashbot>	 T391284: Swap order of central autologin lookup for loginwiki and shared domain - https://phabricator.wikimedia.org/T391284
[13:54:35] <stashbot>	 T396768: InvalidArgumentException: Scalar parameter must be a string, number, Stringable, or MessageSpecifier; got array - https://phabricator.wikimedia.org/T396768
[13:54:37] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Good spot, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1159293 (https://phabricator.wikimedia.org/T395557) (owner: 10Muehlenhoff)
[13:55:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P78050 and previous config saved to /var/cache/conftool/dbconfig/20250616-135507-fceratto.json
[13:55:11] <MatmaRex>	 thanks phuedx
[13:55:55] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Switch esams to unified cert issued by GTS [puppet] - 10https://gerrit.wikimedia.org/r/1159462 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez)
[13:56:04] <wikibugs>	 (03Merged) 10jenkins-bot: phabricator: expand support for Phabricator tasks [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154786 (owner: 10Volans)
[13:56:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T396130)', diff saved to https://phabricator.wikimedia.org/P78051 and previous config saved to /var/cache/conftool/dbconfig/20250616-135605-marostegui.json
[13:56:09] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[13:56:15] <mszabo>	 jouncebot: nowandnext
[13:56:16] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1300)
[13:56:16] <jouncebot>	 In 1 hour(s) and 33 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1530)
[13:56:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380) (owner: 10Jcrespo)
[13:56:43] <phuedx>	 mszabo: No change in the logs after those deployments. Over to you :)
[13:57:26] <mszabo>	 thanks!
[13:57:39] <vgutierrez>	 !log use Google Trust Services (GTS) unified TLS certificate on esams - T395131
[13:57:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:43] <stashbot>	 T395131: Replace Digicert TLS certs with Google Trust Services ones - https://phabricator.wikimedia.org/T395131
[13:58:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/Echo] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159438 (https://phabricator.wikimedia.org/T58074) (owner: 10Máté Szabó)
[13:58:52] <icinga-wm>	 PROBLEM - Host prometheus7002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:59:50] <Dreamy_Jazz>	 May need to revert the enabling of the onboarding dialog (Tchanders change)
[14:00:34] <wikibugs>	 (03PS6) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767)
[14:00:57] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[14:01:05] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#10918659 (10MoritzMuehlenhoff) p:05Triage→03Medium
[14:01:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[14:01:15] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Advance snapshot dbbackups start time by 4 hours [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380) (owner: 10Jcrespo)
[14:01:17] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10918663 (10herron) Hello!  Here are a few next-steps to complete before proceeding with access:  * @KFrancis could you please confirm NDA for @AndyRussG / @AndyRussG_volunte...
[14:01:29] <phuedx>	 Dreamy_Jazz: There's a large gap between now and the Wikimedia Portals Update. We've got a lot of room :)
[14:01:38] <Dreamy_Jazz>	 Sure. Thanks.
[14:01:47] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye
[14:01:49] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] Prepare for the Get Started notification experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958)
[14:01:53] <wikibugs>	 (03CR) 10Effie Mouzeli: "Yeah I agree, I do not have strong opinions either" [puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[14:02:05] <wikibugs>	 (03CR) 10Urbanecm: [C:04-2] "needs Kirsten's confirmation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm)
[14:02:42] <mszabo>	 Dreamy_Jazz: I already have a running deploy, but hopefully should be done soon
[14:04:39] <wikibugs>	 (03PS7) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767)
[14:05:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2002.codfw.wmnet with OS bookworm
[14:05:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[14:05:13] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:05:17] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:04-1] ores-extension: enable extension with revertrisk filter for the third batch of wikis (038 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis)
[14:05:38] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:05:49] <wikibugs>	 (03PS8) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767)
[14:06:18] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[14:07:11] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1156837 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French)
[14:07:14] <wikibugs>	 (03CR) 10Scott French: [C:03+2] alertmanager: update data-persistence-task phid [puppet] - 10https://gerrit.wikimedia.org/r/1156837 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French)
[14:08:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78052 and previous config saved to /var/cache/conftool/dbconfig/20250616-140807-root.json
[14:08:16] <Dreamy_Jazz>	 Worked out the issue with Tchanders change. It's an issue with a translation and we have decided to leave it enabled.
[14:09:51] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye
[14:10:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T395241)', diff saved to https://phabricator.wikimedia.org/P78053 and previous config saved to /var/cache/conftool/dbconfig/20250616-141016-fceratto.json
[14:10:18] <wikibugs>	 (03Merged) 10jenkins-bot: Add missing labels for email confirmation reminder preferences [extensions/Echo] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159438 (https://phabricator.wikimedia.org/T58074) (owner: 10Máté Szabó)
[14:10:36] <logmsgbot>	 !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1159438|Add missing labels for email confirmation reminder preferences (T58074)]]
[14:10:37] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[14:10:40] <stashbot>	 T58074: Echo: Generate periodic web notification to nudge users to confirm an unverified email address - https://phabricator.wikimedia.org/T58074
[14:10:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T395241)', diff saved to https://phabricator.wikimedia.org/P78054 and previous config saved to /var/cache/conftool/dbconfig/20250616-141044-fceratto.json
[14:11:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P78055 and previous config saved to /var/cache/conftool/dbconfig/20250616-141113-marostegui.json
[14:13:32] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:13:43] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:14:46] <wikibugs>	 (03PS1) 10Andrew Bogott: All cloudcephmon nodes to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1159469
[14:14:54] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Thanks! We have another bird-related change being rolled out today, just in case you were planning to merge it today. Tomorrow should be g" [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi)
[14:15:27] <wikibugs>	 (03PS2) 10Andrew Bogott: All cloudcephmon nodes to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1159469
[14:15:29] <wikibugs>	 (03CR) 10Bking: [C:03+2] elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1159459 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking)
[14:15:52] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159469 (owner: 10Andrew Bogott)
[14:17:04] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:17:16] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:17:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_magru01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:17:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_magru01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:10] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] All cloudcephmon nodes to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1159469 (owner: 10Andrew Bogott)
[14:18:49] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10918768 (10MoritzMuehlenhoff)
[14:19:27] <phuedx>	 Dreamy_Jazz: Noted. mszabo: Is the deployment still running?
[14:19:51] <vgutierrez>	 !log upload liberica 0.19 to apt.wm.o (bookworm-wikimedia) - T397036
[14:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:56] <stashbot>	 T397036: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036
[14:19:58] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159470
[14:20:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T395241)', diff saved to https://phabricator.wikimedia.org/P78056 and previous config saved to /var/cache/conftool/dbconfig/20250616-142017-fceratto.json
[14:21:21] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:21:33] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:22:10] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363#10918783 (10Andrew) 05Open→03Resolved a:03Andrew
[14:23:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10918790 (10Andrew) @Jhancock.wm any more blockers to this? There's no actual rush although finishing this will help me a bit with T309789 as it will allow...
[14:23:59] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:24:19] <wikibugs>	 (03PS9) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767)
[14:24:24] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[14:24:27] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10918794 (10herron) @WMDE-leszek is this for a contract with end-date, or for ongoing access?
[14:25:38] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[14:25:51] <wikibugs>	 (03PS1) 10Brouberol: airflow: hotfix, remove duplicated env variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159471 (https://phabricator.wikimedia.org/T369845)
[14:25:58] <wikibugs>	 (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154010 (owner: 10PipelineBot)
[14:26:00] <wikibugs>	 (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154800 (owner: 10PipelineBot)
[14:26:03] <wikibugs>	 (03Abandoned) 10Jforrester: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155703 (owner: 10PipelineBot)
[14:26:16] <wikibugs>	 (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155735 (owner: 10PipelineBot)
[14:26:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P78057 and previous config saved to /var/cache/conftool/dbconfig/20250616-142620-marostegui.json
[14:26:33] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [alerts] - 10https://gerrit.wikimedia.org/r/1156838 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French)
[14:26:47] <wikibugs>	 (03PS2) 10Brouberol: airflow: hotfix, remove duplicated env variables and volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159471 (https://phabricator.wikimedia.org/T369845)
[14:26:51] <wikibugs>	 (03CR) 10Scott French: [C:03+2] sessionstore-resources: move SessionStoreDiskSpaceRunwayTooLow to task [alerts] - 10https://gerrit.wikimedia.org/r/1156838 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French)
[14:28:10] <wikibugs>	 (03CR) 10Majavah: [C:03+1] prometheus: move blackbox-exporter to log prober errors [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi)
[14:28:18] <mszabo>	 phuedx: yeah, still building the image
[14:28:28] <vgutierrez>	 !log upgrade to liberica 0.19 in lvs1013 - T397036
[14:28:28] <wikibugs>	 (03PS10) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767)
[14:28:30] <wikibugs>	 (03Merged) 10jenkins-bot: sessionstore-resources: move SessionStoreDiskSpaceRunwayTooLow to task [alerts] - 10https://gerrit.wikimedia.org/r/1156838 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French)
[14:28:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:31] <stashbot>	 T397036: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036
[14:28:34] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs1013.eqiad.wmnet} and A:liberica (T397036)
[14:28:43] <jinxer-wm>	 FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[14:29:06] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs1013.eqiad.wmnet} and A:liberica (T397036)
[14:29:13] <Dreamy_Jazz>	 jouncebot: nowandnext
[14:29:13] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 0 minute(s)
[14:29:13] <jouncebot>	 In 1 hour(s) and 0 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1530)
[14:30:00] <mszabo>	 14:26:12 [root] Image builds completed is the last log I see locally
[14:30:03] <Dreamy_Jazz>	 Decided that we do want to undo Tchanders
[14:30:12] <Dreamy_Jazz>	 *Tchanders change
[14:31:09] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: hotfix, remove duplicated env variables and volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159471 (https://phabricator.wikimedia.org/T369845) (owner: 10Brouberol)
[14:31:17] <wikibugs>	 (03PS1) 10Dreamy Jazz: Revert "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159477
[14:31:30] <mszabo>	 should be done in a sec now, it's finally deploying to testservers
[14:31:56] <wikibugs>	 (03PS2) 10Dreamy Jazz: Revert "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159477
[14:32:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: "For sure, I'll merge tomorrow EU morning" [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi)
[14:32:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi)
[14:32:43] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage
[14:33:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:34:12] <Dreamy_Jazz>	 Started a spiderpig job for the revert given that everything else in the window seems to be done and just waiting on this last one to merge.
[14:35:23] <mszabo>	 seems like my deploy is stuck on one of the non-k8s testservers for 4mins now
[14:35:25] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155738 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:35:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P78058 and previous config saved to /var/cache/conftool/dbconfig/20250616-143525-fceratto.json
[14:36:28] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus7002.magru.wmnet to drbd
[14:36:48] <jinxer-wm>	 FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:36:49] <icinga-wm>	 RECOVERY - Host prometheus7002 is UP: PING OK - Packet loss = 0%, RTA = 115.42 ms
[14:36:55] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10918872 (10Scott_French)
[14:37:48] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10918882 (10Scott_French) 05Open→03Resolved With the alert routing and severity changes now merged, I believe that wrap...
[14:38:43] <jinxer-wm>	 RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[14:39:11] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10918887 (10Jelto) >>! In T378922#10918550, @jcrespo wrote: >  > I'm sorry, but I thought that was an "outline", a summary of our d...
[14:39:31] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:38] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-availability magru - https://slo.wikimedia.org/?search=wdqs-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:40:13] <mszabo>	 Dreamy_Jazz: I'd say go ahead, scap clearly isn't collaborating with me today - I'm not sure if it allows you to go ahead in the present state, I can kill my deployment process as needed
[14:40:34] <Dreamy_Jazz>	 I would need to wait for your scap lock to be released.
[14:40:42] <claime>	 mszabo: if you do that, your patch will get deployed
[14:40:51] <claime>	 when the next scap run goes
[14:41:02] <mszabo>	 14:40:23 Started scap-cdb-rebuild-testservers
[14:41:06] <mszabo>	 it's watching us, clearly
[14:41:10] <Dreamy_Jazz>	 :D
[14:41:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T396130)', diff saved to https://phabricator.wikimedia.org/P78059 and previous config saved to /var/cache/conftool/dbconfig/20250616-144127-marostegui.json
[14:41:32] <stashbot>	 T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[14:41:34] <claime>	 I'd advise letting it finish :D
[14:41:36] <mszabo>	 claime: yeah that would have been fine since I could have checked it on the testservers in the next attempt
[14:41:43] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[14:41:45] <mszabo>	 but now hopefully we've broken the impasse
[14:41:48] <jinxer-wm>	 FIRING: [3x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:41:55] <Dreamy_Jazz>	 The reason it's going slowly is because your backport had i18n changes
[14:42:14] <Dreamy_Jazz>	 Changing i18n in backports makes everything really slow
[14:42:31] <mszabo>	 yeah fair, I wonder why there are non-k8s test servers in there though - I thought the non-k8s mwdebug was gone already
[14:42:34] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051 (jijiki) NEW
[14:42:51] <mszabo>	 that's T276994 apparently
[14:42:51] <stashbot>	 T276994: Provide an mwdebug functionality on kubernetes  (mw-experimental) - https://phabricator.wikimedia.org/T276994
[14:42:58] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10918906 (jijiki)
[14:43:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:43:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:43:56] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10918909 (jijiki)
[14:46:48] <jinxer-wm>	 FIRING: [5x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:47:10] <wikibugs>	 (PS6) Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[14:48:29] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1159438|Add missing labels for email confirmation reminder preferences (T58074)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:48:34] <stashbot>	 T58074: Echo: Generate periodic web notification to nudge users to confirm an unverified email address - https://phabricator.wikimedia.org/T58074
[14:49:05] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Add link to list archives in default footer - https://phabricator.wikimedia.org/T284256#10918944 (Effeietsanders) I ran into this again as admin, who received a reminder of pending moderation requests. That currently has no link to Posterius and it's actually quite a few cli...
[14:49:16] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Continuing with sync
[14:49:22] <mszabo>	 yay
[14:49:56] <phuedx>	 \o/
[14:50:17] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye
[14:50:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P78060 and previous config saved to /var/cache/conftool/dbconfig/20250616-145032-fceratto.json
[14:50:55] <phuedx>	 I wonder if there's a Phab task for making our deployment tooling flag that a change has i18n changes and so will take $aLongTime
[14:51:48] <jinxer-wm>	 FIRING: [6x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:52:39] <mszabo>	 well it did tell me it rebuilt the localization cache, I just didn't draw the proper conclusion :)
[14:53:08] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs7002.magru.wmnet} and A:liberica (T397036)
[14:53:12] <stashbot>	 T397036: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036
[14:53:22] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin depooling P{lvs7002.magru.wmnet} and A:liberica
[14:53:32] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7002.magru.wmnet} and A:liberica
[14:53:55] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin pooling P{lvs7002.magru.wmnet} and A:liberica
[14:54:01] <wikibugs>	 (CR) Gkyziridis: "Much appreciated that you worked in this." [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[14:54:19] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs7002.magru.wmnet} and A:liberica
[14:54:20] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum1001 [puppet] - https://gerrit.wikimedia.org/r/1159486
[14:54:20] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum1002 [puppet] - https://gerrit.wikimedia.org/r/1159487
[14:54:20] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum2001 [puppet] - https://gerrit.wikimedia.org/r/1159488
[14:54:21] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum2002 [puppet] - https://gerrit.wikimedia.org/r/1159489
[14:54:21] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs7002.magru.wmnet} and A:liberica (T397036)
[14:54:22] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum4001 [puppet] - https://gerrit.wikimedia.org/r/1159490
[14:54:23] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum4002 [puppet] - https://gerrit.wikimedia.org/r/1159491
[14:54:27] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum5001 [puppet] - https://gerrit.wikimedia.org/r/1159492
[14:54:31] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum5002 [puppet] - https://gerrit.wikimedia.org/r/1159493
[14:54:35] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum6001 [puppet] - https://gerrit.wikimedia.org/r/1159494
[14:54:39] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum6002 [puppet] - https://gerrit.wikimedia.org/r/1159495
[14:54:43] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum7002 [puppet] - https://gerrit.wikimedia.org/r/1159496
[14:54:47] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum7003 [puppet] - https://gerrit.wikimedia.org/r/1159497
[14:54:51] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum3003 [puppet] - https://gerrit.wikimedia.org/r/1159498
[14:54:56] <wikibugs>	 (PS1) Ssingh: hiera: set do_ech to false for durum3004 [puppet] - https://gerrit.wikimedia.org/r/1159499
[14:55:08] <vgutierrez>	 🍿
[14:56:48] <jinxer-wm>	 FIRING: [7x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:57:14] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10918982 (MoritzMuehlenhoff) Looks fine, please use codfw/row C and eqiad/row B
[14:57:25] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs7001.magru.wmnet} and A:liberica (T397036)
[14:57:39] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin depooling P{lvs7001.magru.wmnet} and A:liberica
[14:57:51] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7001.magru.wmnet} and A:liberica
[14:58:12] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin pooling P{lvs7001.magru.wmnet} and A:liberica
[14:58:37] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs7001.magru.wmnet} and A:liberica
[14:58:39] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs7001.magru.wmnet} and A:liberica (T397036)
[14:58:44] <stashbot>	 T397036: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036
[15:00:40] <wikibugs>	 (CR) Ssingh: "Plan is to merge per host and reimage." [puppet] - https://gerrit.wikimedia.org/r/1159486 (owner: Ssingh)
[15:01:48] <jinxer-wm>	 FIRING: [6x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:01:57] <wikibugs>	 (PS7) Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[15:02:00] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10919002 (MoritzMuehlenhoff)
[15:02:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:03:05] <Dreamy_Jazz>	 msazbo: Where are you in the deployment now?
[15:03:15] <Dreamy_Jazz>	 *mszabo:
[15:03:34] <mszabo>	 Dreamy_Jazz: any second now
[15:03:51] <mszabo>	 3 2 1
[15:04:04] <wikibugs>	 (CR) Gkyziridis: [C:+1] ores-extension: enable extension with revertrisk filter for the third batch of wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[15:04:05] <logmsgbot>	 !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159438|Add missing labels for email confirmation reminder preferences (T58074)]] (duration: 53m 29s)
[15:04:09] <mszabo>	 boom
[15:04:09] <stashbot>	 T58074: Echo: Generate periodic web notification to nudge users to confirm an unverified email address - https://phabricator.wikimedia.org/T58074
[15:04:16] <wikibugs>	 (CR) Ilias Sarantopoulos: "I removed them!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[15:04:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159477 (owner: Dreamy Jazz)
[15:05:38] <wikibugs>	 (Merged) jenkins-bot: Revert "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159477 (owner: Dreamy Jazz)
[15:05:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T395241)', diff saved to https://phabricator.wikimedia.org/P78062 and previous config saved to /var/cache/conftool/dbconfig/20250616-150541-fceratto.json
[15:05:52] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1159477|Revert "Enable temporary accounts onboarding dialog on WMF wikis"]]
[15:06:03] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[15:06:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T395241)', diff saved to https://phabricator.wikimedia.org/P78063 and previous config saved to /var/cache/conftool/dbconfig/20250616-150609-fceratto.json
[15:06:23] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1156515 (https://phabricator.wikimedia.org/T290759) (owner: Arlolra)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:48] <jinxer-wm>	 FIRING: [7x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:07:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:09:55] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1159477|Revert "Enable temporary accounts onboarding dialog on WMF wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:11:26] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet)
[15:11:48] <jinxer-wm>	 RESOLVED: [2x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1115:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:14:32] <wikibugs>	 (CR) Herron: [C:+1] thanos: enable tracing for rule [puppet] - https://gerrit.wikimedia.org/r/1159355 (https://phabricator.wikimedia.org/T394318) (owner: Filippo Giunchedi)
[15:16:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T395241)', diff saved to https://phabricator.wikimedia.org/P78064 and previous config saved to /var/cache/conftool/dbconfig/20250616-151641-fceratto.json
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:17:27] <inflatador>	 !log bking@cumin2002:~$ sudo cumin A:lvs-low-traffic 'run-puppet-agent' T387569
[15:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:31] <stashbot>	 T387569: Update Elastic puppet code to filter LVS config based on cluster membership - https://phabricator.wikimedia.org/T387569
[15:17:51] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye
[15:18:40] <wikibugs>	 (PS1) Effie Mouzeli: site.pp: add wikikube-worker-exp(1001|2001) [puppet] - https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994)
[15:19:22] <wikibugs>	 (CR) CI reject: [V:-1] site.pp: add wikikube-worker-exp(1001|2001) [puppet] - https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli)
[15:20:04] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet
[15:20:35] <logmsgbot>	 !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrading clouddbs T394372
[15:20:39] <stashbot>	 T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372
[15:22:03] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[15:23:22] <wikibugs>	 (PS2) Effie Mouzeli: site.pp: add wikikube-worker-exp(1001|2001) [puppet] - https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994)
[15:23:30] <wikibugs>	 (CR) FNegri: [C:+2] clouddb1013: upgrade to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1154804 (https://phabricator.wikimedia.org/T394372) (owner: FNegri)
[15:23:43] <wikibugs>	 SRE, Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10919161 (elukey) Resolved→Open
[15:23:56] <wikibugs>	 SRE, SRE-SLO, Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10919163 (elukey)
[15:25:33] <wikibugs>	 (Abandoned) Effie Mouzeli: mw-experimental-mediawiki-image-update: improve script [puppet] - https://gerrit.wikimedia.org/r/1153999 (owner: Effie Mouzeli)
[15:27:09] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] memcached: Switch to profile::memcached::firewall_src_sets [puppet] - https://gerrit.wikimedia.org/r/1156269 (owner: Muehlenhoff)
[15:27:18] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] memcached/gutter: Switch to firewall_src_sets [puppet] - https://gerrit.wikimedia.org/r/1156659 (owner: Muehlenhoff)
[15:27:42] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] mediawiki/memcached: Switch to firewall_src_sets [puppet] - https://gerrit.wikimedia.org/r/1156669 (owner: Muehlenhoff)
[15:28:52] <wikibugs>	 SRE, SRE-SLO, Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10919213 (elukey) I am reopening this task since I assumed something about https://wikitech.wikimedia.org/wiki/SLO/Citoid without reading it cor...
[15:29:46] <urandom>	 !log decommissioning sessionstore2004-a/Cassandra — T391544
[15:29:50] <wikibugs>	 (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1159503 (https://phabricator.wikimedia.org/T128546)
[15:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:51] <stashbot>	 T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544
[15:30:05] <jouncebot>	 jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1530).
[15:30:40] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159477|Revert "Enable temporary accounts onboarding dialog on WMF wikis"]] (duration: 24m 48s)
[15:31:39] <jan_drewniak>	 👋Just fyi, I'm actually going to do the Portals update today
[15:31:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P78065 and previous config saved to /var/cache/conftool/dbconfig/20250616-153148-fceratto.json
[15:34:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:35:51] <wikibugs>	 (CR) Eevans: [C:+2] sessionstore2004: reimage as JBOD [puppet] - https://gerrit.wikimedia.org/r/1153150 (https://phabricator.wikimedia.org/T391544) (owner: Eevans)
[15:37:19] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage
[15:38:28] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:39:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:39:42] <wikibugs>	 (CR) Jdrewniak: [C:+2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1159503 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:40:23] <wikibugs>	 (PS1) Bking: cirrussearch: remove non-existent hosts [puppet] - https://gerrit.wikimedia.org/r/1159507 (https://phabricator.wikimedia.org/T388610)
[15:40:39] <wikibugs>	 (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1159503 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[15:41:15] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage
[15:42:20] <wikibugs>	 (PS1) Jgiannelos: RB sunset: debug spike in changeprop events [deployment-charts] - https://gerrit.wikimedia.org/r/1159508
[15:42:48] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10919278 (WMDE-leszek) Thanks @herron, missed part of the template it seems. It is about a time-limited contract. Updating task description
[15:43:25] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10919281 (WMDE-leszek)
[15:43:28] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[15:43:41] <wikibugs>	 (CR) Bking: [C:+2] cirrussearch: remove non-existent hosts [puppet] - https://gerrit.wikimedia.org/r/1159507 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[15:44:17] <wikibugs>	 (CR) Bking: [C:+2] "self-merging in the interest of time (this is blocking a more important change)" [puppet] - https://gerrit.wikimedia.org/r/1159507 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[15:44:38] <wikibugs>	 (CR) Ssingh: [C:+2] hiera: set do_ech to false for durum1001 [puppet] - https://gerrit.wikimedia.org/r/1159486 (owner: Ssingh)
[15:46:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P78066 and previous config saved to /var/cache/conftool/dbconfig/20250616-154656-fceratto.json
[15:47:01] <wikibugs>	 (CR) Gkyziridis: [C:+1] ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[15:47:02] <wikibugs>	 (PS1) Vgutierrez: hiera: Issue a separate LE cert for upload cache cluster [puppet] - https://gerrit.wikimedia.org/r/1159510 (https://phabricator.wikimedia.org/T394484)
[15:47:09] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet
[15:47:40] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet
[15:47:54] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum1001.eqiad.wmnet with OS bookworm
[15:48:10] <wikibugs>	 (PS2) Vgutierrez: hiera: Issue a separate LE cert for upload cache cluster [puppet] - https://gerrit.wikimedia.org/r/1159510 (https://phabricator.wikimedia.org/T394484)
[15:49:18] <wikibugs>	 (PS1) Vgutierrez: hiera: Issue a separate GTS cert for upload cache cluster [puppet] - https://gerrit.wikimedia.org/r/1159511 (https://phabricator.wikimedia.org/T394484)
[15:49:28] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1159510 (https://phabricator.wikimedia.org/T394484) (owner: Vgutierrez)
[15:49:33] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye
[15:49:44] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919340 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.c...
[15:51:29] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1159511 (https://phabricator.wikimedia.org/T394484) (owner: Vgutierrez)
[15:52:03] <wikibugs>	 (PS1) Scott French: Revert^4 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - https://gerrit.wikimedia.org/r/1159513 (https://phabricator.wikimedia.org/T389786)
[15:52:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 10.64.48.95 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:53:21] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: Gkyziridis)
[15:53:29] <logmsgbot>	 !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1159503| Bumping portals to master (T128546)]] (duration: 09m 21s)
[15:53:33] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:55:49] <logmsgbot>	 !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1159503| Bumping portals to master (T128546)]] (duration: 02m 19s)
[15:55:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: EggRoll97)
[15:56:27] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.175.0" for 2 host(s)
[15:57:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.48.95 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:57:13] <wikibugs>	 (CR) Clément Goubert: [C:+1] Revert^4 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - https://gerrit.wikimedia.org/r/1159513 (https://phabricator.wikimedia.org/T389786) (owner: Scott French)
[15:58:17] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.175.0" completed for 2 hosts
[15:58:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:59:57] <wikibugs>	 (CR) Btullis: [C:+1] Revert^4 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - https://gerrit.wikimedia.org/r/1159513 (https://phabricator.wikimedia.org/T389786) (owner: Scott French)
[16:01:19] <wikibugs>	 (PS2) Jgiannelos: RB sunset: debug spike in changeprop events [deployment-charts] - https://gerrit.wikimedia.org/r/1159508
[16:02:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T395241)', diff saved to https://phabricator.wikimedia.org/P78067 and previous config saved to /var/cache/conftool/dbconfig/20250616-160203-fceratto.json
[16:02:13] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[16:02:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T395241)', diff saved to https://phabricator.wikimedia.org/P78068 and previous config saved to /var/cache/conftool/dbconfig/20250616-160220-fceratto.json
[16:03:06] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage
[16:03:36] <wikibugs>	 (CR) Hnowlan: [C:+1] RB sunset: debug spike in changeprop events [deployment-charts] - https://gerrit.wikimedia.org/r/1159508 (owner: Jgiannelos)
[16:04:59] <wikibugs>	 (PS1) Bking: cirrussearch: move soon-to-be-decommed hosts to insetup role [puppet] - https://gerrit.wikimedia.org/r/1159517 (https://phabricator.wikimedia.org/T395855)
[16:05:09] <wikibugs>	 (PS1) Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994)
[16:05:41] <wikibugs>	 (CR) Bking: [C:-1] "Do not merge until we safely remove these hosts from the cluster." [puppet] - https://gerrit.wikimedia.org/r/1159517 (https://phabricator.wikimedia.org/T395855) (owner: Bking)
[16:05:48] <wikibugs>	 (CR) Jgiannelos: [C:+2] RB sunset: debug spike in changeprop events [deployment-charts] - https://gerrit.wikimedia.org/r/1159508 (owner: Jgiannelos)
[16:06:40] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage
[16:07:37] <wikibugs>	 (03Merged) 10jenkins-bot: RB sunset: debug spike in changeprop events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159508 (owner: 10Jgiannelos)
[16:08:28] <wikibugs>	 (03PS1) 10Effie Mouzeli: hieradata: make wikikube-worker2100 a normal worker [puppet] - 10https://gerrit.wikimedia.org/r/1159519
[16:08:54] <wikibugs>	 (03PS1) 10Ebernhardson: Turn off glent m1 AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159520 (https://phabricator.wikimedia.org/T262612)
[16:09:26] <hnowlan>	 jouncebot: nowandnext
[16:09:26] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 50 minute(s)
[16:09:26] <jouncebot>	 In 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1700)
[16:09:27] <jouncebot>	 In 0 hour(s) and 50 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1700)
[16:09:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159520 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson)
[16:09:47] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[16:09:50] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[16:09:57] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[16:10:04] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[16:10:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[16:10:48] <wikibugs>	 (03CR) 10Muehlenhoff: site.pp: make wikikube-worker-exp* k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[16:11:05] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[16:11:47] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[16:11:53] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[16:12:18] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[16:12:24] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[16:12:33] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[16:13:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T395241)', diff saved to https://phabricator.wikimedia.org/P78069 and previous config saved to /var/cache/conftool/dbconfig/20250616-161303-fceratto.json
[16:13:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:13:32] <wikibugs>	 (03PS2) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994)
[16:14:13] <wikibugs>	 (03PS3) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994)
[16:14:44] <wikibugs>	 (03CR) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[16:18:28] <jinxer-wm>	 RESOLVED: ProbeDown: Service sessionstore2004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:21:18] <wikibugs>	 (03PS1) 10Effie Mouzeli: kubernetes.yaml: switch mw-experimental to debug image [puppet] - 10https://gerrit.wikimedia.org/r/1159524 (https://phabricator.wikimedia.org/T396767)
[16:23:27] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1001.eqiad.wmnet with OS bookworm
[16:23:57] <wikibugs>	 (03PS1) 10Clare Ming: xLab: Deploy v0.7.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159526 (https://phabricator.wikimedia.org/T392898)
[16:27:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.48.95 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:28:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P78070 and previous config saved to /var/cache/conftool/dbconfig/20250616-162810-fceratto.json
[16:30:21] <wikibugs>	 (03PS2) 10Cathal Mooney: Rename cloud-in to cloud-vrf-in [homer/public] - 10https://gerrit.wikimedia.org/r/1159415
[16:30:58] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.7.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159526 (https://phabricator.wikimedia.org/T392898) (owner: 10Clare Ming)
[16:32:28] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploy v0.7.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159526 (https://phabricator.wikimedia.org/T392898) (owner: 10Clare Ming)
[16:37:25] <wikibugs>	 (03PS1) 10Eevans: sessionstore: use correct partman preseed [puppet] - 10https://gerrit.wikimedia.org/r/1159530 (https://phabricator.wikimedia.org/T391544)
[16:39:41] <wikibugs>	 (03CR) 10Eevans: [C:03+2] sessionstore: use correct partman preseed [puppet] - 10https://gerrit.wikimedia.org/r/1159530 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[16:41:07] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum1002 [puppet] - 10https://gerrit.wikimedia.org/r/1159487 (owner: 10Ssingh)
[16:41:15] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum2001 [puppet] - 10https://gerrit.wikimedia.org/r/1159488 (owner: 10Ssingh)
[16:41:58] <wikibugs>	 (03PS1) 10Hnowlan: Revert "changeprop: Remove rules related to parsoid (RB sunset)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159535 (https://phabricator.wikimedia.org/T397072)
[16:42:20] <logmsgbot>	 eevans@cumin1003 reimage (PID 1845984) is awaiting input
[16:43:17] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum1002.eqiad.wmnet with OS bookworm
[16:43:18] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum2001.codfw.wmnet with OS bookworm
[16:43:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P78071 and previous config saved to /var/cache/conftool/dbconfig/20250616-164317-fceratto.json
[16:43:59] <wikibugs>	 (03CR) 10Eevans: [C:03+2] sessionstore2004: reconfigure for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1153151 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[16:44:25] <logmsgbot>	 andrew@cumin1002 reimage (PID 2661898) is awaiting input
[16:44:47] <wikibugs>	 (03PS2) 10Eevans: sessionstore2004: reconfigure for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1153151 (https://phabricator.wikimedia.org/T391544)
[16:45:02] <logmsgbot>	 !log eevans@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2004.codfw.wmnet with OS bullseye
[16:45:15] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919690 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw...
[16:45:30] <wikibugs>	 (03CR) 10Scott French: [C:03+1] site.pp: add wikikube-worker-exp(1001|2001) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[16:46:10] <icinga-wm>	 PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100%
[16:46:18] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye
[16:46:28] <wikibugs>	 (03CR) 10Eevans: [C:03+2] sessionstore2004: reconfigure for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1153151 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[16:46:34] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919692 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.c...
[16:47:52] <mutante>	 sukhe: hi, that seems to be a durum issue. should I take a look?
[16:48:06] <mutante>	 talking about the icinga alert above for wikimedia-dns.org
[16:48:10] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:48:11] <sukhe>	 mutante: hi, no worries, this is a bit weird though because the other hosts should be up
[16:48:18] <sukhe>	 checking and thanks
[16:48:25] <mutante>	 sukhe: alright, thanks as well
[16:48:55] <sukhe>	 we have one host each in eqiad and codfw up, for example, and serving traffic
[16:49:32] <mutante>	 the DNS lookup of check.wikimedia-dns.org works for me :P
[16:50:13] <sukhe>	 so v6 ping fails from the Icinga host but v4 works. that's weird though, because the other hosts are advertising the v6 IP
[16:50:18] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-eqsin and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581)
[16:50:18] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=97) rolling upgrade of Varnish on A:cp-eqsin and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581)
[16:50:22] <stashbot>	 T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581
[16:50:23] <mutante>	 except the v6 reverse record does not exist
[16:50:23] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-eqsin and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581)
[16:51:04] <sukhe>	 hmm weird
[16:51:39] <brett>	 My bad!
[16:51:41] <mutante>	 no recent changes in DNS repo
[16:51:42] <sukhe>	 (not the v6 records)
[16:52:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10919719 (10VRiley-WMF) After looking at this unit, it seems like the server is healthy. @Eevans can you confirm these drives are actually bad? If so, which drives need to be replaced?
[16:52:43] <sukhe>	 so yeah, durum2002 is advertising the v6'es alright
[16:52:59] <mutante>	 brett:   ssh-keygen -f '/home/mutante/.ssh/known_hosts.d/wmf-prod' -R 'durum2002.codfw.wmnet'
[16:53:20] <mutante>	 eh.. that was entirely a bad paste. sorry
[16:53:26] <brett>	 okay, was confused :)
[16:53:31] <mutante>	 ignore:)
[16:53:50] <wikibugs>	 (03PS1) 10Eevans: sessionstore2004: expand configuration w/ 4 new devices [puppet] - 10https://gerrit.wikimedia.org/r/1159537 (https://phabricator.wikimedia.org/T391544)
[16:54:19] <brett>	 Are the bfd alarms intentional?
[16:54:32] <sukhe>	 they are expected for sure, given the durum hosts reimaging
[16:54:41] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1155190 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi)
[16:54:41] <brett>	 sweet, thanks
[16:54:53] <sukhe>	 10.192.32.58 for example is durum2001
[16:55:10] <logmsgbot>	 !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2004.codfw.wmnet with OS bullseye
[16:55:31] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919724 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw...
[16:55:55] <mutante>	 sukhe: curl -6 https://check.wikimedia-dns.org
[16:56:02] <mutante>	 works for me as well from durum2002
[16:56:23] <mutante>	 just not from alert1002
[16:56:25] <mutante>	 firewalling?
[16:56:26] <sukhe>	 yeah it's weird for sure... 
[16:57:05] <sukhe>	 ok let's see when the two durum hosts come up
[16:57:11] <mutante>	 not just TCP, also ICMP / ping6 is dropped
[16:57:16] <mutante>	 ack
[16:57:18] <sukhe>	 because I can't reach the v6 from Icinga but can reach v4
[16:57:18] <sukhe>	 yeah
[16:58:00] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage
[16:58:21] <wikibugs>	 (03CR) 10Effie Mouzeli: kubernetes: create mediawiki_experimental profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[16:58:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T395241)', diff saved to https://phabricator.wikimedia.org/P78072 and previous config saved to /var/cache/conftool/dbconfig/20250616-165825-fceratto.json
[16:58:48] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance
[16:58:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T395241)', diff saved to https://phabricator.wikimedia.org/P78073 and previous config saved to /var/cache/conftool/dbconfig/20250616-165855-fceratto.json
[16:59:10] <wikibugs>	 (03CR) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[16:59:55] <wikibugs>	 (03PS3) 10Effie Mouzeli: site.pp: add wikikube-worker-exp(1001|2001) [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994)
[17:00:04] <jouncebot>	 swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1700).
[17:00:04] <jouncebot>	 ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1700).
[17:00:56] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "This seems to combine steps #2 and #3 from [0]. Do we actually want that? (i.e., do we want to refrain from adding to the conftool entitie" [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[17:01:10] <swfrench-wmf>	 o/
[17:01:41] <btullis>	 o/
[17:02:52] <wikibugs>	 (03CR) 10Scott French: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1159513 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French)
[17:02:56] <wikibugs>	 (03CR) 10Scott French: [C:03+2] Revert^4 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1159513 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French)
[17:03:11] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage
[17:03:31] <swfrench-wmf>	 btullis: I didn't realize you'd be around, thank you :)
[17:03:35] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage
[17:04:47] <wikibugs>	 (03PS4) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994)
[17:05:43] <wikibugs>	 (03PS5) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994)
[17:05:59] <wikibugs>	 (03CR) 10Effie Mouzeli: "good point!" [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[17:06:27] <btullis>	 swfrench-wmf: is it 4th time lucky? :-)
[17:06:52] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye
[17:07:05] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919781 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.c...
[17:07:12] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage
[17:07:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T395241)', diff saved to https://phabricator.wikimedia.org/P78074 and previous config saved to /var/cache/conftool/dbconfig/20250616-170726-fceratto.json
[17:07:41] <mutante>	 sukhe: on durum2002, if I do a "nft list ruleset | less" and look at the PRODUCTION_NETWORKS_ipv6 there are a bunch of networks including 2620:0:861:300:* but does it NOT cover  2620:0:861:3:* which is the one bound on alert1002? or am I not missing it
[17:07:56] <mutante>	 eh, not getting it/missing it:)
[17:11:12] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - T389786
[17:11:16] <stashbot>	 T389786: Integrate mediawiki-dumps-legacy with the regular MW scap deployments - https://phabricator.wikimedia.org/T389786
[17:11:31] <sukhe>	 mutante: durum2002 has ip6 saddr { 2620:0:860:2:208:80:153:42, 2620:0:860:102:10:192:16:75, 2620:0:860:103:10:192:32:67, 2620:0:860:10a:10:192:9:11, 2620:0:860:11e:10:192:39:10, 2620:0:861:3:208:80:154:78 } udp dport 1-65535 accept
[17:12:02] <sukhe>	 which covers the Icinga host I think
[17:12:49] <sukhe>	 and durum2002 works from the Icinga host, so there's that too.
[17:12:55] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - T389786 (duration: 02m 15s)
[17:13:10] <sukhe>	 so durum1002 and 2001 should be coming up
[17:13:12] <sukhe>	 let's see then
[17:13:24] <swfrench-wmf>	 btullis: I think we got it this time :)
[17:13:28] <swfrench-wmf>	 I'll follow up on the task
[17:14:11] <swfrench-wmf>	 FYI, I'm done with planned changes for the UTC-late infra window
[17:14:20] <mutante>	 yea, you are right
[17:15:26] <wikibugs>	 (03CR) 10Eevans: [C:03+2] sessionstore2004: expand configuration w/ 4 new devices [puppet] - 10https://gerrit.wikimedia.org/r/1159537 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans)
[17:16:37] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[17:17:00] <logmsgbot>	 !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[17:20:14] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1002.eqiad.wmnet with OS bookworm
[17:20:19] <wikibugs>	 (03PS1) 10Dzahn: microsites: delete os_reports profile, remove from role [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T350794)
[17:20:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:22:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P78075 and previous config saved to /var/cache/conftool/dbconfig/20250616-172234-fceratto.json
[17:22:42] <wikibugs>	 (03PS2) 10Dzahn: microsites: delete os_reports profile, remove from role [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080)
[17:23:51] <btullis>	 swfrench-wmf: Ack, many thanks.
[17:24:50] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage
[17:25:26] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "The monitoring code needs to be moved and adjusted to k8s first.. let me merge it into miscweb::monitoring and then move that whole file e" [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[17:25:39] <icinga-wm>	 RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms
[17:25:46] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2001.codfw.wmnet with OS bookworm
[17:25:50] <wikibugs>	 (03PS1) 10Ssingh: hiera: durum: set hc for canonical domain to / [puppet] - 10https://gerrit.wikimedia.org/r/1159541
[17:25:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:25:54] <mutante>	 interesting, so it is the single instance
[17:26:31] <sukhe>	 mutante: I have not isolated it yet but it was certainly monitoring. I could access the service over v6 and well, durum2002 was up
[17:26:44] <sukhe>	 but yeah, updating the health checks above, just in case
[17:27:27] <mutante>	 so the actual issue is it should have been using the other durum instance when one goes down?
[17:27:33] <sukhe>	 yes
[17:27:37] <mutante>	 gotcha
[17:27:51] <sukhe>	 mutante: basically since it's an anycast service
[17:28:02] <sukhe>	 so all 4x durum hosts (2x per eqiad/codfw) advertise the same IPs
[17:28:02] <mutante>	 yea
[17:28:03] <wikibugs>	 (03PS1) 10Bernard Wang: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159542
[17:28:07] <sukhe>	 rather, all 14 durum hosts!
[17:28:51] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage
[17:29:09] <sukhe>	 and durum2002 was up (it has not been reimaged) and _should_ have been reachable from icinga
[17:29:12] <sukhe>	 and the ping -6 was
[17:29:18] <sukhe>	 and even the host itself
[17:29:24] <sukhe>	 but why not the domain specifically, not sure
[17:29:50] <sukhe>	 and the DNS record for it is not DYNA or anything, so it does not depend on where it is coming from
[17:31:43] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5969/c" [puppet] - 10https://gerrit.wikimedia.org/r/1159541 (owner: 10Ssingh)
[17:32:38] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] hiera: durum: set hc for canonical domain to / [puppet] - 10https://gerrit.wikimedia.org/r/1159541 (owner: 10Ssingh)
[17:32:51] <sukhe>	 thanks brett
[17:33:00] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: durum: set hc for canonical domain to / [puppet] - 10https://gerrit.wikimedia.org/r/1159541 (owner: 10Ssingh)
[17:33:15] <sukhe>	 !log disable puppet on A:durum to roll out CR 1159541
[17:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:28] <wikibugs>	 (03CR) 10Scott French: [C:03+1] site.pp: add wikikube-worker-exp(1001|2001) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[17:34:53] <mutante>	 sukhe: possibly this: "When Prometheus (or another client) queries the Alertmanager anycast DNS address for health status, it will only reach the closest instance. "
[17:35:15] <mutante>	 arr, that's AI though making this claim, sorry
[17:35:25] <sukhe>	 mutante: hah well in this case, I did query the direct IP as well
[17:35:32] <wikibugs>	 (03CR) 10Scott French: [C:03+1] site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[17:36:05] <sukhe>	 !log sudo cumin -b1 -s10 'A:durum' 'run-puppet-agent --enable "merging CR 1159541"'
[17:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:06] <icinga-wm>	 PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100%
[17:37:23] <sukhe>	 ^ ha
[17:37:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P78076 and previous config saved to /var/cache/conftool/dbconfig/20250616-173741-fceratto.json
[17:37:43] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Effie! Agreed with the commit message that a reimage is the cleanest cleanup option." [puppet] - 10https://gerrit.wikimedia.org/r/1159519 (owner: 10Effie Mouzeli)
[17:37:44] <brett>	 :(
[17:37:47] <sukhe>	 so very clearly for some reason, it only cares about durum1001
[17:37:50] <sukhe>	 that's a fun discovery
[17:39:21] <mutante>	 hrmm... #netops #anycast_routing 
[17:39:28] <sukhe>	 ha yeah
[17:39:34] <sukhe>	 will first debug and then see
[17:42:52] <sukhe>	 should be up soon but yeah, does not answer the question of why it does not try to reach codfw
[17:43:25] <sukhe>	 I know we don't advertise prefixes from PoPs to core but this is not that
[17:43:55] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:44:14] <icinga-wm>	 RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[17:45:18] <wikibugs>	 (03PS3) 10Cwhite: logstash: Reroute apache.access logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/1153279 (https://phabricator.wikimedia.org/T390215)
[17:45:42] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Effie! Feel free to merge without an additional round of review from me once you resolve (or decide to defer) the open comment abo" [puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[17:47:45] <wikibugs>	 (03PS1) 10Dzahn: microsites: adjust monitoring for os_reports to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1159545 (https://phabricator.wikimedia.org/T397080)
[17:48:27] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "before this, there should be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1159545" [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[17:49:31] <jinxer-wm>	 FIRING: ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:50:26] <brett>	 urandom: Expected?
[17:50:30] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum2002 [puppet] - 10https://gerrit.wikimedia.org/r/1159489 (owner: 10Ssingh)
[17:50:39] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum2002.codfw.wmnet with OS bookworm
[17:50:44] * swfrench-wmf is willing to guess yes, but confirmation would be good
[17:50:57] <sukhe>	 urandom was working on this host re: partman issues 
[17:51:12] <sukhe>	 https://sal.toolforge.org/log/q-tceZcB8tZ8Ohr0YV3e
[17:51:19] <sukhe>	 so I would say yes
[17:52:06] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum6001 [puppet] - 10https://gerrit.wikimedia.org/r/1159494 (owner: 10Ssingh)
[17:52:27] <wikibugs>	 (03CR) 10AOkoth: microsites: adjust monitoring for os_reports to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159545 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[17:52:37] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2004.codfw.wmnet with OS bullseye
[17:52:44] <sukhe>	 brett: ^
[17:52:48] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919958 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sess...
[17:52:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T395241)', diff saved to https://phabricator.wikimedia.org/P78077 and previous config saved to /var/cache/conftool/dbconfig/20250616-175248-fceratto.json
[17:53:10] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance
[17:53:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T395241)', diff saved to https://phabricator.wikimedia.org/P78078 and previous config saved to /var/cache/conftool/dbconfig/20250616-175317-fceratto.json
[17:53:28] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:54:32] <wikibugs>	 (03CR) 10Ssingh: "I would say that Colombia, Venezuela, Ecuador are the only ones we should consider merging. For the rest, let's iterate one by one. Let me" [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins)
[17:55:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[17:57:06] <icinga-wm>	 PROBLEM - Host sessionstore2004 is DOWN: PING CRITICAL - Packet loss = 100%
[17:58:34] <icinga-wm>	 RECOVERY - Host sessionstore2004 is UP: PING OK - Packet loss = 0%, RTA = 30.55 ms
[17:59:30] <sukhe>	 bird/BGP/BFD alerts are expected for the durum hosts. I will point out the non-obvious oes.
[17:59:33] <sukhe>	 *ones
[18:00:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:00:18] <sukhe>	 ^ expected
[18:00:19] <wikibugs>	 (03CR) 10Scott French: [C:03+1] kubernetes.yaml: switch mw-experimental to debug image [puppet] - 10https://gerrit.wikimedia.org/r/1159524 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[18:00:54] <wikibugs>	 (03PS2) 10Dzahn: microsites: adjust monitoring for os_reports to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1159545 (https://phabricator.wikimedia.org/T397080)
[18:01:05] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: Add lvs1016 to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall)
[18:01:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T395241)', diff saved to https://phabricator.wikimedia.org/P78079 and previous config saved to /var/cache/conftool/dbconfig/20250616-180141-fceratto.json
[18:04:25] <wikibugs>	 (03CR) 10Dzahn: microsites: adjust monitoring for os_reports to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159545 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[18:04:28] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 6 hosts with reason: begin decom/remove hosts from cluster
[18:04:50] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] hiera: Add lvs1016 to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall)
[18:05:02] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] microsites: adjust monitoring for os_reports to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1159545 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[18:06:46] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye
[18:07:38] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum4001 [puppet] - 10https://gerrit.wikimedia.org/r/1159490 (owner: 10Ssingh)
[18:08:23] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS bookworm
[18:08:25] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage
[18:12:27] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage
[18:13:30] <urandom>	 !log bootstrapping sessionstore2004-a/Cassandra — T390514
[18:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:15:10] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:15:38] <sukhe>	 ^ expected
[18:16:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P78080 and previous config saved to /var/cache/conftool/dbconfig/20250616-181649-fceratto.json
[18:20:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920023 (10BCornwall) Okay, so we're ready to reimage lvs1016 but it appears that the mgmt interface isn't reachable. Could dcops look into this, please?
[18:22:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920042 (10BCornwall)
[18:23:38] <wikibugs>	 (03PS3) 10Dzahn: microsites: delete os_reports profile, remove from role [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080)
[18:28:38] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[18:28:59] <wikibugs>	 (03CR) 10AOkoth: [C:03+1] microsites: delete os_reports profile, remove from role [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[18:29:42] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2002.codfw.wmnet with OS bookworm
[18:30:10] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:30:24] <sukhe>	 ^ going away shortly
[18:31:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P78081 and previous config saved to /var/cache/conftool/dbconfig/20250616-183156-fceratto.json
[18:32:15] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage
[18:32:30] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10920067 (10jcrespo) Thanks, that's more insightful and helpful, I will give it a think and maybe talk to Matthew and will try to w...
[18:33:49] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] wikimedia-dns.org: remove TYPE65 record for check [dns] - 10https://gerrit.wikimedia.org/r/1156373 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[18:35:10] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:39:14] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] microsites: delete os_reports profile, remove from role [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn)
[18:41:41] <sukhe>	 checking why 10.192.48.14 continues to be a problem
[18:41:52] <sukhe>	 it is advertising all the right things, BGP session is up
[18:43:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:43:28] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10920086 (10KFrancis) @herron I am confirming an NDA is on file for Andrew Green.  Thanks!
[18:43:34] <sukhe>	 oh yeah, it did clear up
[18:43:38] <sukhe>	 artifact
[18:45:03] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:47:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T395241)', diff saved to https://phabricator.wikimedia.org/P78082 and previous config saved to /var/cache/conftool/dbconfig/20250616-184704-fceratto.json
[18:47:15] <wikibugs>	 (03PS1) 10BCornwall: Revert "hiera: Add lvs1016 to high-traffic1" [puppet] - 10https://gerrit.wikimedia.org/r/1159549
[18:47:24] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2221.codfw.wmnet with reason: Maintenance
[18:47:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T395241)', diff saved to https://phabricator.wikimedia.org/P78083 and previous config saved to /var/cache/conftool/dbconfig/20250616-184731-fceratto.json
[18:47:34] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Revert "hiera: Add lvs1016 to high-traffic1" [puppet] - 10https://gerrit.wikimedia.org/r/1159549 (owner: 10BCornwall)
[18:48:55] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] Revert "hiera: Add lvs1016 to high-traffic1" [puppet] - 10https://gerrit.wikimedia.org/r/1159549 (owner: 10BCornwall)
[18:50:07] <wikibugs>	 (03CR) 10Kimberly Sarabia: "is there a ticket for this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159542 (owner: 10Bernard Wang)
[18:50:27] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4001.ulsfo.wmnet with OS bookworm
[18:50:43] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum4002 [puppet] - 10https://gerrit.wikimedia.org/r/1159491 (owner: 10Ssingh)
[18:50:43] <wikibugs>	 (03PS1) 10Dzahn: miscweb: delete miscweb::rsync profile [puppet] - 10https://gerrit.wikimedia.org/r/1159550 (https://phabricator.wikimedia.org/T397080)
[18:53:17] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum4002.ulsfo.wmnet with OS bookworm
[18:55:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:56:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T395241)', diff saved to https://phabricator.wikimedia.org/P78084 and previous config saved to /var/cache/conftool/dbconfig/20250616-185600-fceratto.json
[18:58:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:00:06] <wikibugs>	 (03CR) 10Scott French: "Thanks, Effie! One lingering issue and optional simplification you should feel free to defer. Please feel free to merge without additional" [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli)
[19:01:49] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum5001 [puppet] - 10https://gerrit.wikimedia.org/r/1159492 (owner: 10Ssingh)
[19:02:31] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum5001.eqsin.wmnet with OS bookworm
[19:07:14] <wikibugs>	 (03CR) 10CDobbins: "Sounds good to me. I'll update the change." [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins)
[19:08:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:08:28] <jinxer-wm>	 RESOLVED: ProbeDown: Service sessionstore2004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:08:32] <sukhe>	 I did silence the above :P
[19:09:46] <wikibugs>	 (03PS11) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334
[19:11:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P78085 and previous config saved to /var/cache/conftool/dbconfig/20250616-191108-fceratto.json
[19:11:09] <wikibugs>	 (03CR) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse)
[19:12:24] <wikibugs>	 (03PS3) 10Ssingh: wikimedia-dns.org: remove TYPE65 record for check [dns] - 10https://gerrit.wikimedia.org/r/1156373 (https://phabricator.wikimedia.org/T205378)
[19:12:32] <wikibugs>	 (03CR) 10Ssingh: "Rebased, no code change." [dns] - 10https://gerrit.wikimedia.org/r/1156373 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[19:13:44] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage
[19:14:11] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: remove TYPE65 record for check [dns] - 10https://gerrit.wikimedia.org/r/1156373 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[19:14:23] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[19:15:16] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[19:16:45] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage
[19:19:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10920139 (10AndyRussG_volunteer) Thanks so much, @WMDE-leszek, @herron, @KFrancis, hugely appreciated.  - I signed L3 with using this, my volunteer, account. (As you can see,...
[19:26:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P78086 and previous config saved to /var/cache/conftool/dbconfig/20250616-192615-fceratto.json
[19:34:16] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4002.ulsfo.wmnet with OS bookworm
[19:34:42] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: Reroute apache.access logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/1153279 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite)
[19:35:21] <wikibugs>	 (03PS1) 10Zabe: Stop setting wgRevisionSlotsCacheExpiry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159552 (https://phabricator.wikimedia.org/T183490)
[19:41:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T395241)', diff saved to https://phabricator.wikimedia.org/P78087 and previous config saved to /var/cache/conftool/dbconfig/20250616-194123-fceratto.json
[19:41:33] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2222.codfw.wmnet with reason: Maintenance
[19:41:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T395241)', diff saved to https://phabricator.wikimedia.org/P78088 and previous config saved to /var/cache/conftool/dbconfig/20250616-194140-fceratto.json
[19:42:34] <wikibugs>	 (03PS1) 10Bvibber: Update chart-renderer in production to latest merge build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159556 (https://phabricator.wikimedia.org/T395165)
[19:45:13] <wikibugs>	 (03CR) 10Bvibber: [C:03+2] Update chart-renderer in production to latest merge build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159556 (https://phabricator.wikimedia.org/T395165) (owner: 10Bvibber)
[19:45:40] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage
[19:46:58] <wikibugs>	 (03Merged) 10jenkins-bot: Update chart-renderer in production to latest merge build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159556 (https://phabricator.wikimedia.org/T395165) (owner: 10Bvibber)
[19:49:24] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage
[19:50:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T395241)', diff saved to https://phabricator.wikimedia.org/P78089 and previous config saved to /var/cache/conftool/dbconfig/20250616-195004-fceratto.json
[19:50:29] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp70[02-16].magru.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581)
[19:50:33] <stashbot>	 T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581
[19:50:56] <logmsgbot>	 !log bvibber@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply
[19:51:29] <logmsgbot>	 !log bvibber@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
[19:51:39] <logmsgbot>	 !log bvibber@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
[19:52:05] <logmsgbot>	 !log bvibber@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
[19:52:13] <logmsgbot>	 !log bvibber@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[19:52:43] <logmsgbot>	 !log bvibber@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[19:59:29] <wikibugs>	 (03PS1) 10Bvibber: Quiet test rollout of Lua transforms for Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159559 (https://phabricator.wikimedia.org/T388616)
[19:59:31] <icinga-wm>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T2000). nyaa~
[20:00:05] <jouncebot>	 Nemoralis, arlolra, EggRoll97, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:03:40] <ebernhardson>	 \o
[20:03:53] <arlolra>	 here
[20:04:16] <EggRoll97>	 here
[20:05:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P78090 and previous config saved to /var/cache/conftool/dbconfig/20250616-200512-fceratto.json
[20:05:27] <zabe>	 I can deploy if no one else is here
[20:06:01] <arlolra>	 I can handle my own deploy and am also willing to do for others
[20:07:26] <zabe>	 Alright, feel free to do it then
[20:07:51] <arlolra>	 Is Nemoralis around?
[20:08:09] <arlolra>	 If not, I'll get started with mine
[20:08:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156515 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra)
[20:09:07] <brett>	 !log restarting pybal on lvs1020
[20:09:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:31] <icinga-wm>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[20:09:46] <wikibugs>	 (03Merged) 10jenkins-bot: Disable VipsScaler in group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156515 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra)
[20:09:47] <brett>	 !log restarting pybal on lvs1017
[20:09:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:01] <logmsgbot>	 !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1156515|Disable VipsScaler in group2 (T290759)]]
[20:10:06] <stashbot>	 T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759
[20:11:59] <logmsgbot>	 !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1156515|Disable VipsScaler in group2 (T290759)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:12:01] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5001.eqsin.wmnet with OS bookworm
[20:12:47] <dancy>	 Looks like I broke something in the SpiderPig job log viewer.  Looking into it.
[20:13:00] <arlolra>	 Thanks
[20:13:06] <logmsgbot>	 !log arlolra@deploy1003 arlolra: Continuing with sync
[20:13:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:13:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:16:49] <sukhe>	 ^ last alert, artifact from earlier. should resolve soon. the BFD one that is
[20:17:40] <ryankemper>	 !log T395855 Stopped opensearch units on `cirrussearch205[7,8]` (row B decom hosts)
[20:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:44] <stashbot>	 T395855: Decommission cirrussearch2055-2060 - https://phabricator.wikimedia.org/T395855
[20:18:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:19:54] <logmsgbot>	 !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156515|Disable VipsScaler in group2 (T290759)]] (duration: 09m 53s)
[20:20:00] <stashbot>	 T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759
[20:20:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P78091 and previous config saved to /var/cache/conftool/dbconfig/20250616-202019-fceratto.json
[20:21:26] <arlolra>	 EggRoll97: You're up next
[20:21:31] <EggRoll97>	 Yep
[20:26:20] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon2006-dev.codfw.wmnet with OS bullseye
[20:27:48] <arlolra>	 EggRoll97: Is there any sort of review process that needs to happen for ipinfo-view-full to be granted?
[20:28:33] <EggRoll97>	 arlolra: Shouldn't be, afaik it's redundant without checkuser-temporary-account
[20:29:34] <EggRoll97>	 And given arbcom is elected and ipinfo-view-full is a subset of admin I didn't see any problem with it at the time, only checkuser-temporary-account is specifically blocked from being added to other groups in Limits to config changes
[20:29:35] <arlolra>	 Just that none of the other arbcom have that
[20:32:00] <EggRoll97>	 I think the other arbcoms were created before ipinfo-view-full was necessarily relevant to usergroups
[20:32:18] <EggRoll97>	 Arbcom groups*, sorry
[20:32:53] <arlolra>	 It looks like zhwiki opted to deploy without it to start
[20:32:53] <arlolra>	 https://phabricator.wikimedia.org/T374455#10136177
[20:33:13] <EggRoll97>	 I see, will do
[20:33:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920481 (10VRiley-WMF) @BCornwall Hey there, thanks for letting us know. I did replace the cable and it seems to respond to ping. Would you be able to check again? It seems to...
[20:34:31] <arlolra>	 EggRoll97: Also, https://phabricator.wikimedia.org/T374528
[20:34:42] <arlolra>	 oathauth-enable might be unnecessary
[20:35:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T395241)', diff saved to https://phabricator.wikimedia.org/P78092 and previous config saved to /var/cache/conftool/dbconfig/20250616-203526-fceratto.json
[20:36:57] <EggRoll97>	 arlolra: oathauth-enable being removed in T374528 only appears to affect itwiki and newiki, and arbcom wouldn't necessarily be a privileged group (especially if the arbcom members aren't in the sysop group or similar), so it may not be redundant yet
[20:36:58] <stashbot>	 T374528: Remove redundant oathauth-enable flag - https://phabricator.wikimedia.org/T374528
[20:38:22] <arlolra>	 Ok
[20:38:40] <arlolra>	 Do you want me to amend the patch or will you push PS2?
[20:39:50] <wikibugs>	 (03PS12) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334
[20:39:56] <EggRoll97>	 No preference either way, it may take me a couple minutes to push PS2 unless you amend it
[20:41:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920518 (10BCornwall) @VRiley-WMF Thanks for the quick response! I've not been able to ping the mgmt interface (10.65.0.75)  from lvs1017, cumin1002, and cumin2002. It's timin...
[20:41:19] <wikibugs>	 (03PS2) 10Arlolra: Add arbcom group to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97)
[20:41:31] <arlolra>	 EggRoll97: done
[20:42:07] <wikibugs>	 (03CR) 10EggRoll97: [C:03+1] Add arbcom group to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97)
[20:42:21] <EggRoll97>	 arlolra: thanks
[20:42:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97)
[20:43:23] <wikibugs>	 (03Merged) 10jenkins-bot: Add arbcom group to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97)
[20:43:35] <logmsgbot>	 !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1155945|Add arbcom group to ukwiki (T396668)]]
[20:43:40] <stashbot>	 T396668: Add user group arbcom to ukwiki - https://phabricator.wikimedia.org/T396668
[20:44:12] <wikibugs>	 (03PS6) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085
[20:44:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 (owner: 10CDobbins)
[20:45:07] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:45:30] <logmsgbot>	 !log arlolra@deploy1003 arlolra, eggroll97: Backport for [[gerrit:1155945|Add arbcom group to ukwiki (T396668)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:45:32] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage
[20:45:54] <wikibugs>	 (03PS7) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085
[20:46:26] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:46:33] <arlolra>	 EggRoll97: Is there anything you want to check on the test servers?
[20:46:49] <arlolra>	 I assume not
[20:46:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:47:05] <EggRoll97>	 arlolra: be anything, just a usergroup addition
[20:47:09] <EggRoll97>	 shouldnt be anything*
[20:47:14] <logmsgbot>	 !log arlolra@deploy1003 arlolra, eggroll97: Continuing with sync
[20:48:11] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:48:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:48:48] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage
[20:49:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) (owner: 10Gergő Tisza)
[20:49:29] <wikibugs>	 (03PS1) 10Btullis: Add our legacy archiva instance to kubernetes external_services [puppet] - 10https://gerrit.wikimedia.org/r/1159563 (https://phabricator.wikimedia.org/T392244)
[20:49:48] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:52:05] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5971/co" [puppet] - 10https://gerrit.wikimedia.org/r/1159563 (https://phabricator.wikimedia.org/T392244) (owner: 10Btullis)
[20:52:31] <wikibugs>	 (03PS1) 10Dzahn: site: move legacy miscweb VMs to insetup-role [puppet] - 10https://gerrit.wikimedia.org/r/1159564 (https://phabricator.wikimedia.org/T397080)
[20:54:12] <logmsgbot>	 !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155945|Add arbcom group to ukwiki (T396668)]] (duration: 10m 36s)
[20:54:16] <stashbot>	 T396668: Add user group arbcom to ukwiki - https://phabricator.wikimedia.org/T396668
[20:54:26] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[20:55:05] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1159563 (https://phabricator.wikimedia.org/T392244) (owner: 10Btullis)
[20:55:25] <arlolra>	 ebernhardson: You're up
[20:55:51] <ebernhardson>	 kk
[20:56:24] <arlolra>	 Do you want me to do it?
[20:56:36] <ebernhardson>	 sure
[20:56:39] <icinga-wm>	 PROBLEM - Host thanos-be2006 is DOWN: PING CRITICAL - Packet loss = 100%
[20:56:45] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:56:57] <icinga-wm>	 RECOVERY - Host thanos-be2006 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms
[20:57:08] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[20:57:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920555 (VRiley-WMF) Okay, I found the problem (I pinged the incorrect IP) I set the IP address on the iDRAC to the one listed in netbox. I just tested out the ping and it s...
[20:57:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159520 (https://phabricator.wikimedia.org/T262612) (owner: Ebernhardson)
[20:59:00] <wikibugs>	 (Merged) jenkins-bot: Turn off glent m1 AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/1159520 (https://phabricator.wikimedia.org/T262612) (owner: Ebernhardson)
[20:59:12] <wikibugs>	 (PS6) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309)
[20:59:14] <logmsgbot>	 !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1159520|Turn off glent m1 AB test (T262612)]]
[20:59:18] <stashbot>	 T262612: Run an A/B test using suggestions generated using glent Method 1 - https://phabricator.wikimedia.org/T262612
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T2100).
[21:00:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-worker1185/1186 - jclark@cumin1002"
[21:00:35] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-worker1185/1186 - jclark@cumin1002"
[21:00:35] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:01:20] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:01:21] <logmsgbot>	 !log arlolra@deploy1003 ebernhardson, arlolra: Backport for [[gerrit:1159520|Turn off glent m1 AB test (T262612)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:02:00] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:02:00] <ebernhardson>	 arlolra: looks reasonable on the test servers
[21:02:08] <arlolra>	 Thanks
[21:02:19] <logmsgbot>	 !log arlolra@deploy1003 ebernhardson, arlolra: Continuing with sync
[21:02:29] <wikibugs>	 (PS1) Gergő Tisza: Revert "Add scrambled: password class" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360)
[21:02:51] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1185
[21:02:59] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1185
[21:03:05] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
[21:03:05] <wikibugs>	 (CR) Gergő Tisza: [C:-2] "need to wait a week for I8ea7234cf9b470bd180edfaedec31a3220a81bb4 to be fully deployed" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360) (owner: Gergő Tisza)
[21:03:14] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
[21:03:37] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:03:46] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <INSERT LDAP GROUP> for <INSERT USERNAME> - https://phabricator.wikimedia.org/T397099 (DerHexer) NEW
[21:04:45] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <INSERT LDAP GROUP> for DerHexer - https://phabricator.wikimedia.org/T397099#10920591 (DerHexer)
[21:04:56] <logmsgbot>	 !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[21:05:06] <logmsgbot>	 !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[21:05:06] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:05:48] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <INSERT LDAP GROUP> for DerHexer - https://phabricator.wikimedia.org/T397099#10920594 (Astinson) DerHexer is a long-trusted Steward that wants access to some of the data that is available through Central Notice, he has an existing NDA with the Foundation.
[21:06:06] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2006-dev.codfw.wmnet with OS bullseye
[21:09:07] <logmsgbot>	 !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159520|Turn off glent m1 AB test (T262612)]] (duration: 09m 53s)
[21:09:11] <stashbot>	 T262612: Run an A/B test using suggestions generated using glent Method 1 - https://phabricator.wikimedia.org/T262612
[21:10:01] <bvibber>	 anything still need backports or am i free to sneak in a config patch?
[21:10:51] <arlolra>	 I think we're done.  We've bled into the security deployment window, not sure if that's needed today or not though
[21:10:51] <sbassett>	 Hey all - would like to deploy 2 security patches during the window.  Has the backport window wrapped up?
[21:11:21] <icinga-wm>	 PROBLEM - Host thanos-be2006 is DOWN: PING CRITICAL - Packet loss = 100%
[21:12:16] <arlolra>	 sbassett, bvibber: I will leave it to you to sort out
[21:13:02] <bvibber>	 go for it
[21:13:18] <bvibber>	 mine's an experimental feature we're rolling out wider for testing, no rush on it :)
[21:13:37] <wikibugs>	 (PS2) Btullis: Add our legacy archiva instance to kubernetes external_services [puppet] - https://gerrit.wikimedia.org/r/1159563 (https://phabricator.wikimedia.org/T392244)
[21:15:50] <wikibugs>	 (PS1) Btullis: Allow blunderbuss to contact archiva [deployment-charts] - https://gerrit.wikimedia.org/r/1159579 (https://phabricator.wikimedia.org/T392244)
[21:15:56] <sbassett>	 bvibber: sounds good, thanks.  this should really only take 20 mins or so, so there’d be time after I’d be happy to turn back over to you.
[21:16:12] <bvibber>	 awesome :)
[21:16:47] <wikibugs>	 (PS1) Tchanders: Revert "Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159580
[21:17:09] <wikibugs>	 (PS2) Tchanders: Revert "Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159580 (https://phabricator.wikimedia.org/T376315)
[21:18:51] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10920618 (Astinson)
[21:20:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:20:24] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:20:41] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:21:31] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:21:32] <wikibugs>	 (PS7) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309)
[21:21:58] <wikibugs>	 (CR) CI reject: [V:-1] centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[21:26:46] <wikibugs>	 (PS8) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309)
[21:27:11] <wikibugs>	 (CR) CI reject: [V:-1] centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[21:28:20] <wikibugs>	 (PS9) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309)
[21:28:45] <wikibugs>	 (CR) CI reject: [V:-1] centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[21:30:45] <wikibugs>	 (PS10) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309)
[21:31:10] <wikibugs>	 (CR) CI reject: [V:-1] centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[21:32:36] <sbassett>	 !log Deployed security fix for T396946
[21:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:43] <wikibugs>	 (CR) Aleksandar Mastilovic: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1159579 (https://phabricator.wikimedia.org/T392244) (owner: Btullis)
[21:37:29] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:37:54] <wikibugs>	 (PS11) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309)
[21:38:54] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:39:12] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:40:04] <wikibugs>	 (CR) CI reject: [V:-1] centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[21:41:52] <sbassett>	 bvibber: ok, all done.  feel free to use the rest of the sec deployment window.
[21:42:17] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:42:52] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:45:10] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:45:26] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:46:34] <bvibber>	 tx
[21:46:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159559 (https://phabricator.wikimedia.org/T388616) (owner: Bvibber)
[21:47:02] <wikibugs>	 (PS12) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309)
[21:47:34] <wikibugs>	 (Merged) jenkins-bot: Quiet test rollout of Lua transforms for Charts [mediawiki-config] - https://gerrit.wikimedia.org/r/1159559 (https://phabricator.wikimedia.org/T388616) (owner: Bvibber)
[21:51:12] <bvibber>	 anybody know what's failing with the scap?
[21:51:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye
[21:51:31] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10920716 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2006.codfw.wmnet with OS bu...
[21:52:04] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic, Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920717 (BCornwall) Thank you!
[21:53:47] <wikibugs>	 (PS1) BCornwall: Revert^2 "hiera: Add lvs1016 to high-traffic1" [puppet] - https://gerrit.wikimedia.org/r/1159592
[21:54:37] <logmsgbot>	 !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1159559|Quiet test rollout of Lua transforms for Charts (T388616)]]
[21:54:41] <stashbot>	 T388616: Expose Data: Lua filter interface to Charts via the .chart format setup - https://phabricator.wikimedia.org/T388616
[21:55:39] <bvibber>	 (we found it)
[21:56:34] <logmsgbot>	 !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1159559|Quiet test rollout of Lua transforms for Charts (T388616)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:57:31] <logmsgbot>	 !log bvibber@deploy1003 bvibber: Continuing with sync
[21:58:05] <icinga-wm>	 RECOVERY - Host thanos-be2006 is UP: PING WARNING - Packet loss = 71%, RTA = 37.23 ms
[22:00:43] <wikibugs>	 (PS13) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309)
[22:04:17] <wikibugs>	 (PS2) Ryan Kemper: wdqs: fork SLOs for wdqs-main and wdqs-scholarly [puppet] - https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966)
[22:04:56] <wikibugs>	 (CR) Ryan Kemper: wdqs: fork SLOs for wdqs-main and wdqs-scholarly (8 comments) [puppet] - https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[22:05:00] <logmsgbot>	 !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159559|Quiet test rollout of Lua transforms for Charts (T388616)]] (duration: 10m 22s)
[22:05:05] <stashbot>	 T388616: Expose Data: Lua filter interface to Charts via the .chart format setup - https://phabricator.wikimedia.org/T388616
[22:06:06] <logmsgbot>	 jhancock@cumin2002 reimage (PID 676792) is awaiting input
[22:07:37] <wikibugs>	 (PS3) Ryan Kemper: wdqs: fork SLOs for wdqs-main and wdqs-scholarly [puppet] - https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966)
[22:08:43] <bvibber>	 all done
[22:09:13] <wikibugs>	 (CR) Ryan Kemper: "Also added the comment to explain the queries in plain english, hope it makes some sort of sense :P" [puppet] - https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[22:11:54] <wikibugs>	 (PS14) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309)
[22:12:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.448s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:14:56] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1185
[22:15:04] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1185
[22:15:10] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
[22:15:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
[22:16:12] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:17:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.272s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:20:33] <wikibugs>	 (CR) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[22:21:34] <wikibugs>	 (CR) Andrea Denisse: centrallog: Add a temporary rsyslog debug config file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse)
[22:34:50] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:35:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
[22:35:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10920824 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1185.eqiad.wmnet with OS b...
[22:42:32] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10920833 (IAckerman-WMF) I support DerHexer's NDA LDAP access so they can evaluate their fundraising banner performance.
[22:48:55] <wikibugs>	 (PS1) Arlolra: Undeploy VipsScaler [mediawiki-config] - https://gerrit.wikimedia.org/r/1159599 (https://phabricator.wikimedia.org/T290759)
[22:51:34] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
[22:51:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10920858 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls...
[22:55:41] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
[22:55:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10920878 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1185.eqiad.wmnet with OS b...
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T2300)
[23:00:10] <jinxer-wm>	 FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[23:05:10] <jinxer-wm>	 RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure
[23:10:35] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1185.eqiad.wmnet with reason: host reimage
[23:14:17] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1185.eqiad.wmnet with reason: host reimage
[23:31:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
[23:31:26] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:31:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:31:56] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10920931 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host thanos-be2006.codfw.wmnet with OS bullse...
[23:31:57] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1185.eqiad.wmnet with OS bullseye
[23:32:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10920932 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls...
[23:36:37] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10920933 (Jhancock.wm) @MatthewVernon I have tried cryptographically wiping the drives but I still can't get a puppet run to complete on these two...
[23:38:31] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1159608
[23:38:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1159608 (owner: TrainBranchBot)
[23:47:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:50:14] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1159608 (owner: TrainBranchBot)
[23:56:56] <wikibugs>	 (PS1) Bartosz Dziewoński: Simplify $wgContactConfig required checkboxes validation [mediawiki-config] - https://gerrit.wikimedia.org/r/1159610