[00:05:10] FIRING: [2x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:07:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1158932
[00:07:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1158932 (owner: TrainBranchBot)
[00:10:12] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:22:45] FIRING: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:27:37] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1158932 (owner: TrainBranchBot)
[00:52:45] FIRING: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:57:45] RESOLVED: [6x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:33:29] ACKNOWLEDGEMENT - MD RAID on aqs1012 is CRITICAL: CRITICAL: State: degraded, Active: 8, Working: 8, Failed: 4, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T396970 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:33:38] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970 (ops-monitoring-bot) NEW
[02:01:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[02:06:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[02:09:12] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:13:03] SRE, SRE Observability: monitoring ACKs should be delivered via SMS - https://phabricator.wikimedia.org/T396894#10916279 (lmata)
[02:22:18] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:25:18] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:31:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:57:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:14:28] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:17:28] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:05:10] FIRING: [2x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:29] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) (owner: Anzx)
[04:14:28] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:17:28] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:52:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2161.codfw.wmnet with reason: Maintenance
[04:57:31] (PS1) Marostegui: db2204: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1159067 (https://phabricator.wikimedia.org/T396549)
[04:57:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2204 T396549', diff saved to https://phabricator.wikimedia.org/P77957 and previous config saved to /var/cache/conftool/dbconfig/20250616-045738-marostegui.json
[04:57:43] T396549: Migrate s2 to MariaDB 10.11 - https://phabricator.wikimedia.org/T396549
[04:58:07] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2204.codfw.wmnet with reason: Maintenance
[04:58:13] (CR) Marostegui: [C:+2] db2204: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1159067 (https://phabricator.wikimedia.org/T396549) (owner: Marostegui)
[05:01:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77958 and previous config saved to /var/cache/conftool/dbconfig/20250616-050139-root.json
[05:02:52] (PS1) Gerrit maintenance bot: mariadb: Promote db2204 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1159072 (https://phabricator.wikimedia.org/T396976)
[05:06:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[05:06:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[05:06:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T396130)', diff saved to https://phabricator.wikimedia.org/P77959 and previous config saved to /var/cache/conftool/dbconfig/20250616-050637-marostegui.json
[05:06:41] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77960 and previous config saved to /var/cache/conftool/dbconfig/20250616-051644-root.json
[05:20:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:25:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T396130)', diff saved to https://phabricator.wikimedia.org/P77961 and previous config saved to /var/cache/conftool/dbconfig/20250616-052530-marostegui.json
[05:25:34] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[05:29:26] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10916392 (Stevemunene) Did the raid) config with ` stevemunene@an-worker1157:~$ sudo perccli64 /c0 add vd each r0 w...
[05:31:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77962 and previous config saved to /var/cache/conftool/dbconfig/20250616-053150-root.json
[05:33:37] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1160.eqiad.wmnet
[05:35:08] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1160.eqiad.wmnet
[05:35:43] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1162.eqiad.wmnet
[05:37:10] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1156741 (https://phabricator.wikimedia.org/T395871) (owner: Svantje Lilienthal)
[05:37:20] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10916401 (Stevemunene)
[05:37:23] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1162.eqiad.wmnet
[05:38:47] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1161.eqiad.wmnet
[05:40:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P77963 and previous config saved to /var/cache/conftool/dbconfig/20250616-054037-marostegui.json
[05:41:42] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1161.eqiad.wmnet
[05:42:19] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1161.eqiad.wmnet
[05:42:51] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1161.eqiad.wmnet
[05:43:31] (CR) Marostegui: [C:+1] Remove unused role [puppet] - https://gerrit.wikimedia.org/r/1156191 (owner: Muehlenhoff)
[05:43:50] (CR) Marostegui: [C:+1] conftool: Allow es6 and es7 being set to read only via dbctl [puppet] - https://gerrit.wikimedia.org/r/1153723 (https://phabricator.wikimedia.org/T395696) (owner: Ladsgroup)
[05:44:50] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10916407 (Stevemunene) Did thd raid config with ` stevemunene@an-worker1160:~$ sudo perccli64 /c0 add vd each r0 wb r...
[05:46:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77964 and previous config saved to /var/cache/conftool/dbconfig/20250616-054656-root.json
[05:48:58] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 151326
[05:49:12] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 151326
[05:55:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P77965 and previous config saved to /var/cache/conftool/dbconfig/20250616-055545-marostegui.json
[05:56:58] (PS1) Stevemunene: hdfs: Add group 7_8 remove group 9_10 hosts from cluster [puppet] - https://gerrit.wikimedia.org/r/1159102 (https://phabricator.wikimedia.org/T390174)
[05:57:57] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10916421 (Stevemunene)
[05:58:27] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10916422 (Stevemunene) a:Stevemunene
[05:58:50] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10916424 (Stevemunene) a:Stevemunene
[06:02:02] ops-codfw, SRE, Data-Platform-SRE, DC-Ops: cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#10916427 (ayounsi)
[06:02:49] ops-codfw, SRE, Data-Platform-SRE, DC-Ops: cirrussearch2079 iDRAC not working - https://phabricator.wikimedia.org/T396718#10916429 (ayounsi) I added the #data-platform-sre tag to the task, I think @bking was recently working on those hosts.
[06:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:10:10] (PS4) Stevemunene: zookeeper: onboard an-conf1006 to the cluster [puppet] - https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922)
[06:10:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T396130)', diff saved to https://phabricator.wikimedia.org/P77966 and previous config saved to /var/cache/conftool/dbconfig/20250616-061053-marostegui.json
[06:10:57] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:11:08] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[06:25:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[06:25:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T396130)', diff saved to https://phabricator.wikimedia.org/P77967 and previous config saved to /var/cache/conftool/dbconfig/20250616-062536-marostegui.json
[06:25:40] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:27:05] SRE-tools, Infrastructure-Foundations: decommission cookbook: add support for decom spreadsheet - https://phabricator.wikimedia.org/T244315#10916607 (Volans) Open→Resolved a:Volans Sounds good! Resolving this, happy to discuss further improvements whenever you want.
[06:31:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:41:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T396130)', diff saved to https://phabricator.wikimedia.org/P77968 and previous config saved to /var/cache/conftool/dbconfig/20250616-064117-marostegui.json
[06:41:22] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[06:46:08] (CR) Muehlenhoff: [C:+2] Remove unused role [puppet] - https://gerrit.wikimedia.org/r/1156191 (owner: Muehlenhoff)
[06:47:14] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts install7001.wikimedia.org
[06:47:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:50:20] jmm@cumin1003 decommission (PID 1791650) is awaiting input
[06:52:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:53:37] (PS1) Muehlenhoff: Remove obsolete keytab [labs/private] - https://gerrit.wikimedia.org/r/1159125
[06:54:21] (CR) Brouberol: zookeeper: onboard an-conf1006 to the cluster [puppet] - https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922) (owner: Stevemunene)
[06:54:54] (CR) Brouberol: [C:+1] hdfs: Add group 7_8 remove group 9_10 hosts from cluster [puppet] - https://gerrit.wikimedia.org/r/1159102 (https://phabricator.wikimedia.org/T390174) (owner: Stevemunene)
[06:55:41] !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[06:56:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77969 and previous config saved to /var/cache/conftool/dbconfig/20250616-065625-marostegui.json
[06:56:26] (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete keytab [labs/private] - https://gerrit.wikimedia.org/r/1159125 (owner: Muehlenhoff)
[06:57:05] (CR) Brouberol: [C:+1] Remove obsolete analytics_cluster::postgresql role and profile [puppet] - https://gerrit.wikimedia.org/r/1155720 (https://phabricator.wikimedia.org/T395557) (owner: Btullis)
[06:57:42] FIRING: JobUnavailable: Reduced availability for job squid in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:57:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:59:35] (PS9) Btullis: Airflow: Add local settings to enable the xcom_sidecar functionality [deployment-charts] - https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378)
[06:59:50] (CR) Brouberol: "This now requires a chart version bump" [deployment-charts] - https://gerrit.wikimedia.org/r/1154248 (https://phabricator.wikimedia.org/T388378) (owner: Btullis)
[07:00:00] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[07:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T0700).
[07:00:05] anzx and WMDE-Fisch: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:22] o/
[07:00:24] (PS4) Brouberol: mediawiki: define a dumps suspended CronJob [deployment-charts] - https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786)
[07:00:27] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[07:00:28] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:00:28] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install7001.wikimedia.org
[07:00:40] SRE, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10916711 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `install7001.wikimedia.org` - install7001.wikimedia.org (**PA...
[07:01:36] (PS3) Brouberol: airflow: set a low priority on all airflow task pods [deployment-charts] - https://gerrit.wikimedia.org/r/1156812 (https://phabricator.wikimedia.org/T395107)
[07:02:42] RESOLVED: JobUnavailable: Reduced availability for job squid in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:02:42] (PS4) Brouberol: airflow: set a low priority on all airflow task pods [deployment-charts] - https://gerrit.wikimedia.org/r/1156812 (https://phabricator.wikimedia.org/T395107)
[07:05:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2204 with weight 0 T396976', diff saved to https://phabricator.wikimedia.org/P77970 and previous config saved to /var/cache/conftool/dbconfig/20250616-070524-root.json
[07:05:29] T396976: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T396976
[07:05:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s2 T396976
[07:05:42] (CR) Brouberol: [C:+1] monitoring services: add migration task T384214 to instances [puppet] - https://gerrit.wikimedia.org/r/1155619 (https://phabricator.wikimedia.org/T395443) (owner: Tiziano Fogli)
[07:06:17] (CR) Marostegui: [C:+2] mariadb: Promote db2204 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1159072 (https://phabricator.wikimedia.org/T396976) (owner: Gerrit maintenance bot)
[07:06:29] moritzm: ok to merge?
[07:07:36] Anyone here that can deploy? :-)
[07:08:59] (CR) Brouberol: [C:+2] airflow: set a low priority on all airflow task pods [deployment-charts] - https://gerrit.wikimedia.org/r/1156812 (https://phabricator.wikimedia.org/T395107) (owner: Brouberol)
[07:09:11] Seems I'm not allowed access to that new tool :-/
[07:09:52] marostegui: sorry, yes please
[07:10:51] Doing it moritzm
[07:11:07] thx
[07:11:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77971 and previous config saved to /var/cache/conftool/dbconfig/20250616-071132-marostegui.json
[07:13:03] WMDE-Fisch: you have someone to deploy the patch?
[07:13:20] I deploy it now
[07:13:33] Nope, just wanted to poke adam but if you got a sec that would be nice
[07:13:40] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1156741 (https://phabricator.wikimedia.org/T395871) (owner: Svantje Lilienthal)
[07:14:23] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:14:29] (Merged) jenkins-bot: Enable sub-referencing on test wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1156741 (https://phabricator.wikimedia.org/T395871) (owner: Svantje Lilienthal)
[07:14:52] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1156741|Enable sub-referencing on test wiki (T395871)]]
[07:14:56] T395871: Enable sub-referencing on test wiki - https://phabricator.wikimedia.org/T395871
[07:15:03] Amir1: Thx! Any idea why I'm not granted access to that deployment interface with my account although I've got deployment rights? 🤔
[07:15:26] I have no idea, I'd say poke Tyler
[07:15:41] o/
[07:17:23] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:19:05] (PS1) Muehlenhoff: Update Cumin alias for Docker registry [puppet] - https://gerrit.wikimedia.org/r/1159291 (https://phabricator.wikimedia.org/T390251)
[07:20:43] (PS1) Muehlenhoff: Remove obsolete Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1159293 (https://phabricator.wikimedia.org/T395557)
[07:20:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:20:54] WMDE-Fisch: you can request access to the Spiderpig access group at https://idm.wikimedia.org/permissions/
[07:21:20] Amir1: I have one patch to add , will add it to calendar before you finish syncing above
[07:21:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[07:21:40] sure, this is going to be slow I think
[07:22:07] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdr) failed on ms-be2066 - https://phabricator.wikimedia.org/T395990#10916761 (MatthewVernon) Open→Resolved I've had a look, and this system looks good to me know (right number of filesystems of the right size, puppet happy, `swift-reco...
[07:23:39] (CR) Stevemunene: [C:+2] hdfs: Add group 7_8 remove group 9_10 hosts from cluster [puppet] - https://gerrit.wikimedia.org/r/1159102 (https://phabricator.wikimedia.org/T390174) (owner: Stevemunene)
[07:24:01] (PS2) Anzx: IP cap lift for event [mediawiki-config] - https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980)
[07:24:33] (PS1) Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - https://gerrit.wikimedia.org/r/1159297 (https://phabricator.wikimedia.org/T396875)
[07:25:10] (PS3) Anzx: IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980)
[07:25:23] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980) (owner: Anzx)
[07:25:38] Amir1: added patch
[07:26:36] SRE-swift-storage, Data-Persistence: ms-fe2015 is suffering intermittent errors on port 80 - https://phabricator.wikimedia.org/T396573#10916777 (MatthewVernon) Thanks @Ladsgroup :)
[07:26:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T396130)', diff saved to https://phabricator.wikimedia.org/P77973 and previous config saved to /var/cache/conftool/dbconfig/20250616-072640-marostegui.json
[07:26:44] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130
[07:26:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[07:27:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T396130)', diff saved to https://phabricator.wikimedia.org/P77974 and previous config saved to /var/cache/conftool/dbconfig/20250616-072702-marostegui.json
[07:27:13] Thx moritzm just requested access now :-)
[07:28:21] RECOVERY - Hadoop DataNode on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[07:28:41] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1160.eqiad.wmnet
[07:28:59] RECOVERY - Hadoop DataNode on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[07:29:05] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.06.13 - 2025.07.04), Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10916787 (ops-monitoring-bot) Host an-worker1160.eqiad.wmnet rebooted by stevemunene@cumin1002 w...
[07:29:11] RECOVERY - Hadoop DataNode on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[07:29:22] the image is still being built
[07:29:28] !log Starting s2 codfw failover from db2207 to db2204 - T396976
[07:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:31] T396976: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T396976
[07:29:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2204 to s2 primary T396976', diff saved to https://phabricator.wikimedia.org/P77975 and previous config saved to /var/cache/conftool/dbconfig/20250616-072955-root.json
[07:30:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2207 T396976', diff saved to https://phabricator.wikimedia.org/P77976 and previous config saved to /var/cache/conftool/dbconfig/20250616-073045-marostegui.json
[07:31:21] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:31:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2207.codfw.wmnet with reason: Maintenance
[07:33:51] (CR) Volans: [C:+1] "LGTM, thx" [puppet] - https://gerrit.wikimedia.org/r/1159291 (https://phabricator.wikimedia.org/T390251) (owner: Muehlenhoff)
[07:34:21] PROBLEM - Hadoop NodeManager on an-worker1161 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:35:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1152558 (https://phabricator.wikimedia.org/T380930) (owner: KartikMistry)
[07:35:49] !log ladsgroup@deploy1003 lilients, ladsgroup: Backport for [[gerrit:1156741|Enable sub-referencing on test wiki (T395871)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:35:53] T395871: Enable sub-referencing on test wiki - https://phabricator.wikimedia.org/T395871
[07:36:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet)
[07:36:21] (CR) Volans: [C:+1] "Although not assigned to any host I see the role is still there. Is is obsolete and to be removed or there is some maintenance and will re" [puppet] - https://gerrit.wikimedia.org/r/1159293 (https://phabricator.wikimedia.org/T395557) (owner: Muehlenhoff)
[07:36:25] (PS1) Marostegui: db2207: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1159302 (https://phabricator.wikimedia.org/T396976)
[07:37:20] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:37:47] WMDE-Fisch: it's on test servers
[07:37:50] please test
[07:37:58] (CR) Marostegui: [C:+2] db2207: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1159302 (https://phabricator.wikimedia.org/T396976) (owner: Marostegui)
[07:38:03] (PS9) Brouberol: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: Aqu)
[07:38:05] k
[07:38:32] (PS1) Vgutierrez: hiera: Switch lvs7001 to katran [puppet] - https://gerrit.wikimedia.org/r/1159303 (https://phabricator.wikimedia.org/T396561)
[07:40:37] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:40:52] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1159303 (https://phabricator.wikimedia.org/T396561) (owner: Vgutierrez)
[07:40:57] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:40:59] PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:41:03] PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:41:05] PROBLEM - Hadoop NodeManager on an-worker1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:41:11] PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:41:15] (PS1) Muehlenhoff: offboard-user: Run a shallow git clone [puppet] - https://gerrit.wikimedia.org/r/1159328
[07:41:15] (PS1) Muehlenhoff: Remove SSH key for aarora [puppet] - https://gerrit.wikimedia.org/r/1159329
[07:41:21] PROBLEM - Hadoop NodeManager on an-worker1175 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:41:57] RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:42:08] Amir1: Hmm I don't get it on the test server. But it should be working. Might be caching involved...
[07:42:18] Please go on.
[07:42:25] !log ladsgroup@deploy1003 lilients, ladsgroup: Continuing with sync [07:42:29] okay [07:43:25] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:43:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T396130)', diff saved to https://phabricator.wikimedia.org/P77977 and previous config saved to /var/cache/conftool/dbconfig/20250616-074346-marostegui.json [07:43:52] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [07:44:07] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1160.eqiad.wmnet [07:44:34] (03PS10) 10Brouberol: Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [07:45:21] Ah now it's working on the test servers so all good.
:-) [07:45:21] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:45:23] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:45:38] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1161.eqiad.wmnet [07:46:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10916837 (10ops-monitoring-bot) Host an-worker1161.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting... [07:47:22] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: add memcached-based index caching to store [puppet] - 10https://gerrit.wikimedia.org/r/1156341 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [07:47:25] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1159328 (owner: 10Muehlenhoff) [07:47:32] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: trial store memcache on titan[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/1156342 (https://phabricator.wikimedia.org/T394319) (owner: 10Filippo Giunchedi) [07:47:36] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10916847 (10Anton.Kokh) @KFrancis thank you, I just signed it! 
[07:48:59] RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:49:12] (03PS1) 10Vgutierrez: Revert^3 "acmechief: Add pywikipedia.org to the cert list" [puppet] - 10https://gerrit.wikimedia.org/r/1159351 [07:49:40] (03PS2) 10Muehlenhoff: offboard-user: Run a shallow git clone [puppet] - 10https://gerrit.wikimedia.org/r/1159328 [07:50:21] RECOVERY - Hadoop NodeManager on an-worker1161 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:50:39] (03CR) 10Muehlenhoff: [C:03+2] offboard-user: Run a shallow git clone [puppet] - 10https://gerrit.wikimedia.org/r/1159328 (owner: 10Muehlenhoff) [07:51:11] 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10916852 (10Vgutierrez) 05Resolved→03Open acme-chief is still unable to issue certificates for this domain: `lang=json { "identifier": { "type": "dns", "value": "pywikipe... 
[07:51:11] RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:51:19] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:51:31] (03CR) 10Vgutierrez: [C:03+2] Revert^3 "acmechief: Add pywikipedia.org to the cert list" [puppet] - 10https://gerrit.wikimedia.org/r/1159351 (owner: 10Vgutierrez) [07:51:59] PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:53:20] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1161.eqiad.wmnet [07:53:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [07:54:01] (03CR) 10Ladsgroup: "Yeah, I can run the script on all wikis to clean them up." 
[puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [07:54:11] PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:54:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [07:55:04] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1149-1153].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 9 and 10 [07:55:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10916871 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1e2de4df-1e1e-43b0-ba8... 
[07:55:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [07:55:44] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156741|Enable sub-referencing on test wiki (T395871)]] (duration: 40m 51s) [07:55:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [07:55:47] T395871: Enable sub-referencing on test wiki - https://phabricator.wikimedia.org/T395871 [07:55:51] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1175-1176].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 9 and 10 [07:55:53] WMDE-Fisch: deployed [07:55:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10916875 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=95b11ba7-0512-4582-810... [07:56:17] jouncebot: now and next [07:56:17] For the next 0 hour(s) and 3 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T0700) [07:56:25] Amir1: mine both can sync at once [07:56:29] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1162.eqiad.wmnet [07:56:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [07:56:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10916880 (10ops-monitoring-bot) Host an-worker1162.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting... 
[07:57:21] anzx: the to is wrong it's in the past https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1159292/3/wmf-config/throttle.php [07:57:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [07:57:26] (03CR) 10Ladsgroup: [C:03+2] mrwiki: add मसूदा (draft) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) (owner: 10Anzx) [07:57:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) (owner: 10Anzx) [07:58:02] (03PS4) 10Anzx: IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980) [07:58:03] I deploy the mrwiki patch, it should be much faster now [07:58:31] Amir1: thanks fixed date [07:58:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77978 and previous config saved to /var/cache/conftool/dbconfig/20250616-075855-marostegui.json [07:59:01] (03Merged) 10jenkins-bot: mrwiki: add मसूदा (draft) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156092 (https://phabricator.wikimedia.org/T396551) (owner: 10Anzx) [07:59:12] (03PS1) 10Muehlenhoff: Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263) [07:59:16] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1156092|mrwiki: add मसूदा (draft) namespace (T396551)]] [07:59:20] T396551: Add new namespace मसूदा on mrwiki (with specific edit/move group restrictions) - https://phabricator.wikimedia.org/T396551 [07:59:39] (03PS2) 10Muehlenhoff: Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 
(https://phabricator.wikimedia.org/T394263) [07:59:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [08:00:06] RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:00:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [08:01:22] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:02:06] RECOVERY - Hadoop NodeManager on an-worker1152 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:02:38] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:03:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [08:03:28] (03CR) 10Fabfur: [C:03+1] "good job and godspeed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1159303 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [08:03:39] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: write SPDX header to stack config on save [puppet] - 10https://gerrit.wikimedia.org/r/1156781 (owner: 10Filippo Giunchedi) [08:03:47] !log ladsgroup@deploy1003 ladsgroup, anzx: Backport for [[gerrit:1156092|mrwiki: add मसूदा (draft) namespace (T396551)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:03:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [08:04:01] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1162.eqiad.wmnet [08:04:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [08:04:48] checking [08:04:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [08:04:58] (03CR) 10Jelto: miscweb: add os-reports update mechanism (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154866 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [08:05:11] FIRING: [2x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:13] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1157.eqiad.wmnet [08:05:33] Amir1: namespace appears, ok to continue [08:05:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10916910 (10ops-monitoring-bot) Host an-worker1157.eqiad.wmnet rebooted by stevemunene@cumin1002 
with reason: Rebooting... [08:05:38] !log ladsgroup@deploy1003 ladsgroup, anzx: Continuing with sync [08:06:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [08:06:48] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [08:08:22] RECOVERY - Hadoop NodeManager on an-worker1175 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:09:53] (03CR) 10Ayounsi: [C:03+1] Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [08:10:10] FIRING: [2x] SystemdUnitFailed: dhcp-helper.service on ganeti7001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:13:08] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1157.eqiad.wmnet [08:13:26] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1158.eqiad.wmnet [08:13:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10916918 (10ops-monitoring-bot) Host an-worker1158.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting... 
[08:14:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77979 and previous config saved to /var/cache/conftool/dbconfig/20250616-081402-marostegui.json [08:14:07] (03PS3) 10Filippo Giunchedi: thanos: enforce series limit for sidecar [puppet] - 10https://gerrit.wikimedia.org/r/1155190 (https://phabricator.wikimedia.org/T394318) [08:14:07] (03PS1) 10Filippo Giunchedi: thanos: enable tracing for rule [puppet] - 10https://gerrit.wikimedia.org/r/1159355 (https://phabricator.wikimedia.org/T394318) [08:14:28] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156092|mrwiki: add मसूदा (draft) namespace (T396551)]] (duration: 15m 11s) [08:14:32] T396551: Add new namespace मसूदा on mrwiki (with specific edit/move group restrictions) - https://phabricator.wikimedia.org/T396551 [08:14:41] Amir1: please run namespaceDupes.php for mrwiki [08:15:04] (03Abandoned) 10Alexandros Kosiaris: Switch canaries to 0.1% OpenTelemetry sampling [puppet] - 10https://gerrit.wikimedia.org/r/984814 (https://phabricator.wikimedia.org/T351566) (owner: 10Alexandros Kosiaris) [08:15:16] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Remove old docker_registry_ha hiera keys [labs/private] - 10https://gerrit.wikimedia.org/r/1156762 (https://phabricator.wikimedia.org/T390251) (owner: 10Alexandros Kosiaris) [08:16:26] I will [08:16:44] (03CR) 10Alexandros Kosiaris: [C:03+2] "Sigh, missed that in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1154302, I only removed the old profile and didn't rename this o" [puppet] - 10https://gerrit.wikimedia.org/r/1159291 (https://phabricator.wikimedia.org/T390251) (owner: 10Muehlenhoff) [08:17:09] (03CR) 10Silvan Heintze: [C:03+1] "Thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1159297 (https://phabricator.wikimedia.org/T396875) (owner: 10Jakob) [08:17:41] (03PS3) 10Muehlenhoff: Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263) [08:18:05] (03PS4) 10Muehlenhoff: Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263) [08:18:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980) (owner: 10Anzx) [08:18:52] RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:18:55] (03CR) 10Ayounsi: [C:03+1] Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [08:19:07] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin depooling P{lvs7001.magru.wmnet} and A:liberica (T396561) [08:19:11] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [08:19:15] (03Merged) 10jenkins-bot: IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159292 (https://phabricator.wikimedia.org/T396980) (owner: 10Anzx) [08:19:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77980 and previous config saved to /var/cache/conftool/dbconfig/20250616-081922-root.json [08:19:30] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1159292|IP cap lift for wikipedia workshop - cs.wikipedia 
on 19June2025 (T396980)]] [08:19:34] T396980: Lift IP cap on 2025-06-19 for Wikipedia workshop - cs.wikipedia - https://phabricator.wikimedia.org/T396980 [08:19:47] (03CR) 10Muehlenhoff: [C:03+2] Reimage ganeti7003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159354 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [08:20:30] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7001.magru.wmnet} and A:liberica (T396561) [08:20:44] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1158.eqiad.wmnet [08:21:20] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs7001 to katran [puppet] - 10https://gerrit.wikimedia.org/r/1159303 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [08:21:21] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1159.eqiad.wmnet [08:21:24] !log ladsgroup@deploy1003 anzx, ladsgroup: Backport for [[gerrit:1159292|IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 (T396980)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:21:43] Amir1: nothing to test , ok to sync [08:21:52] yup [08:21:52] moritzm: ok to merge Reimage ganeti7003 with insetup role (368d9a4b17)? [08:21:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10916951 (10ops-monitoring-bot) Host an-worker1159.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting... 
[08:22:42] !log ladsgroup@deploy1003 anzx, ladsgroup: Continuing with sync [08:23:24] (03Abandoned) 10Alexandros Kosiaris: Rename docker_registry_ha's occurrences to docker_registry [labs/private] - 10https://gerrit.wikimedia.org/r/1155601 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [08:27:52] (03CR) 10Jakob: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159297 (https://phabricator.wikimedia.org/T396875) (owner: 10Jakob) [08:28:05] (03PS1) 10Marostegui: db1254: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159357 (https://phabricator.wikimedia.org/T396549) [08:28:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1254 T396549', diff saved to https://phabricator.wikimedia.org/P77981 and previous config saved to /var/cache/conftool/dbconfig/20250616-082841-marostegui.json [08:28:46] marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[08:28:46] T396549: Migrate s2 to MariaDB 10.11 - https://phabricator.wikimedia.org/T396549 [08:28:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1254.eqiad.wmnet with reason: Maintenance [08:28:48] (03CR) 10Btullis: [C:03+1] mediawiki: define a dumps suspended CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [08:29:02] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1159.eqiad.wmnet [08:29:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T396130)', diff saved to https://phabricator.wikimedia.org/P77982 and previous config saved to /var/cache/conftool/dbconfig/20250616-082910-marostegui.json [08:29:14] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: allow the airflow service account to query CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156830 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [08:29:14] (03CR) 10Marostegui: [C:03+2] db1254: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159357 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui) [08:29:14] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:29:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [08:29:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T396130)', diff saved to https://phabricator.wikimedia.org/P77983 and previous config saved to /var/cache/conftool/dbconfig/20250616-082933-marostegui.json [08:29:44] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159292|IP cap lift for wikipedia workshop - cs.wikipedia on 19June2025 (T396980)]] (duration: 10m 13s) [08:29:48] 
T396980: Lift IP cap on 2025-06-19 for Wikipedia workshop - cs.wikipedia - https://phabricator.wikimedia.org/T396980 [08:29:52] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159297 (https://phabricator.wikimedia.org/T396875) (owner: 10Jakob) [08:30:00] Amir1: Thanks for deploying & please run `mwscript-k8s --comment='T396980' --follow resetAuthenticationThrottle.php --wiki=cswiki --signup --ip 78.128.191.240` https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold [08:30:20] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [08:30:30] sure [08:30:35] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [08:31:06] PROBLEM - Hadoop NodeManager on an-worker1166 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:31:10] done [08:31:22] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [08:31:25] thanks [08:31:27] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs7001.magru.wmnet with reason: switching to katran [08:31:38] !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [08:32:06] RECOVERY - Hadoop NodeManager on an-worker1166 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:32:17] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10917019 (10MoritzMuehlenhoff) [08:32:23] https://www.irccloud.com/pastebin/8FQ8lWSz/ 
[08:32:28] mrwiki [08:32:34] anzx: ^ [08:33:01] thanks [08:33:04] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:34:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1254 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P77984 and previous config saved to /var/cache/conftool/dbconfig/20250616-083419-root.json [08:34:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P77985 and previous config saved to /var/cache/conftool/dbconfig/20250616-083428-root.json [08:35:22] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [08:35:38] !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [08:36:05] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti7003.magru.wmnet with OS bookworm [08:37:04] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:38:22] (03PS1) 10Majavah: policies: Rename cr-labs -> cr-cloud-hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1159360 [08:39:45] (03PS4) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) [08:40:52] (03Abandoned) 10Ayounsi: Rename labs and cloud filters [homer/public] - 10https://gerrit.wikimedia.org/r/767476 (owner: 10Ayounsi) [08:42:14] (03PS5) 10Gkyziridis: ores-extension: enable extension with revertrisk filter for the 
third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) [08:43:48] (03CR) 10Ayounsi: [C:03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/1159360 (owner: 10Majavah) [08:43:50] RECOVERY - Disk space on stat1011 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1011&var-datasource=eqiad+prometheus/ops [08:44:06] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner1003.eqiad.wmnet with OS bookworm [08:44:49] (03CR) 10Majavah: [C:03+2] policies: Rename cr-labs -> cr-cloud-hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1159360 (owner: 10Majavah) [08:45:22] (03Merged) 10jenkins-bot: policies: Rename cr-labs -> cr-cloud-hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1159360 (owner: 10Majavah) [08:46:43] (03CR) 10Tiziano Fogli: [C:03+2] pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [08:47:24] (03CR) 10Aqu: [C:03+1] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:48:37] !log cr policy: rename cr-labs to cr-cloud-hosts (https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1159360) [08:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T396130)', diff saved to https://phabricator.wikimedia.org/P77986 and previous config saved to /var/cache/conftool/dbconfig/20250616-084907-marostegui.json [08:49:11] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [08:49:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1254 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77987 and previous config saved to /var/cache/conftool/dbconfig/20250616-084925-root.json [08:49:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77988 and previous config saved to /var/cache/conftool/dbconfig/20250616-084934-root.json [08:50:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:51:37] !log zabe@deploy1003:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php wikidatawiki --delete /home/zabe/text_table_cleanup/wikidatawiki --sleep 0.5 # T183490 [08:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:42] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [08:54:33] !log depooling ncredir7003 [08:54:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:40] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7003.magru.wmnet with reason: host reimage [08:58:57] !log repool ncredir7003 [08:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:27] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage [09:00:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Add db1252', diff saved to https://phabricator.wikimedia.org/P77989 and previous config saved to /var/cache/conftool/dbconfig/20250616-090058-fceratto.json [09:01:38] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7003.magru.wmnet with reason: host reimage [09:02:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10917315 (10Stevemunene) 05Open→03Resolved Hosts are back online rejoining the cluster {F62348242} [09:03:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 8 - rack E5) - https://phabricator.wikimedia.org/T390175#10917321 (10Stevemunene) [09:04:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10917325 (10Stevemunene) 05Open→03Resolved Hosts are back online and rejoining the cluster {F62348266} [09:04:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77990 and previous config saved to /var/cache/conftool/dbconfig/20250616-090414-marostegui.json [09:04:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB 
to 8TB (group 7 - rack E6) - https://phabricator.wikimedia.org/T390174#10917332 (10Stevemunene) [09:04:26] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:04:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1254 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77991 and previous config saved to /var/cache/conftool/dbconfig/20250616-090431-root.json [09:04:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1003.eqiad.wmnet with reason: host reimage [09:04:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77992 and previous config saved to /var/cache/conftool/dbconfig/20250616-090439-root.json [09:06:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004 (10WMDE-leszek) 03NEW [09:07:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10917359 (10WMDE-leszek) I figure @AndyRussG_volunteer also needs to be added to `nda` LDAP group. I believe their account has been there, so maybe there's still a trace of N... [09:10:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10917362 (10WMDE-leszek) Me having opened this request does indicate that I approve this request on WMDE's end. 
[09:10:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1252* slowly with 10 steps - Pooling in [09:11:36] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10917364 (10WMDE-leszek) [09:12:33] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1252* slowly with 10 steps - Pooling in [09:14:59] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1252* slowly with 10 steps - Pooling in [09:18:24] (03PS1) 10Vgutierrez: hiera: Repool lvs7001 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1159374 (https://phabricator.wikimedia.org/T396561) [09:18:46] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159374 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [09:19:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77995 and previous config saved to /var/cache/conftool/dbconfig/20250616-091921-marostegui.json [09:19:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1254 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77996 and previous config saved to /var/cache/conftool/dbconfig/20250616-091936-root.json [09:20:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:23:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1003.eqiad.wmnet with OS bookworm [09:23:27] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7003.magru.wmnet with OS bookworm [09:26:11] (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs7001 with katran [puppet] - 10https://gerrit.wikimedia.org/r/1159374 
(https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [09:26:51] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs7001.magru.wmnet [09:26:52] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs7001.magru.wmnet [09:27:10] (03PS1) 10Filippo Giunchedi: hadoop: remove check_procs based alerts in favor of SystemdUnitFailed [puppet] - 10https://gerrit.wikimedia.org/r/1159385 (https://phabricator.wikimedia.org/T357099) [09:30:59] !log repool lvs7001 using katran as forwarding plane - T396561 [09:31:03] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs7001.magru.wmnet} and A:liberica [09:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:03] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [09:31:21] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs7001.magru.wmnet} and A:liberica [09:31:52] !log zabe@deploy1003:~$ mwscript extensions/AbuseFilter/maintenance/MigrateESRefToAflTable.php wikidatawiki --deletedump /home/zabe/afl_text_table_deletedump/wikidatawiki --dump /home/zabe/afl_text_table_dump/wikidatawiki --sleep 0.5 # T381599 [09:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:56] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599 [09:34:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T396130)', diff saved to https://phabricator.wikimedia.org/P77998 and previous config saved to /var/cache/conftool/dbconfig/20250616-093429-marostegui.json [09:34:33] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:34:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1254 (re)pooling 
@ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77999 and previous config saved to /var/cache/conftool/dbconfig/20250616-093442-root.json [09:34:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1192.eqiad.wmnet with reason: Maintenance [09:34:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T396130)', diff saved to https://phabricator.wikimedia.org/P78000 and previous config saved to /var/cache/conftool/dbconfig/20250616-093451-marostegui.json [09:36:24] (03PS2) 10Filippo Giunchedi: prometheus: move blackbox-exporter to log prober errors [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) [09:37:16] (03CR) 10Filippo Giunchedi: "> Toolforge seems to be using 0.26, but the metricsinfra servers are still on bullseye / 0.18.0+ds-3+b2." [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi) [09:37:29] (03CR) 10Filippo Giunchedi: [C:03+1] prometheus: move blackbox-exporter to log prober errors [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi) [09:39:03] (03PS1) 10Zabe: wikidatawiki: Increase revision-slots cache back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159388 (https://phabricator.wikimedia.org/T183490) [09:40:36] (03PS1) 10Filippo Giunchedi: pontoon: keep netbox-hiera updated [puppet] - 10https://gerrit.wikimedia.org/r/1159389 [09:41:53] (03PS1) 10Ayounsi: Routed Ganeti: disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) [09:42:08] jouncebot: nowandnext [09:42:08] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [09:42:08] In 0 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1000) [09:42:46] 
(03CR) 10Zabe: [C:03+2] Stop setting $wgPageLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1158804 (https://phabricator.wikimedia.org/T299947) (owner: 10Zabe) [09:42:54] (03CR) 10Zabe: [C:03+2] wikidatawiki: Increase revision-slots cache back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159388 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [09:44:09] !log remove magru01 in Netbox (all Ganeti nodes have been removed from it) T394263 [09:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:13] T394263: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263 [09:44:23] (03Merged) 10jenkins-bot: Stop setting $wgPageLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1158804 (https://phabricator.wikimedia.org/T299947) (owner: 10Zabe) [09:44:26] (03Merged) 10jenkins-bot: wikidatawiki: Increase revision-slots cache back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159388 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [09:45:07] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1159388|wikidatawiki: Increase revision-slots cache back to default (T183490)]], [[gerrit:1158804|Stop setting $wgPageLinksSchemaMigrationStage (T299947)]] [09:45:12] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [09:45:12] T299947: Normalize pagelinks table - https://phabricator.wikimedia.org/T299947 [09:45:25] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove ganeti7003 - jmm@cumin2002" [09:45:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove ganeti7003 - jmm@cumin2002" [09:46:23] (03Abandoned) 10Filippo Giunchedi: reimage: check for Monitoring::Host in puppetdb [cookbooks] - 
10https://gerrit.wikimedia.org/r/1156264 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [09:46:28] (03Abandoned) 10Filippo Giunchedi: monitoring: add note about reimage cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1156265 (https://phabricator.wikimedia.org/T395449) (owner: 10Filippo Giunchedi) [09:46:31] (03CR) 10Vgutierrez: Routed Ganeti: disable rp_filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:47:00] !log zabe@deploy1003 zabe: Backport for [[gerrit:1159388|wikidatawiki: Increase revision-slots cache back to default (T183490)]], [[gerrit:1158804|Stop setting $wgPageLinksSchemaMigrationStage (T299947)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:47:02] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:49:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:51:00] !log zabe@deploy1003 zabe: Continuing with sync [09:51:00] (03PS2) 10Ayounsi: Routed Ganeti: disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) [09:51:06] (03CR) 10Ayounsi: Routed Ganeti: disable rp_filter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:51:12] RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:51:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T396130)', diff saved to https://phabricator.wikimedia.org/P78002 and previous config saved to /var/cache/conftool/dbconfig/20250616-095135-marostegui.json [09:51:39] T396130: Add afl_ip_hex column and afl_var_dump_timestamp 
index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [09:53:12] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:54:36] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:57:53] (03PS3) 10Ayounsi: Routed Ganeti: disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) [09:57:54] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159388|wikidatawiki: Increase revision-slots cache back to default (T183490)]], [[gerrit:1158804|Stop setting $wgPageLinksSchemaMigrationStage (T299947)]] (duration: 12m 46s) [09:57:58] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [09:57:59] T299947: Normalize pagelinks table - https://phabricator.wikimedia.org/T299947 [09:58:35] (03PS4) 10Ayounsi: Routed Ganeti: disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) [09:58:47] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:59:27] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove magru01 cluster - jmm@cumin2002" [09:59:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove magru01 cluster - jmm@cumin2002" [09:59:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1000) [10:02:47] (03PS1) 10Marostegui: db1246: Install MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159397 
(https://phabricator.wikimedia.org/T396549) [10:03:15] (03CR) 10Marostegui: [C:03+2] db1246: Install MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159397 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui) [10:04:00] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [10:04:50] (03PS1) 10Arnaudb: gerrit: read-only plugin orchestration in failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) [10:05:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance [10:05:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T395241)', diff saved to https://phabricator.wikimedia.org/P78005 and previous config saved to /var/cache/conftool/dbconfig/20250616-100521-fceratto.json [10:06:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P78006 and previous config saved to /var/cache/conftool/dbconfig/20250616-100642-marostegui.json [10:07:42] (03CR) 10Ayounsi: [C:03+2] Routed Ganeti: disable rp_filter [puppet] - 10https://gerrit.wikimedia.org/r/1159390 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [10:08:34] (03CR) 10Hnowlan: [C:03+2] rest-gateway: route html<->wikitext transforms to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156811 (https://phabricator.wikimedia.org/T396856) (owner: 10Hnowlan) [10:10:37] (03Merged) 10jenkins-bot: rest-gateway: route html<->wikitext transforms to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156811 (https://phabricator.wikimedia.org/T396856) (owner: 10Hnowlan) [10:11:02] (03PS1) 10Muehlenhoff: Add ganeti7003 to the routed Ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159398 
(https://phabricator.wikimedia.org/T394263) [10:12:00] (03PS5) 10Jgiannelos: changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) [10:12:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T395241)', diff saved to https://phabricator.wikimedia.org/P78008 and previous config saved to /var/cache/conftool/dbconfig/20250616-101244-fceratto.json [10:15:17] (03PS1) 10Filippo Giunchedi: bird: remove check_anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) [10:15:55] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:16:04] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:16:39] (03CR) 10Filippo Giunchedi: [C:03+1] "Might not be needed after all, see Iff23cb1941ca3b0" [puppet] - 10https://gerrit.wikimedia.org/r/1155142 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [10:18:06] (03CR) 10Jgiannelos: changeprop: Remove rules related to parsoid (RB sunset) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [10:20:06] (03CR) 10Hnowlan: [C:03+1] changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [10:21:25] (03PS1) 10Majavah: P:wmcs::cloudgw: Install natlog on cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) [10:21:31] (03CR) 10Jgiannelos: [C:03+2] changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [10:21:50] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P78010 and previous config saved to /var/cache/conftool/dbconfig/20250616-102150-marostegui.json [10:22:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [10:23:27] (03CR) 10Gmodena: [C:03+2] dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156821 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [10:23:29] (03Merged) 10jenkins-bot: changeprop: Remove rules related to parsoid (RB sunset) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156389 (https://phabricator.wikimedia.org/T367418) (owner: 10Jgiannelos) [10:24:33] (03PS2) 10Majavah: P:wmcs::cloudgw: Install natlog on cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) [10:25:12] (03Merged) 10jenkins-bot: dse: mw-content-history: version bump staging app [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156821 (https://phabricator.wikimedia.org/T347282) (owner: 10Gmodena) [10:25:12] !log installing qemu security updates [10:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:01] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [10:26:34] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:27:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P78011 and 
previous config saved to /var/cache/conftool/dbconfig/20250616-102752-fceratto.json [10:28:34] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:28:37] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:28:39] (03PS3) 10Majavah: P:wmcs::cloudgw: Install natlog on cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) [10:28:43] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:28:46] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:29:04] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:29:16] !log Manual run job.batch/update-special-pages-s8-manual-202506161028 started - T396977 [10:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:20] T396977: MediaWiki periodic job update-special-pages-s8 failed - https://phabricator.wikimedia.org/T396977 [10:29:34] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:29:36] (03PS1) 10Marostegui: db1229: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159403 (https://phabricator.wikimedia.org/T396549) [10:29:46] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:29:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1229 T396549', diff saved to https://phabricator.wikimedia.org/P78012 and previous config saved to /var/cache/conftool/dbconfig/20250616-102949-marostegui.json [10:29:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [10:29:54] T396549: Migrate s2 to MariaDB 10.11 - 
https://phabricator.wikimedia.org/T396549 [10:29:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1229.eqiad.wmnet with reason: Maintenance [10:30:01] (03CR) 10Muehlenhoff: [C:03+2] Remove SSH key for aarora [puppet] - 10https://gerrit.wikimedia.org/r/1159329 (owner: 10Muehlenhoff) [10:30:15] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:30:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [10:31:37] (03CR) 10Marostegui: [C:03+2] db1229: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1159403 (https://phabricator.wikimedia.org/T396549) (owner: 10Marostegui) [10:31:39] (03PS1) 10Muehlenhoff: cross-validate-accounts: Run a shallow git clone [puppet] - 10https://gerrit.wikimedia.org/r/1159404 [10:31:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:32:14] (03CR) 10David Caro: "This broke cloud instances:" [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [10:34:26] (03PS1) 10David Caro: cloud.yaml: add missing profile::puppetdb::pdb_resource_exporter_config [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) [10:34:50] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [10:34:53] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [10:35:06] (03CR) 10David Caro: "Fix 
here https://gerrit.wikimedia.org/r/c/operations/puppet/+/1159406" [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [10:36:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T396130)', diff saved to https://phabricator.wikimedia.org/P78014 and previous config saved to /var/cache/conftool/dbconfig/20250616-103657-marostegui.json [10:37:02] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:37:13] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1203.eqiad.wmnet with reason: Maintenance [10:37:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T396130)', diff saved to https://phabricator.wikimedia.org/P78015 and previous config saved to /var/cache/conftool/dbconfig/20250616-103720-marostegui.json [10:37:38] (03CR) 10Filippo Giunchedi: [C:03+1] cloud.yaml: add missing profile::puppetdb::pdb_resource_exporter_config [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) (owner: 10David Caro) [10:43:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P78016 and previous config saved to /var/cache/conftool/dbconfig/20250616-104259-fceratto.json [10:43:23] (03PS2) 10David Caro: cloud.yaml: add missing profile::puppetdb::pdb_resource_exporter_config [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) [10:44:16] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) (owner: 10David Caro) [10:44:39] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) (owner: 10David Caro) 
[10:47:13] (03PS3) 10David Caro: puppetdb: allow making the exporter config null [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) [10:48:54] (03PS4) 10David Caro: puppetdb: allow making the exporter config null [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) [10:49:02] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) (owner: 10David Caro) [10:51:53] (03CR) 10David Caro: [C:03+2] puppetdb: allow making the exporter config null [puppet] - 10https://gerrit.wikimedia.org/r/1159406 (https://phabricator.wikimedia.org/T395442) (owner: 10David Caro) [10:53:50] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#10917731 (10Clement_Goubert) [10:53:50] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:53:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T396130)', diff saved to https://phabricator.wikimedia.org/P78018 and previous config saved to /var/cache/conftool/dbconfig/20250616-105353-marostegui.json [10:53:56] jouncebot: nowandnext [10:53:56] For the next 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1000) [10:53:56] In 2 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1300) [10:53:58] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [10:54:04] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:55:04] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [10:55:12] !log 
hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [10:56:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78019 and previous config saved to /var/cache/conftool/dbconfig/20250616-105621-root.json [10:57:42] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:58:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T395241)', diff saved to https://phabricator.wikimedia.org/P78020 and previous config saved to /var/cache/conftool/dbconfig/20250616-105806-fceratto.json [11:01:02] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti7003 to the routed Ganeti cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159398 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:08:07] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet [11:09:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P78022 and previous config saved to /var/cache/conftool/dbconfig/20250616-110901-marostegui.json [11:11:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78023 and previous config saved to /var/cache/conftool/dbconfig/20250616-111127-root.json [11:14:56] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-backup2002.codfw.wmnet with reason: Maintenance and reboot [11:15:05] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1159404 (owner: 10Muehlenhoff) [11:15:54] !log 
fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1252* slowly with 10 steps - Pooling in [11:19:55] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet [11:21:56] (03CR) 10Brouberol: "Adding a core SRE to the patch as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [11:24:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P78026 and previous config saved to /var/cache/conftool/dbconfig/20250616-112408-marostegui.json [11:26:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78027 and previous config saved to /var/cache/conftool/dbconfig/20250616-112633-root.json [11:30:50] (03CR) 10Jgiannelos: [C:03+1] trafficserver: migrate html<->wikitext transforms out of restbase [puppet] - 10https://gerrit.wikimedia.org/r/1156813 (https://phabricator.wikimedia.org/T396856) (owner: 10Hnowlan) [11:34:01] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [11:34:21] (03PS1) 10Cathal Mooney: Rename cloud-in to cloud-vrf-in [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 [11:37:56] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner1004.eqiad.wmnet with OS bookworm [11:38:45] 10ops-magru: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T390258#10917858 (10cmooney) >>! In T390258#10911945, @ayounsi wrote: > Looking at Mar 28 2025, there seems like there was some small events, but nothing worth investigating, we can close that for now. Yep agreed. >>! In T390258#10910... 
[11:38:49] (03CR) 10Muehlenhoff: [C:03+2] Apply ncredir role to ncredir7004 [puppet] - 10https://gerrit.wikimedia.org/r/1156814 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:39:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T396130)', diff saved to https://phabricator.wikimedia.org/P78028 and previous config saved to /var/cache/conftool/dbconfig/20250616-113915-marostegui.json [11:39:20] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:39:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1209.eqiad.wmnet with reason: Maintenance [11:39:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T396130)', diff saved to https://phabricator.wikimedia.org/P78029 and previous config saved to /var/cache/conftool/dbconfig/20250616-113938-marostegui.json [11:40:34] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [11:41:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1229 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78030 and previous config saved to /var/cache/conftool/dbconfig/20250616-114138-root.json [11:43:40] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:59] (03CR) 10Majavah: [C:04-1] Rename cloud-in to cloud-vrf-in (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 (owner: 10Cathal Mooney) [11:44:01] (03CR) 10Muehlenhoff: [C:03+2] cross-validate-accounts: Run 
a shallow git clone [puppet] - 10https://gerrit.wikimedia.org/r/1159404 (owner: 10Muehlenhoff) [11:45:34] (03CR) 10Brouberol: [C:03+2] mediawiki: define a dumps suspended CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156820 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [11:45:42] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: allow the airflow service account to query CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156830 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [11:50:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [11:50:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [11:50:14] !log jmm@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti7003.magru.wmnet to cluster magru03 and group B [11:51:21] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7003.magru.wmnet to cluster magru03 and group B [11:54:12] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage [11:54:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T396130)', diff saved to https://phabricator.wikimedia.org/P78031 and previous config saved to /var/cache/conftool/dbconfig/20250616-115417-marostegui.json [11:54:22] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [11:56:46] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7003.magru.wmnet to drbd [11:57:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1004.eqiad.wmnet with reason: host reimage [11:57:19] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - 
https://phabricator.wikimedia.org/T394263#10917964 (10ops-monitoring-bot) VM durum7003.magru.wmnet switching disk type to drbd [12:02:30] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10917977 (10Stevemunene) [12:03:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10917978 (10Stevemunene) [12:03:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10917979 (10Stevemunene) [12:06:57] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7003.magru.wmnet to drbd [12:07:13] PROBLEM - Host durum7003 is DOWN: PING CRITICAL - Packet loss = 100% [12:08:13] RECOVERY - Host durum7003 is UP: PING OK - Packet loss = 0%, RTA = 115.64 ms [12:09:13] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7003 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [12:09:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P78033 and previous config saved to /var/cache/conftool/dbconfig/20250616-120924-marostegui.json [12:09:49] PROBLEM - Bird Internet Routing Daemon on durum7003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:10:49] RECOVERY - Bird Internet Routing Daemon on durum7003 is OK: PROCS OK: 1 process with command name bird 
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:11:13] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7003 is OK: OK: UP (pid=2398) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [12:11:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Very slow data transfers during migrations affecting ganeti1047/ganeti1048 - https://phabricator.wikimedia.org/T397025 (10MoritzMuehlenhoff) 03NEW [12:11:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Very slow data transfers during migrations affecting ganeti1047/ganeti1048 - https://phabricator.wikimedia.org/T397025#10917999 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:15:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1004.eqiad.wmnet with OS bookworm [12:17:41] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7003.wikimedia.org to drbd [12:18:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2080.codfw.wmnet with OS bullseye [12:18:14] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#10918024 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2080.codfw.wm... 
[12:18:19] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye [12:18:32] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2080 [12:18:51] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [12:19:59] (03PS1) 10Andrew Bogott: codfw1dev ceph: cloudcephmons -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1159425 (https://phabricator.wikimedia.org/T309789) [12:20:41] jmm@cumin1003 changedisk (PID 1825196) is awaiting input [12:22:49] (03CR) 10Tiziano Fogli: [C:03+1] pontoon: keep netbox-hiera updated [puppet] - 10https://gerrit.wikimedia.org/r/1159389 (owner: 10Filippo Giunchedi) [12:24:21] jouncebot: nowandnext [12:24:21] No deployments scheduled for the next 0 hour(s) and 35 minute(s) [12:24:21] In 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1300) [12:24:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P78034 and previous config saved to /var/cache/conftool/dbconfig/20250616-122432-marostegui.json [12:24:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10918046 (10ops-monitoring-bot) VM doh7003.wikimedia.org switching disk type to drbd [12:24:52] (03PS1) 10Máté Szabó: Add missing labels for email confirmation reminder preferences [extensions/Echo] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159438 (https://phabricator.wikimedia.org/T58074) [12:24:58] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2080 - mvernon@cumin2002" [12:25:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Update records for host ms-be2080 - mvernon@cumin2002" [12:25:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:25:04] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2080.codfw.wmnet 245.48.192.10.in-addr.arpa 5.4.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:25:07] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2080.codfw.wmnet 245.48.192.10.in-addr.arpa 5.4.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:25:08] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2080 [12:25:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/Echo] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159438 (https://phabricator.wikimedia.org/T58074) (owner: 10Máté Szabó) [12:25:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2080 [12:25:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2080 [12:25:47] (03CR) 10Stevemunene: [C:03+2] zookeeper: onboard an-conf1006 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [12:27:30] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: keep netbox-hiera updated [puppet] - 10https://gerrit.wikimedia.org/r/1159389 (owner: 10Filippo Giunchedi) [12:30:13] jouncebot: now and next [12:30:13] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [12:30:42] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:34:12] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [12:34:14] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7003.wikimedia.org to drbd [12:34:18] PROBLEM - Host doh7003 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:40] RECOVERY - Host doh7003 is UP: PING OK - Packet loss = 0%, RTA = 115.64 ms [12:34:50] PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7003 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [12:35:18] PROBLEM - Bird Internet Routing Daemon on doh7003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:35:42] RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:37:45] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [12:37:50] RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7003 is OK: OK: UP (pid=2336) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [12:38:18] RECOVERY - Bird Internet Routing Daemon on doh7003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [12:39:37] (03CR) 10Majavah: "Hmm, the rules in 
`alerts.git:team-traffic/anycast_healthchecker.yaml` are for traffic roles only so this is effectively removing alerting" [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi) [12:39:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T396130)', diff saved to https://phabricator.wikimedia.org/P78035 and previous config saved to /var/cache/conftool/dbconfig/20250616-123939-marostegui.json [12:39:45] (03PS1) 10Jcrespo: dbbackups: Advance snapshot dbbackups start time by 4 hours [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380) [12:39:46] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:39:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1214.eqiad.wmnet with reason: Maintenance [12:40:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T396130)', diff saved to https://phabricator.wikimedia.org/P78036 and previous config saved to /var/cache/conftool/dbconfig/20250616-124002-marostegui.json [12:40:32] (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380) (owner: 10Jcrespo) [12:41:07] (03CR) 10Elukey: [C:03+1] phabricator: expand support for Phabricator tasks (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154786 (owner: 10Volans) [12:41:41] (03PS2) 10Jcrespo: dbbackups: Advance snapshot dbbackups start time by 4 hours [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380) [12:42:57] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2080.codfw.wmnet with reason: host reimage [12:45:15] (03CR) 10Jcrespo: "Snapshots take ~11 hours to complete ATM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380) (owner: 10Jcrespo) [12:46:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2080.codfw.wmnet with reason: host reimage [12:48:00] jouncebot: nowandnext [12:48:00] No deployments scheduled for the next 0 hour(s) and 11 minute(s) [12:48:00] In 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1300) [12:50:11] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:55] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:51:08] (03CR) 10Filippo Giunchedi: "Good point, we can certainly extend/duplicate the alert to other ac-healthchecker users." 
[puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi) [12:51:43] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [12:52:13] PROBLEM - Zookeeper Server on an-conf1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [12:52:27] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T396130)', diff saved to https://phabricator.wikimedia.org/P78037 and previous config saved to /var/cache/conftool/dbconfig/20250616-125442-marostegui.json [12:54:48] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [12:54:48] (03PS1) 10Bartosz Dziewoński: Try subresource JS autologin on SUL3 domain first if configured [extensions/CentralAuth] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159444 (https://phabricator.wikimedia.org/T391284) [12:54:53] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:54:55] (03CR) 10Majavah: [C:03+1] "Mostly I want to be alerted when a service is unhealthy causing the announcement to be withdrawn. 
On a closer look the old monitoring didn" [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi) [12:55:04] (03PS1) 10Bartosz Dziewoński: Fix adding warnings to ParserOutput [extensions/TemplateStyles] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159446 (https://phabricator.wikimedia.org/T396768) [12:55:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159444 (https://phabricator.wikimedia.org/T391284) (owner: 10Bartosz Dziewoński) [12:55:23] (03CR) 10Jforrester: [C:03+1] Fix adding warnings to ParserOutput [extensions/TemplateStyles] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159446 (https://phabricator.wikimedia.org/T396768) (owner: 10Bartosz Dziewoński) [12:55:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/TemplateStyles] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159446 (https://phabricator.wikimedia.org/T396768) (owner: 10Bartosz Dziewoński) [12:55:27] was there a small net downtime on eqiad C3 ? 
several hosts complained at the same time [12:55:33] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [12:57:36] (03CR) 10Btullis: [C:03+1] mediawiki: convert the dumps Job into a CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156831 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [12:57:49] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: drop the batch.Job.get rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156832 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [12:57:57] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:58:37] (03PS2) 10Brouberol: mediawiki: convert the dumps Job into a CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156831 (https://phabricator.wikimedia.org/T389786) [12:58:37] (03PS2) 10Brouberol: mediawiki-dumps-legacy: drop the batch.Job.get rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156832 (https://phabricator.wikimedia.org/T389786) [12:58:43] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [12:58:45] webproxy is timing me out [12:58:47] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:58:58] things are weird right now, network-wise [12:59:12] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of bast7002.wikimedia.org to drbd [12:59:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10918166 (10ops-monitoring-bot) VM bast7002.wikimedia.org switching disk type to drbd [12:59:33] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time 
on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1300). [13:00:04] phuedx, Tchanders, Mvolz, mszabo, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:13] topranks: I think (potentially on C3, but not 100% sure) something is causing network downtimes [13:00:14] I can’t deploy at the moment, I’m in a meeting [13:00:19] might be able to deploy in 30 minutes if nobody else is around [13:00:20] C3 on eqiad [13:00:29] jynus: ok [13:00:48] (03CR) 10Kosta Harlan: [C:03+1] temp accounts: Enable temp account creation on three wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155683 (https://phabricator.wikimedia.org/T396464) (owner: 10Tchanders) [13:00:52] hi [13:01:36] o/ [13:02:12] jynus: what do you suspect is happening? [13:02:27] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:28] communication within our cluster seems flaky at times [13:02:32] "several hosts complained at the same time" [13:02:48] only in that rack? [13:02:49] multiple times I got "Could not connect to webproxy.eqiad.wmnet" [13:02:53] topranks: mostly [13:03:03] that's why I am not 100% sure [13:03:04] o/ [13:03:06] I can start deploying some of these patches, might not have time for all of them [13:03:19] I can do the config changes and backports as two separate deploys using SpiderPig? 
[13:03:19] I can self-service if I don't fit into the window [13:03:34] topranks: let's say I observed only errors on C3, but I cannot say it was something else too [13:03:38] Tchanders beat me to it :) [13:03:48] phuedx: Go for it! [13:04:09] !log disable puppet on all hosts using the bird puppet module for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052109 [13:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:25] Mvolz: yt? [13:04:49] (03CR) 10Ayounsi: [C:03+2] Bird: use the "interface" config option for v6 peers [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [13:04:51] jynus: ok, to be specific wikikube-workers is it? [13:04:54] (03PS1) 10Stevemunene: add an-conf1006 to the list of analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/1159451 (https://phabricator.wikimedia.org/T374922) [13:04:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2080.codfw.wmnet with OS bullseye [13:05:05] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#10918197 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2080.codfw.wmnet... [13:05:12] phuedx: yup [13:05:20] "connection timed out" from es1032 [13:06:27] my guess is the other hosts that complained (an-conf1006 or where I saw packets being lost: db1150) had the same network issue [13:06:33] I don't have permissions to +2 the config repo, but I can do the deploy itself myself (with hand holding). [13:06:38] Tchanders, Mvolz: I _think_ I can bundle our config changes into one deploy to reduce time. 
They're all completely unrelated [13:07:06] I'm okay with that but if it goes wrong you'll have to roll back the whole thing [13:07:16] (03CR) 10Brouberol: [C:03+1] add an-conf1006 to the list of analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/1159451 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [13:07:22] phuedx: That sounds good [13:07:25] 06SRE, 10Pywikibot: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10918209 (10Vgutierrez) >>! In T388809#10893993, @siebrand wrote: > DNS NS records updated, and now pointing to Wikimedia. we need DNSSEC disabled on the registrar to be able to handl... [13:07:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156872 (https://phabricator.wikimedia.org/T395692) (owner: 10Phuedx) [13:07:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders) [13:07:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153307 (https://phabricator.wikimedia.org/T395933) (owner: 10Dreamy Jazz) [13:07:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [13:07:44] jynus: ok running some tests now from them hosts to see if I can find anything [13:08:18] (03Merged) 10jenkins-bot: ext-EventStreamConfig: Update product_metrics.web_base stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156872 (https://phabricator.wikimedia.org/T395692) (owner: 10Phuedx) [13:08:24] (03Merged) 10jenkins-bot: Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders) [13:08:26] (03Merged) 10jenkins-bot: Enable temporary accounts onboarding dialog on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153307 (https://phabricator.wikimedia.org/T395933) (owner: 10Dreamy Jazz) [13:08:26] (03CR) 10Stevemunene: [C:03+2] add an-conf1006 to the list of analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/1159451 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [13:08:28] (03Merged) 10jenkins-bot: Change citoid config for test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [13:08:43] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org [reason: bird testing CR 1052109] [13:08:43] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1156872|ext-EventStreamConfig: Update product_metrics.web_base stream (T395692)]], [[gerrit:1127960|Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group (T376315)]], [[gerrit:1153307|Enable temporary accounts onboarding dialog on WMF wikis (T395933)]], [[gerrit:1139808|Change citoid config for test wiki (T361576)]] [13:08:54] T395692: Add performer_pageview_id contextual attribute to web base stream - https://phabricator.wikimedia.org/T395692 [13:08:54] T376315: Control access to global-temporary-account-viewer group on WMF wikis automatically - https://phabricator.wikimedia.org/T376315 [13:08:54] T395933: Enable the temporary accounts onboarding dialog on WMF wikis - https://phabricator.wikimedia.org/T395933 [13:08:54] T361576: Switch from restbase to rest-gateway for Citoid - https://phabricator.wikimedia.org/T361576 [13:09:47] topranks: it seems not to be ongoing, so maybe someone just started a too fast data transmission from that rack [13:09:50] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P78039 and previous config saved to /var/cache/conftool/dbconfig/20250616-130950-marostegui.json [13:09:56] (03CR) 10Brouberol: [C:03+2] mediawiki: convert the dumps Job into a CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156831 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [13:09:57] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [13:09:58] which server do i pick for testing? (when the time comes) I forget :) [13:10:02] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: drop the batch.Job.get rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156832 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [13:10:22] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloudgw: Install natlog on cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1159400 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [13:10:30] jynus: yeah I didn't spot anything in graphs yet, I'll start looking at logs now shortly once I verify things look ok right now [13:10:38] do you have an approximate timestamp you noticed the problems? [13:10:40] !log phuedx@deploy1003 phuedx, mvolz, dreamyjazz, tchanders: Backport for [[gerrit:1156872|ext-EventStreamConfig: Update product_metrics.web_base stream (T395692)]], [[gerrit:1127960|Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group (T376315)]], [[gerrit:1153307|Enable temporary accounts onboarding dialog on WMF wikis (T395933)]], [[gerrit:1139808|Change citoid config for test wiki (T361576)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:11:08] Tchanders, Mvolz: Please test your changes and report back [13:11:11] (03CR) 10Effie Mouzeli: "That is all correct!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [13:11:20] topranks: I saw a small increase in tcp retransmits, but nothing out of the ordinary: https://grafana.wikimedia.org/goto/-EQm46YNg?orgId=1 [13:11:20] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10918247 (10MatthewVernon) Quick question: I'm concerned about the rather vague timeline for deleting `tegola-swift-eqiad-v... [13:11:26] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org [reason: tested CR 1052109] [13:11:56] phuedx: Looks good to me [13:12:27] fyi, i can't deploy my own changes, i would appreciate if someone could click the necessary buttons for me. they can go out together to save time. [13:12:36] (03Merged) 10jenkins-bot: mediawiki: convert the dumps Job into a CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156831 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [13:12:52] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: drop the batch.Job.get rbac [deployment-charts] - 10https://gerrit.wikimedia.org/r/1156832 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [13:14:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [13:14:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T395241)', diff saved to https://phabricator.wikimedia.org/P78040 and previous config saved to /var/cache/conftool/dbconfig/20250616-131410-fceratto.json [13:14:21] (03CR) 10Bking: [C:03+2] elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1138400 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [13:15:01] I've tested my change.
I'm seeing the correct contextual attributes coming through for enwiki and metawikiwiki [13:15:06] Mvolz? [13:15:11] phuedx: mine broke test wiki [13:15:13] no go [13:15:13] RECOVERY - Zookeeper Server on an-conf1006 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [13:15:15] sorry [13:15:32] at least it didn't break en wiki [13:15:42] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10918268 (10MatthewVernon) To put a little more context on that: ` root@thanos-fe1004:/home/mvernon# for b in $(swift list)... [13:15:45] Noted. I'll stop this deployment [13:15:59] Mvolz: Could you submit a revert? [13:16:06] !log phuedx@deploy1003 Sync cancelled. [13:16:32] (03PS1) 10Mvolz: Revert "Change citoid config for test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159455 [13:16:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [13:16:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [13:16:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1026 T395241', diff saved to https://phabricator.wikimedia.org/P78041 and previous config saved to /var/cache/conftool/dbconfig/20250616-131646-marostegui.json [13:17:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1026.eqiad.wmnet with reason: Maintenance [13:17:11] (03PS1) 10Majavah: natlog: Set required START=yes on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1159456 (https://phabricator.wikimedia.org/T273734) [13:17:19] Tchanders, Mvolz: I'll sync that revert and that should get us to where we need to be [13:17:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159455 (owner: 10Mvolz) [13:18:00] (03CR) 10Majavah: [C:03+2] natlog: Set required START=yes on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1159456 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [13:18:22] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast7002.wikimedia.org to drbd [13:18:37] PROBLEM - Host bast7002 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:48] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev ceph: cloudcephmons -> puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1159425 (https://phabricator.wikimedia.org/T309789) (owner: 10Andrew Bogott) [13:18:53] (03CR) 10Ladsgroup: [C:03+1] swift: restore ms-be2080 to the rings post-reimage [puppet] - 10https://gerrit.wikimedia.org/r/1138832 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [13:19:15] RECOVERY - Restbase root url on restbase1043 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/RESTBase [13:19:17] (03Merged) 10jenkins-bot: Revert "Change citoid config for test wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159455 (owner: 10Mvolz) [13:19:23] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye [13:19:29] RECOVERY - Restbase root url on restbase1041 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/RESTBase [13:19:30] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1159455|Revert "Change citoid config for test wiki"]] [13:19:39] RECOVERY - Host bast7002 is UP: PING OK - Packet loss = 0%, RTA = 115.43 ms [13:19:59] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye [13:20:44] 06SRE, 10SRE-swift-storage, 
07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10918314 (10elukey) @Jgiannelos @MSantos Hi! My understanding is that Tegola is now using `tegola-swift-codfw-v002` and `te... [13:20:49] phuedx: Thank you! [13:21:16] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10918325 (10MatthewVernon) ...so ideally, delete all the old data and then you can just go ahead (and maybe let's make a ro... [13:21:24] !log phuedx@deploy1003 mvolz, phuedx: Backport for [[gerrit:1159455|Revert "Change citoid config for test wiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:21:29] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10918326 (10Jelto) >>! In T378922#10848027, @jcrespo wrote: > I am working on setting up the dedicated gitlab/gerrit storage host,... [13:21:42] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for cmelo - https://phabricator.wikimedia.org/T395966#10918327 (10herron) 05Open→03Resolved [13:21:50] (03CR) 10MVernon: [C:03+2] swift: restore ms-be2080 to the rings post-reimage [puppet] - 10https://gerrit.wikimedia.org/r/1138832 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [13:21:55] Mvolz: Could you check that testwiki is OK now? [13:22:04] (03CR) 10Mvolz: "When we tried to deploy this it literally put "false" in the test wiki config... i.e. requests were made to https://test.wikipedia.org/w/f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [13:22:04] Tchanders: Would you mind re-checking your changes? 
I'll do the same [13:22:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78042 and previous config saved to /var/cache/conftool/dbconfig/20250616-132250-root.json [13:23:26] phuedx: Still looks good [13:23:34] !log root@cumin1002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for ms-backup2002.codfw.wmnet: Renew puppet certificate - root@cumin1002 [13:23:37] 06SRE, 10SRE-swift-storage, 07SRE-Unowned, 06Data-Persistence, 10Maps: Create a new bucket for Tegola's tile cache and duplicate its data - https://phabricator.wikimedia.org/T396584#10918333 (10MoritzMuehlenhoff) >>! In T396584#10918325, @MatthewVernon wrote: > ...so ideally, delete all the old data and... [13:24:07] I've re-confirmed that the correct context attributes are appearing on enwiki and metawikiwiki [13:24:24] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#10918346 (10MatthewVernon) [13:24:34] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host gitlab-runner2002.codfw.wmnet with OS bookworm [13:24:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T395241)', diff saved to https://phabricator.wikimedia.org/P78043 and previous config saved to /var/cache/conftool/dbconfig/20250616-132452-fceratto.json [13:25:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P78044 and previous config saved to /var/cache/conftool/dbconfig/20250616-132504-marostegui.json [13:25:19] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7003.magru.wmnet to drbd [13:25:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - 
https://phabricator.wikimedia.org/T394263#10918362 (10ops-monitoring-bot) VM ncredir7003.magru.wmnet switching disk type to drbd [13:26:13] (03PS1) 10Majavah: natlog: Fix line matching [puppet] - 10https://gerrit.wikimedia.org/r/1159457 (https://phabricator.wikimedia.org/T273734) [13:26:40] phuedx: yeah it's okay now [13:26:55] Thanks. Continuing [13:27:07] !log phuedx@deploy1003 mvolz, phuedx: Continuing with sync [13:27:37] (03CR) 10Majavah: [C:03+2] natlog: Fix line matching [puppet] - 10https://gerrit.wikimedia.org/r/1159457 (https://phabricator.wikimedia.org/T273734) (owner: 10Majavah) [13:30:44] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10918393 (10herron) 05Open→03Stalled [13:31:57] !log T362392 [13:31:59] ha [13:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:02] T362392: Routed Ganeti: Add support for VM BGP - https://phabricator.wikimedia.org/T362392 [13:32:10] !log sudo cumin -b1 -s30 'A:dnsbox' "run-puppet-agent --enable 'CR1052109'": T362392 [13:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:58] !log sudo cumin -b1 -s30 'A:wikidough' "run-puppet-agent --enable 'CR1052109'": T362392 [13:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:53] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159455|Revert "Change citoid config for test wiki"]] (duration: 14m 22s) [13:34:24] Tchanders, Mvolz: Done [13:34:39] (03PS1) 10Bking: elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1159459 (https://phabricator.wikimedia.org/T387569) [13:35:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159459 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [13:35:16] (03PS2) 10NMW03: Set category collation to "uca-az" for 
Azerbaijani projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) [13:35:20] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7003.magru.wmnet to drbd [13:35:22] PROBLEM - Host ncredir7003 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:40] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [13:35:42] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 115.82 ms [13:36:47] XioNoX: ^ what was this flap about? I don't see how ncredir could be related to the bird change but perhaps the ganeti? [13:36:48] mszabo: Do you want to self-service deploy after I've deployed MatmaRex's as a pair? [13:37:00] sounds good [13:37:27] sukhe: moritzm switching the VMs back to drbd [13:37:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153722 (https://phabricator.wikimedia.org/T395896) (owner: 10NMW03) [13:37:50] sukhe: see few lines above "jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7003.magru.wmnet to drbd" [13:37:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78045 and previous config saved to /var/cache/conftool/dbconfig/20250616-133755-root.json [13:38:03] !log jmm@cumin1003 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus7002.magru.wmnet to drbd [13:38:47] sukhe: these are inactive nodes, I'm switching the VMs to DRBD disk storage now that the routed Ganeti cluster has grown to three nodes [13:38:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport"
[extensions/CentralAuth] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159444 (https://phabricator.wikimedia.org/T391284) (owner: 10Bartosz Dziewoński) [13:38:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [extensions/TemplateStyles] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159446 (https://phabricator.wikimedia.org/T396768) (owner: 10Bartosz Dziewoński) [13:38:51] ah thanks, sorry! [13:38:52] missed that [13:39:05] MatmaRex: I'll ping you when the changes are ready to test [13:39:14] thanks [13:39:19] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [13:39:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10918548 (10ops-monitoring-bot) VM prometheus7002.magru.wmnet switching disk type to drbd [13:39:46] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10918550 (10jcrespo) >>! In T378922#10918326, @Jelto wrote: > Thank you for the work on dedicated hardware. In T378922#10804784 I t... [13:39:56] I guess there will be no time left for my patch, right? 
[13:40:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P78046 and previous config saved to /var/cache/conftool/dbconfig/20250616-134000-fceratto.json [13:40:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T396130)', diff saved to https://phabricator.wikimedia.org/P78047 and previous config saved to /var/cache/conftool/dbconfig/20250616-134012-marostegui.json [13:40:18] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:40:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1226.eqiad.wmnet with reason: Maintenance [13:40:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T396130)', diff saved to https://phabricator.wikimedia.org/P78048 and previous config saved to /var/cache/conftool/dbconfig/20250616-134036-marostegui.json [13:40:40] (03Merged) 10jenkins-bot: Try subresource JS autologin on SUL3 domain first if configured [extensions/CentralAuth] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159444 (https://phabricator.wikimedia.org/T391284) (owner: 10Bartosz Dziewoński) [13:41:01] (03Merged) 10jenkins-bot: Fix adding warnings to ParserOutput [extensions/TemplateStyles] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159446 (https://phabricator.wikimedia.org/T396768) (owner: 10Bartosz Dziewoński) [13:41:20] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1159444|Try subresource JS autologin on SUL3 domain first if configured (T391284)]], [[gerrit:1159446|Fix adding warnings to ParserOutput (T396768)]] [13:41:26] T391284: Swap order of central autologin lookup for loginwiki and shared domain - https://phabricator.wikimedia.org/T391284 [13:41:26] T396768: InvalidArgumentException: Scalar parameter must be a string, number, Stringable, or 
MessageSpecifier; got array - https://phabricator.wikimedia.org/T396768 [13:41:50] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10918563 (10herron) [13:42:13] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2002.codfw.wmnet with reason: host reimage [13:42:51] (03PS1) 10Fabfur: cache,haproxy: remove old ipblock map files [puppet] - 10https://gerrit.wikimedia.org/r/1159461 (https://phabricator.wikimedia.org/T396621) [13:43:13] !log phuedx@deploy1003 phuedx, matmarex: Backport for [[gerrit:1159444|Try subresource JS autologin on SUL3 domain first if configured (T391284)]], [[gerrit:1159446|Fix adding warnings to ParserOutput (T396768)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:43:33] testing [13:43:45] Turns out you get pinged automatically :) [13:45:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:45] (03CR) 10Brouberol: [C:03+1] elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1159459 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [13:45:51] (03CR) 10Brouberol: [C:03+2] Airflow analytics-test: Add scheduler access to hadoop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154071 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [13:45:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2002.codfw.wmnet with reason: host reimage [13:45:56] (03CR) 10Fabfur: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1159461 (https://phabricator.wikimedia.org/T396621) (owner: 10Fabfur) [13:46:28] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: return traffic to all DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155738 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:47:10] !log enable puppet and run agent on cephosd1001 [13:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:13] phuedx: both look good [13:47:17] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-user (and LDAP nda, wmde) for Anton Kokh (WMDE) - https://phabricator.wikimedia.org/T395917#10918595 (10herron) Hi @Anton.Kokh could you please add a unique SSH key he... [13:47:23] MatmaRex: ACK. Continuing [13:47:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:47:30] !log phuedx@deploy1003 phuedx, matmarex: Continuing with sync [13:48:13] (03PS4) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly determine latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [13:48:24] (03CR) 10Tchanders: "We have the go-ahead from product and comms." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155683 (https://phabricator.wikimedia.org/T396464) (owner: 10Tchanders) [13:49:17] (03CR) 10Volans: [C:03+2] phabricator: expand support for Phabricator tasks [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154786 (owner: 10Volans) [13:49:29] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10918610 (10herron) [13:51:07] (03PS1) 10Vgutierrez: hiera: Switch esams to unified cert issued by GTS [puppet] - 10https://gerrit.wikimedia.org/r/1159462 (https://phabricator.wikimedia.org/T395131) [13:51:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159462 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez) [13:51:25] (03PS5) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly determine latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [13:51:45] FIRING: SLOMetricAbsent: wdqs-availability magru - https://slo.wikimedia.org/?search=wdqs-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:52:25] (03CR) 10Ssingh: [C:03+1] "!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1159462 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez) [13:52:53] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-backup2001.codfw.wmnet with reason: Maintenance and reboot [13:53:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78049 and previous config saved to /var/cache/conftool/dbconfig/20250616-135301-root.json [13:54:29] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159444|Try subresource JS autologin on SUL3 domain first if configured (T391284)]], [[gerrit:1159446|Fix adding warnings to ParserOutput (T396768)]] (duration: 13m 09s) [13:54:35] T391284: Swap order of central autologin lookup for loginwiki and shared domain - https://phabricator.wikimedia.org/T391284 [13:54:35] T396768: InvalidArgumentException: Scalar parameter must be a string, number, Stringable, or MessageSpecifier; got array - https://phabricator.wikimedia.org/T396768 [13:54:37] (03CR) 10Brouberol: [C:03+1] "Good spot, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1159293 (https://phabricator.wikimedia.org/T395557) (owner: 10Muehlenhoff) [13:55:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P78050 and previous config saved to /var/cache/conftool/dbconfig/20250616-135507-fceratto.json [13:55:11] thanks phuedx [13:55:55] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch esams to unified cert issued by GTS [puppet] - 10https://gerrit.wikimedia.org/r/1159462 (https://phabricator.wikimedia.org/T395131) (owner: 10Vgutierrez) [13:56:04] (03Merged) 10jenkins-bot: phabricator: expand support for Phabricator tasks [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1154786 (owner: 10Volans) [13:56:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T396130)', diff saved to https://phabricator.wikimedia.org/P78051 and previous config saved to /var/cache/conftool/dbconfig/20250616-135605-marostegui.json [13:56:09] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [13:56:15] jouncebot: nowandnext [13:56:16] For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1300) [13:56:16] In 1 hour(s) and 33 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1530) [13:56:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380) (owner: 10Jcrespo) [13:56:43] mszabo: No change in the logs after those deployments. Over to you :) [13:57:26] thanks! 
[13:57:39] !log use Google Trust Services (GTS) unified TLS certificate on esams - T395131 [13:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:43] T395131: Replace Digicert TLS certs with Google Trust Services ones - https://phabricator.wikimedia.org/T395131 [13:58:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/Echo] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159438 (https://phabricator.wikimedia.org/T58074) (owner: 10Máté Szabó) [13:58:52] PROBLEM - Host prometheus7002 is DOWN: PING CRITICAL - Packet loss = 100% [13:59:50] May need to revert the enabling of the onboarding dialog (Tchanders change) [14:00:34] (03PS6) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [14:00:57] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [14:01:05] 06SRE, 06Infrastructure-Foundations: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#10918659 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:01:07] (03CR) 10CI reject: [V:04-1] profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [14:01:15] (03CR) 10Jcrespo: [C:03+2] dbbackups: Advance snapshot dbbackups start time by 4 hours [puppet] - 10https://gerrit.wikimedia.org/r/1159439 (https://phabricator.wikimedia.org/T389380) (owner: 10Jcrespo) [14:01:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10918663 (10herron) Hello! 
Here are a few next-steps to complete before proceeding with access: * @KFrancis could you please confirm NDA for @AndyRussG / @AndyRussG_volunte... [14:01:29] Dreamy_Jazz: There's a large gap between now and the Wikimedia Portals Update. We've got a lot of room :) [14:01:38] Sure. Thanks. [14:01:47] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye [14:01:49] (03PS1) 10Urbanecm: [Growth] Prepare for the Get Started notification experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) [14:01:53] (03CR) 10Effie Mouzeli: "Yeah I agree, I do not have strong opinions either" [puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [14:02:05] (03CR) 10Urbanecm: [C:04-2] "needs Kirsten's confirmation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159465 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [14:02:42] Dreamy_Jazz: I already have a running deploy, but hopefully should be done soon [14:04:39] (03PS7) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [14:05:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2002.codfw.wmnet with OS bookworm [14:05:08] (03CR) 10CI reject: [V:04-1] profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [14:05:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:05:17] (03CR) 10Ilias Sarantopoulos: [C:04-1] ores-extension: enable extension with revertrisk filter for the third batch of wikis (038 comments) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [14:05:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:05:49] (03PS8) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [14:06:18] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [14:07:11] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1156837 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [14:07:14] (03CR) 10Scott French: [C:03+2] alertmanager: update data-persistence-task phid [puppet] - 10https://gerrit.wikimedia.org/r/1156837 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [14:08:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78052 and previous config saved to /var/cache/conftool/dbconfig/20250616-140807-root.json [14:08:16] Worked out the issue with Tchanders change. It's an issue with a translation and we have decided to leave it enabled. 
[14:09:51] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye [14:10:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T395241)', diff saved to https://phabricator.wikimedia.org/P78053 and previous config saved to /var/cache/conftool/dbconfig/20250616-141016-fceratto.json [14:10:18] (03Merged) 10jenkins-bot: Add missing labels for email confirmation reminder preferences [extensions/Echo] (wmf/1.45.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1159438 (https://phabricator.wikimedia.org/T58074) (owner: 10Máté Szabó) [14:10:36] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1159438|Add missing labels for email confirmation reminder preferences (T58074)]] [14:10:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [14:10:40] T58074: Echo: Generate periodic web notification to nudge users to confirm an unverified email address - https://phabricator.wikimedia.org/T58074 [14:10:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T395241)', diff saved to https://phabricator.wikimedia.org/P78054 and previous config saved to /var/cache/conftool/dbconfig/20250616-141044-fceratto.json [14:11:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P78055 and previous config saved to /var/cache/conftool/dbconfig/20250616-141113-marostegui.json [14:13:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:13:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:14:46] (03PS1) 10Andrew Bogott: All cloudcephmon nodes to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1159469 [14:14:54] (03CR) 10Ssingh: [C:03+1] "Thanks! 
We have another bird-related change being rolled out today, just in case you were planning to merge it today. Tomorrow should be g" [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi) [14:15:27] (03PS2) 10Andrew Bogott: All cloudcephmon nodes to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1159469 [14:15:29] (03CR) 10Bking: [C:03+2] elasticsearch: filter LVS config based on cluster membership [puppet] - 10https://gerrit.wikimedia.org/r/1159459 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [14:15:52] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159469 (owner: 10Andrew Bogott) [14:17:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:17:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:17:26] RESOLVED: SystemdUnitFailed: netbox_ganeti_magru01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:17:55] FIRING: SystemdUnitFailed: netbox_ganeti_magru01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:10] (03CR) 10Andrew Bogott: [C:03+2] All cloudcephmon nodes to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1159469 (owner: 10Andrew Bogott) [14:18:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10918768 (10MoritzMuehlenhoff) [14:19:27] Dreamy_Jazz: Noted. mszabo: Is the deployment still running? 
[14:19:51] !log upload liberica 0.19 to apt.wm.o (bookworm-wikimedia) - T397036 [14:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:56] T397036: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036 [14:19:58] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159470 [14:20:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T395241)', diff saved to https://phabricator.wikimedia.org/P78056 and previous config saved to /var/cache/conftool/dbconfig/20250616-142017-fceratto.json [14:21:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:21:33] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:22:10] 10ops-codfw, 06SRE, 06DC-Ops: Moving extra 1G port to make 10G space on cloud rack. - https://phabricator.wikimedia.org/T396363#10918783 (10Andrew) 05Open→03Resolved a:03Andrew [14:23:18] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10918790 (10Andrew) @Jhancock.wm any more blockers to this? There's no actual rush although finishing this will help me a bit with T309789 as it will allow... 
[14:23:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:24:19] (03PS9) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [14:24:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:24:27] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10918794 (10herron) @WMDE-leszek is this for a contract with end-date, or for ongoing access? [14:25:38] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [14:25:51] (03PS1) 10Brouberol: airflow: hotfix, remove duplicated env variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159471 (https://phabricator.wikimedia.org/T369845) [14:25:58] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154010 (owner: 10PipelineBot) [14:26:00] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1154800 (owner: 10PipelineBot) [14:26:03] (03Abandoned) 10Jforrester: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155703 (owner: 10PipelineBot) [14:26:16] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1155735 (owner: 10PipelineBot) [14:26:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P78057 and previous config saved to /var/cache/conftool/dbconfig/20250616-142620-marostegui.json [14:26:33] (03CR) 10Scott French: "Thanks for the review!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1156838 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [14:26:47] (03PS2) 10Brouberol: airflow: hotfix, remove duplicated env variables and volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159471 (https://phabricator.wikimedia.org/T369845) [14:26:51] (03CR) 10Scott French: [C:03+2] sessionstore-resources: move SessionStoreDiskSpaceRunwayTooLow to task [alerts] - 10https://gerrit.wikimedia.org/r/1156838 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [14:28:10] (03CR) 10Majavah: [C:03+1] prometheus: move blackbox-exporter to log prober errors [puppet] - 10https://gerrit.wikimedia.org/r/1143810 (https://phabricator.wikimedia.org/T385022) (owner: 10Filippo Giunchedi) [14:28:18] phuedx: yeah, still building the image [14:28:28] !log upgrade to liberica 0.19 in lvs1013 - T397036 [14:28:28] (03PS10) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) [14:28:30] (03Merged) 10jenkins-bot: sessionstore-resources: move SessionStoreDiskSpaceRunwayTooLow to task [alerts] - 10https://gerrit.wikimedia.org/r/1156838 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [14:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:31] T397036: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036 [14:28:34] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs1013.eqiad.wmnet} and A:liberica (T397036) [14:28:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:29:06] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs1013.eqiad.wmnet} and A:liberica (T397036) [14:29:13] jouncebot: nowandnext [14:29:13] No deployments scheduled for the next 1 hour(s) and 0 minute(s) [14:29:13] In 1 hour(s) and 0 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1530) [14:30:00] 14:26:12 [root] Image builds completed is the last log I see locally [14:30:03] Decided that we do want to undo Tchanders [14:30:12] *Tchanders change [14:31:09] (03CR) 10Brouberol: [C:03+2] airflow: hotfix, remove duplicated env variables and volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159471 (https://phabricator.wikimedia.org/T369845) (owner: 10Brouberol) [14:31:17] (03PS1) 10Dreamy Jazz: Revert "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159477 [14:31:30] should be done in a sec now, it's finally deploying to testservers [14:31:56] (03PS2) 10Dreamy Jazz: Revert "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159477 [14:32:02] (03CR) 10Filippo Giunchedi: "For sure, I'll merge tomorrow EU morning" [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi) [14:32:12] (03CR) 10Filippo Giunchedi: "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1159399 (https://phabricator.wikimedia.org/T374842) (owner: 10Filippo Giunchedi) [14:32:43] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2004-dev.codfw.wmnet with reason: host reimage [14:33:28] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:12] Started a spiderpig job for the revert given that everything else in the window seems to be done and just waiting on this last one to merge. [14:35:23] seems like my deploy is stuck on one of the non-k8s testservers for 4mins now [14:35:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155738 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:35:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P78058 and previous config saved to /var/cache/conftool/dbconfig/20250616-143525-fceratto.json [14:36:28] !log jmm@cumin1003 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus7002.magru.wmnet to drbd [14:36:48] FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:36:49] RECOVERY - Host prometheus7002 is UP: PING OK - Packet loss = 0%, RTA = 115.42 ms [14:36:55] 06SRE-OnFire, 10Cassandra, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10918872 (10Scott_French) [14:37:48] 06SRE-OnFire, 10Cassandra, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10918882 (10Scott_French) 05Open→03Resolved With the alert routing and severity changes now merged, I believe that wrap... 
[14:38:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wcqs2001:9195 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:39:11] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10918887 (10Jelto) >>! In T378922#10918550, @jcrespo wrote: > > I'm sorry, but I thought that was an "outline", a summary of our d... [14:39:31] RESOLVED: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:38] RESOLVED: SLOMetricAbsent: wdqs-availability magru - https://slo.wikimedia.org/?search=wdqs-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [14:40:13] Dreamy_Jazz: I'd say go ahead, scap clearly isn't collaborating with me today - I'm not sure if it allows you to go ahead in the present state, I can kill my deployment process as needed [14:40:34] I would need to wait for your scap lock to be released. 
[14:40:42] mszabo: if you do that, your patch will get deployed [14:40:51] when the next scap run goes [14:41:02] 14:40:23 Started scap-cdb-rebuild-testservers [14:41:06] it's watching us, clearly [14:41:10] :D [14:41:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T396130)', diff saved to https://phabricator.wikimedia.org/P78059 and previous config saved to /var/cache/conftool/dbconfig/20250616-144127-marostegui.json [14:41:32] T396130: Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log - https://phabricator.wikimedia.org/T396130 [14:41:34] I'd advise letting it finish :D [14:41:36] claime: yeah that would have been fine since I could have checked it on the testservers in the next attempt [14:41:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [14:41:45] but now hopefully we've broken the impasse [14:41:48] FIRING: [3x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:41:55] The reason it's going slowly is because your backport had i18n changes [14:42:14] Changing i18n in backports makes everything really slow [14:42:31] yeah fair, I wonder why there are non-k8s test servers in there though - I thought the non-k8s mwdebug was gone already [14:42:34] 06SRE, 06Infrastructure-Foundations, 10vm-requests: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051 (10jijiki) 03NEW [14:42:51] that's T276994 apparently [14:42:51] T276994: Provide an mwdebug functionality on kubernetes (mw-experimental) - https://phabricator.wikimedia.org/T276994 [14:42:58] 06SRE, 06Infrastructure-Foundations, 10vm-requests: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10918906 
(10jijiki) [14:43:25] FIRING: SystemdUnitFailed: wdqs-updater.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:56] 06SRE, 06Infrastructure-Foundations, 10vm-requests: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10918909 (10jijiki) [14:46:48] FIRING: [5x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:47:10] (03PS6) 10Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [14:48:29] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1159438|Add missing labels for email confirmation reminder preferences (T58074)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:48:34] T58074: Echo: Generate periodic web notification to nudge users to confirm an unverified email address - https://phabricator.wikimedia.org/T58074 [14:49:05] 06SRE, 10Wikimedia-Mailing-lists: Add link to list archives in default footer - https://phabricator.wikimedia.org/T284256#10918944 (10Effeietsanders) I ran into this again as admin, who received a reminder of pending moderation requests. That currently has no link to Posterius and it's actually quite a few cli... 
[14:49:16] !log mszabo@deploy1003 mszabo: Continuing with sync [14:49:22] yay [14:49:56] \o/ [14:50:17] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2004-dev.codfw.wmnet with OS bullseye [14:50:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P78060 and previous config saved to /var/cache/conftool/dbconfig/20250616-145032-fceratto.json [14:50:55] I wonder if there's a Phab task for making our deployment tooling flag that a change has i18n changes and so will take $aLongTime [14:51:48] FIRING: [6x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:52:39] well it did tell me it rebuilt the localization cache, I just didn't draw the proper conclusion :) [14:53:08] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs7002.magru.wmnet} and A:liberica (T397036) [14:53:12] T397036: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036 [14:53:22] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin depooling P{lvs7002.magru.wmnet} and A:liberica [14:53:32] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7002.magru.wmnet} and A:liberica [14:53:55] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin pooling P{lvs7002.magru.wmnet} and A:liberica [14:54:01] (03CR) 10Gkyziridis: "Much appreciated that you worked in this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [14:54:19] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs7002.magru.wmnet} and A:liberica [14:54:20] (03PS1) 10Ssingh: hiera: set do_ech to false for durum1001 [puppet] - 10https://gerrit.wikimedia.org/r/1159486 [14:54:20] (03PS1) 10Ssingh: hiera: set do_ech to false for durum1002 [puppet] - 10https://gerrit.wikimedia.org/r/1159487 [14:54:20] (03PS1) 10Ssingh: hiera: set do_ech to false for durum2001 [puppet] - 10https://gerrit.wikimedia.org/r/1159488 [14:54:21] (03PS1) 10Ssingh: hiera: set do_ech to false for durum2002 [puppet] - 10https://gerrit.wikimedia.org/r/1159489 [14:54:21] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs7002.magru.wmnet} and A:liberica (T397036) [14:54:22] (03PS1) 10Ssingh: hiera: set do_ech to false for durum4001 [puppet] - 10https://gerrit.wikimedia.org/r/1159490 [14:54:23] (03PS1) 10Ssingh: hiera: set do_ech to false for durum4002 [puppet] - 10https://gerrit.wikimedia.org/r/1159491 [14:54:27] (03PS1) 10Ssingh: hiera: set do_ech to false for durum5001 [puppet] - 10https://gerrit.wikimedia.org/r/1159492 [14:54:31] (03PS1) 10Ssingh: hiera: set do_ech to false for durum5002 [puppet] - 10https://gerrit.wikimedia.org/r/1159493 [14:54:35] (03PS1) 10Ssingh: hiera: set do_ech to false for durum6001 [puppet] - 10https://gerrit.wikimedia.org/r/1159494 [14:54:39] (03PS1) 10Ssingh: hiera: set do_ech to false for durum6002 [puppet] - 10https://gerrit.wikimedia.org/r/1159495 [14:54:43] (03PS1) 10Ssingh: hiera: set do_ech to false for durum7002 [puppet] - 10https://gerrit.wikimedia.org/r/1159496 [14:54:47] (03PS1) 10Ssingh: hiera: set do_ech to false for durum7003 [puppet] - 10https://gerrit.wikimedia.org/r/1159497 [14:54:51] (03PS1) 10Ssingh: hiera: set do_ech to false for durum3003 [puppet] - 
10https://gerrit.wikimedia.org/r/1159498 [14:54:56] (03PS1) 10Ssingh: hiera: set do_ech to false for durum3004 [puppet] - 10https://gerrit.wikimedia.org/r/1159499 [14:55:08] 🍿 [14:56:48] FIRING: [7x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:57:14] 06SRE, 06Infrastructure-Foundations, 10vm-requests: 2 VMs for mw-experimental - https://phabricator.wikimedia.org/T397051#10918982 (10MoritzMuehlenhoff) Looks fine, please use codfw/row C and eqiad/row B [14:57:25] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.upgrade upgradeing P{lvs7001.magru.wmnet} and A:liberica (T397036) [14:57:39] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin depooling P{lvs7001.magru.wmnet} and A:liberica [14:57:51] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7001.magru.wmnet} and A:liberica [14:58:12] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.admin pooling P{lvs7001.magru.wmnet} and A:liberica [14:58:37] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs7001.magru.wmnet} and A:liberica [14:58:39] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) upgradeing P{lvs7001.magru.wmnet} and A:liberica (T397036) [14:58:44] T397036: Provide connections stats for katran - https://phabricator.wikimedia.org/T397036 [15:00:40] (03CR) 10Ssingh: "Plan is to merge per host and reimage." 
[puppet] - 10https://gerrit.wikimedia.org/r/1159486 (owner: 10Ssingh) [15:01:48] FIRING: [6x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:01:57] (03PS7) 10Ilias Sarantopoulos: ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [15:02:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10919002 (10MoritzMuehlenhoff) [15:02:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:03:05] msazbo: Where are you in the deployment now? 
[15:03:15] *mszabo: [15:03:34] Dreamy_Jazz: any second now [15:03:51] 3 2 1 [15:04:04] (03CR) 10Gkyziridis: [C:03+1] ores-extension: enable extension with revertrisk filter for the third batch of wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [15:04:05] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159438|Add missing labels for email confirmation reminder preferences (T58074)]] (duration: 53m 29s) [15:04:09] boom [15:04:09] T58074: Echo: Generate periodic web notification to nudge users to confirm an unverified email address - https://phabricator.wikimedia.org/T58074 [15:04:16] (03CR) 10Ilias Sarantopoulos: "I removed them!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [15:04:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159477 (owner: 10Dreamy Jazz) [15:05:38] (03Merged) 10jenkins-bot: Revert "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159477 (owner: 10Dreamy Jazz) [15:05:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T395241)', diff saved to https://phabricator.wikimedia.org/P78062 and previous config saved to /var/cache/conftool/dbconfig/20250616-150541-fceratto.json [15:05:52] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1159477|Revert "Enable temporary accounts onboarding dialog on WMF wikis"]] [15:06:03] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [15:06:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T395241)', diff saved to https://phabricator.wikimedia.org/P78063 and previous config saved 
to /var/cache/conftool/dbconfig/20250616-150609-fceratto.json [15:06:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156515 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:48] FIRING: [7x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1112:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:07:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:09:55] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1159477|Revert "Enable temporary accounts onboarding dialog on WMF wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:11:26] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) [15:11:48] RESOLVED: [2x] PuppetZeroResources: Puppet has failed generate resources on cirrussearch1115:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:14:32] (03CR) 10Herron: [C:03+1] thanos: enable tracing for rule [puppet] - 10https://gerrit.wikimedia.org/r/1159355 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [15:16:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T395241)', diff saved to https://phabricator.wikimedia.org/P78064 and previous config saved to /var/cache/conftool/dbconfig/20250616-151641-fceratto.json [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:27] !log bking@cumin2002:~$ sudo cumin A:lvs-low-traffic 'run-puppet-agent' T387569 [15:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:31] T387569: Update Elastic puppet code to filter LVS config based on cluster membership - https://phabricator.wikimedia.org/T387569 [15:17:51] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye [15:18:40] (03PS1) 10Effie Mouzeli: site.pp: add wikikube-worker-exp(1001|2001) [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) [15:19:22] (03CR) 10CI reject: [V:04-1] site.pp: add wikikube-worker-exp(1001|2001) [puppet] - 
10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [15:20:04] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet [15:20:35] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrading clouddbs T394372 [15:20:39] T394372: Migrate clouddb* hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372 [15:22:03] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [15:23:22] (03PS2) 10Effie Mouzeli: site.pp: add wikikube-worker-exp(1001|2001) [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) [15:23:30] (03CR) 10FNegri: [C:03+2] clouddb1013: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1154804 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [15:23:43] 06SRE, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10919161 (10elukey) 05Resolved→03Open [15:23:56] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10919163 (10elukey) [15:25:33] (03Abandoned) 10Effie Mouzeli: mw-experimental-mediawiki-image-update: improve script [puppet] - 10https://gerrit.wikimedia.org/r/1153999 (owner: 10Effie Mouzeli) [15:27:09] (03CR) 10Effie Mouzeli: [C:03+1] memcached: Switch to profile::memcached::firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156269 (owner: 10Muehlenhoff) [15:27:18] (03CR) 10Effie Mouzeli: [C:03+1] memcached/gutter: Switch to firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156659 (owner: 10Muehlenhoff) [15:27:42] (03CR) 10Effie Mouzeli: [C:03+1] mediawiki/memcached: Switch to firewall_src_sets [puppet] - 10https://gerrit.wikimedia.org/r/1156669 (owner: 
10Muehlenhoff) [15:28:52] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10919213 (10elukey) I am reopening this task since I assumed something about https://wikitech.wikimedia.org/wiki/SLO/Citoid without reading it cor... [15:29:46] !log decommissioning sessionstore2004-a/Cassandra — T391544 [15:29:50] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159503 (https://phabricator.wikimedia.org/T128546) [15:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:51] T391544: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544 [15:30:05] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1530). [15:30:40] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159477|Revert "Enable temporary accounts onboarding dialog on WMF wikis"]] (duration: 24m 48s) [15:31:39] 👋Just fyi, I'm actually going to do the Portals update today [15:31:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P78065 and previous config saved to /var/cache/conftool/dbconfig/20250616-153148-fceratto.json [15:34:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 21.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:35:51] (03CR) 10Eevans: [C:03+2] sessionstore2004: reimage as JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1153150 
(https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [15:37:19] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [15:38:28] FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:39:42] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159503 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:40:23] (03PS1) 10Bking: cirrussearch: remove non-existent hosts [puppet] - 10https://gerrit.wikimedia.org/r/1159507 (https://phabricator.wikimedia.org/T388610) [15:40:39] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159503 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:41:15] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [15:42:20] (03PS1) 10Jgiannelos: RB sunset: debug spike in changeprop events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159508 [15:42:48] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10919278 (10WMDE-leszek) Thank @herron, missed part of the template it seems. 
It is about a time-limited contract. Updating task description [15:43:25] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10919281 (10WMDE-leszek) [15:43:28] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [15:43:41] (03CR) 10Bking: [C:03+2] cirrussearch: remove non-existent hosts [puppet] - 10https://gerrit.wikimedia.org/r/1159507 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:44:17] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time (this is blocking a more important change)" [puppet] - 10https://gerrit.wikimedia.org/r/1159507 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [15:44:38] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum1001 [puppet] - 10https://gerrit.wikimedia.org/r/1159486 (owner: 10Ssingh) [15:46:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P78066 and previous config saved to /var/cache/conftool/dbconfig/20250616-154656-fceratto.json [15:47:01] (03CR) 10Gkyziridis: [C:03+1] ores-extension: enable extension with revertrisk filter for the third batch of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [15:47:02] (03PS1) 10Vgutierrez: hiera: Issue a separate LE cert for upload cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159510 (https://phabricator.wikimedia.org/T394484) [15:47:09] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet [15:47:40] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet [15:47:54] !log 
sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum1001.eqiad.wmnet with OS bookworm [15:48:10] (03PS2) 10Vgutierrez: hiera: Issue a separate LE cert for upload cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159510 (https://phabricator.wikimedia.org/T394484) [15:49:18] (03PS1) 10Vgutierrez: hiera: Issue a separate GTS cert for upload cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/1159511 (https://phabricator.wikimedia.org/T394484) [15:49:28] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159510 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:49:33] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye [15:49:44] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919340 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.c... 
[15:51:29] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1159511 (https://phabricator.wikimedia.org/T394484) (owner: 10Vgutierrez) [15:52:03] (03PS1) 10Scott French: Revert^4 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1159513 (https://phabricator.wikimedia.org/T389786) [15:52:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 10.64.48.95 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:53:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155652 (https://phabricator.wikimedia.org/T395824) (owner: 10Gkyziridis) [15:53:29] !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1159503| Bumping portals to master (T128546)]] (duration: 09m 21s) [15:53:33] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:55:49] !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1159503| Bumping portals to master (T128546)]] (duration: 02m 19s) [15:55:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [15:56:27] !log dancy@deploy1003 Installing scap version "4.175.0" for 2 host(s) [15:57:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.48.95 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:57:13] (03CR) 10Clément Goubert: [C:03+1] Revert^4 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1159513 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [15:58:17] !log dancy@deploy1003 Installation of scap version "4.175.0" completed for 2 hosts [15:58:28] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:57] (03CR) 10Btullis: [C:03+1] Revert^4 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1159513 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [16:01:19] (03PS2) 10Jgiannelos: RB sunset: debug spike in changeprop events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159508 [16:02:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T395241)', diff saved to https://phabricator.wikimedia.org/P78067 and previous config saved to /var/cache/conftool/dbconfig/20250616-160203-fceratto.json [16:02:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [16:02:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T395241)', diff saved to https://phabricator.wikimedia.org/P78068 and previous config saved to /var/cache/conftool/dbconfig/20250616-160220-fceratto.json [16:03:06] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage [16:03:36] (03CR) 10Hnowlan: [C:03+1] RB sunset: debug spike in changeprop events [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1159508 (owner: 10Jgiannelos) [16:04:59] (03PS1) 10Bking: cirrussearch: move soon-to-be-decommed hosts to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1159517 (https://phabricator.wikimedia.org/T395855) [16:05:09] (03PS1) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) [16:05:41] (03CR) 10Bking: [C:04-1] "Do not merge until we safely remove these hosts from the cluster." [puppet] - 10https://gerrit.wikimedia.org/r/1159517 (https://phabricator.wikimedia.org/T395855) (owner: 10Bking) [16:05:48] (03CR) 10Jgiannelos: [C:03+2] RB sunset: debug spike in changeprop events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159508 (owner: 10Jgiannelos) [16:06:40] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1001.eqiad.wmnet with reason: host reimage [16:07:37] (03Merged) 10jenkins-bot: RB sunset: debug spike in changeprop events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159508 (owner: 10Jgiannelos) [16:08:28] (03PS1) 10Effie Mouzeli: hieradata: make wikikube-worker2100 a normal worker [puppet] - 10https://gerrit.wikimedia.org/r/1159519 [16:08:54] (03PS1) 10Ebernhardson: Turn off glent m1 AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159520 (https://phabricator.wikimedia.org/T262612) [16:09:26] jouncebot: nowandnext [16:09:26] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [16:09:26] In 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1700) [16:09:27] In 0 hour(s) and 50 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1700) [16:09:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 16 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159520 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [16:09:47] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [16:09:50] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:09:57] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [16:10:04] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [16:10:13] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [16:10:48] (03CR) 10Muehlenhoff: site.pp: make wikikube-worker-exp* k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [16:11:05] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [16:11:47] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [16:11:53] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:12:18] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:12:24] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [16:12:33] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:13:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T395241)', diff saved to https://phabricator.wikimedia.org/P78069 and previous config saved to /var/cache/conftool/dbconfig/20250616-161303-fceratto.json [16:13:28] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:13:32] (03PS2) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) [16:14:13] (03PS3) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) [16:14:44] (03CR) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [16:18:28] RESOLVED: ProbeDown: Service sessionstore2004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:21:18] (03PS1) 10Effie Mouzeli: kubernetes.yaml: switch mw-experimental to debug image [puppet] - 10https://gerrit.wikimedia.org/r/1159524 (https://phabricator.wikimedia.org/T396767) [16:23:27] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1001.eqiad.wmnet with OS bookworm [16:23:57] (03PS1) 10Clare Ming: xLab: Deploy v0.7.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159526 (https://phabricator.wikimedia.org/T392898) [16:27:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.48.95 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:28:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P78070 and previous config saved to 
/var/cache/conftool/dbconfig/20250616-162810-fceratto.json [16:30:21] (03PS2) 10Cathal Mooney: Rename cloud-in to cloud-vrf-in [homer/public] - 10https://gerrit.wikimedia.org/r/1159415 [16:30:58] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.7.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159526 (https://phabricator.wikimedia.org/T392898) (owner: 10Clare Ming) [16:32:28] (03Merged) 10jenkins-bot: xLab: Deploy v0.7.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159526 (https://phabricator.wikimedia.org/T392898) (owner: 10Clare Ming) [16:37:25] (03PS1) 10Eevans: sessionstore: use correct partman preseed [puppet] - 10https://gerrit.wikimedia.org/r/1159530 (https://phabricator.wikimedia.org/T391544) [16:39:41] (03CR) 10Eevans: [C:03+2] sessionstore: use correct partman preseed [puppet] - 10https://gerrit.wikimedia.org/r/1159530 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [16:41:07] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum1002 [puppet] - 10https://gerrit.wikimedia.org/r/1159487 (owner: 10Ssingh) [16:41:15] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum2001 [puppet] - 10https://gerrit.wikimedia.org/r/1159488 (owner: 10Ssingh) [16:41:58] (03PS1) 10Hnowlan: Revert "changeprop: Remove rules related to parsoid (RB sunset)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159535 (https://phabricator.wikimedia.org/T397072) [16:42:20] eevans@cumin1003 reimage (PID 1845984) is awaiting input [16:43:17] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum1002.eqiad.wmnet with OS bookworm [16:43:18] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum2001.codfw.wmnet with OS bookworm [16:43:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P78071 and previous config saved to /var/cache/conftool/dbconfig/20250616-164317-fceratto.json 
[16:43:59] (03CR) 10Eevans: [C:03+2] sessionstore2004: reconfigure for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1153151 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [16:44:25] andrew@cumin1002 reimage (PID 2661898) is awaiting input [16:44:47] (03PS2) 10Eevans: sessionstore2004: reconfigure for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1153151 (https://phabricator.wikimedia.org/T391544) [16:45:02] !log eevans@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2004.codfw.wmnet with OS bullseye [16:45:15] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919690 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw... [16:45:30] (03CR) 10Scott French: [C:03+1] site.pp: add wikikube-worker-exp(1001|2001) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [16:46:10] PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100% [16:46:18] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye [16:46:28] (03CR) 10Eevans: [C:03+2] sessionstore2004: reconfigure for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1153151 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [16:46:34] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919692 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.c... [16:47:52] sukhe: hi, that seems to be a durum issue. should I take a look? 
[16:48:06] talking about the icinga alert above for wikimedia-dns.org [16:48:10] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:48:11] mutante: hi, no worries, this is a bit weird though because the other hosts should be up [16:48:18] checking and thanks [16:48:25] sukhe: alright, thanks as well [16:48:55] we have one each host in eqiad and codfw up for example and serving traffic [16:49:32] the DNS lookup of check.wikimedia-dns.org works for me :P [16:50:13] so v6 ping fails from the Icinga host but v4 works. that's weird though, because the other hosts are advertising the v6 IP [16:50:18] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-eqsin and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [16:50:18] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=97) rolling upgrade of Varnish on A:cp-eqsin and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [16:50:22] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [16:50:23] except the v6 reverse record does not exist [16:50:23] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-eqsin and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [16:51:04] hmm weird [16:51:39] My bad! [16:51:41] no recent changes in DNS repo [16:51:42] (not the v6 records) [16:52:26] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10919719 (10VRiley-WMF) After looking at this unit, it seems like the server is healthy. @Eevans can you confirm these drives are actually bad? If so, which drives need to be replaced? 
[16:52:43] so yeah, durum2002 is advertising the v6'es alright [16:52:59] brett: ssh-keygen -f '/home/mutante/.ssh/known_hosts.d/wmf-prod' -R 'durum2002.codfw.wmnet' [16:53:20] eh.. that was entirely a bad paste. sorry [16:53:26] okay, was confused :) [16:53:31] ignore:) [16:53:50] (03PS1) 10Eevans: sessionstore2004: expand configuration w/ 4 new devices [puppet] - 10https://gerrit.wikimedia.org/r/1159537 (https://phabricator.wikimedia.org/T391544) [16:54:19] Are the bfd alarms intentional? [16:54:32] they are expected for sure, given the durum hosts reimaging [16:54:41] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1155190 (https://phabricator.wikimedia.org/T394318) (owner: 10Filippo Giunchedi) [16:54:41] sweet, thanks [16:54:53] 10.192.32.58 for example is durum2001 [16:55:10] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sessionstore2004.codfw.wmnet with OS bullseye [16:55:31] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919724 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sessionstore2004.codfw... [16:55:55] sukhe: curl -6 https://check.wikimedia-dns.org [16:56:02] works for me as well from durum2002 [16:56:23] just not from alert1002 [16:56:25] firewalling? [16:56:26] yeah it's weird for sure... 
[16:57:05] ok let's see when the two durum hosts come up [16:57:11] not just TCP, also ICMP / ping6 is dropped [16:57:16] ack [16:57:18] because I can't reach the v6 from Icinga but can reach v4 [16:57:18] yeah [16:58:00] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [16:58:21] (03CR) 10Effie Mouzeli: kubernetes: create mediawiki_experimental profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [16:58:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T395241)', diff saved to https://phabricator.wikimedia.org/P78072 and previous config saved to /var/cache/conftool/dbconfig/20250616-165825-fceratto.json [16:58:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance [16:58:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T395241)', diff saved to https://phabricator.wikimedia.org/P78073 and previous config saved to /var/cache/conftool/dbconfig/20250616-165855-fceratto.json [16:59:10] (03CR) 10Effie Mouzeli: profile::kubernetes::mediawiki_experimental: properly update latest image (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [16:59:55] (03PS3) 10Effie Mouzeli: site.pp: add wikikube-worker-exp(1001|2001) [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) [17:00:04] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1700). [17:00:04] ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T1700). [17:00:56] (03CR) 10Scott French: [C:03+1] "This seems to combine steps #2 and #3 from [0]. Do we actually want that? (i.e., do we want to refrain from adding to the conftool entitie" [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [17:01:10] o/ [17:01:41] o/ [17:02:52] (03CR) 10Scott French: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1159513 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [17:02:56] (03CR) 10Scott French: [C:03+2] Revert^4 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1159513 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [17:03:11] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [17:03:31] btullis: I didn't realize you'd be around, thank you :) [17:03:35] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage [17:04:47] (03PS4) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) [17:05:43] (03PS5) 10Effie Mouzeli: site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) [17:05:59] (03CR) 10Effie Mouzeli: "good point!" [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [17:06:27] swfrench-wmf: is it 4th time lucky? 
:-) [17:06:52] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye [17:07:05] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919781 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host sessionstore2004.c... [17:07:12] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2001.codfw.wmnet with reason: host reimage [17:07:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T395241)', diff saved to https://phabricator.wikimedia.org/P78074 and previous config saved to /var/cache/conftool/dbconfig/20250616-170726-fceratto.json [17:07:41] sukhe: on durum2002, if I do a "nft list ruleset | less" and look at the PRODUCTION_NETWORKS_ipv6 there are a bunch of networks including 2620:0:861:300:* but does it NOT cover 2620:0:861:3:* which is the one bound on alert1002? or am I not missing it [17:07:56] eh, not getting it/missing it:) [17:11:12] !log swfrench@deploy1003 Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - T389786 [17:11:16] T389786: Integrate mediawiki-dumps-legacy with the regular MW scap deployments - https://phabricator.wikimedia.org/T389786 [17:11:31] mutante: durum2002 has ip6 saddr { 2620:0:860:2:208:80:153:42, 2620:0:860:102:10:192:16:75, 2620:0:860:103:10:192:32:67, 2620:0:860:10a:10:192:9:11, 2620:0:860:11e:10:192:39:10, 2620:0:861:3:208:80:154:78 } udp dport 1-65535 accept [17:12:02] which covers the Icinga host I think [17:12:49] and durum2002 works from the Icinga host, so there's that too. 
[17:12:55] !log swfrench@deploy1003 Finished scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - T389786 (duration: 02m 15s) [17:13:10] so durum1002 and 2001 should be coming up [17:13:12] let's see then [17:13:24] btullis: I think we got it this time :) [17:13:28] I'll follow up on the task [17:14:11] FYI, I'm done with planned changes for the UTC-late infra window [17:14:20] yea, you are right [17:15:26] (03CR) 10Eevans: [C:03+2] sessionstore2004: expand configuration w/ 4 new devices [puppet] - 10https://gerrit.wikimedia.org/r/1159537 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [17:16:37] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [17:17:00] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [17:20:14] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1002.eqiad.wmnet with OS bookworm [17:20:19] (03PS1) 10Dzahn: microsites: delete os_reports profile, remove from role [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T350794) [17:20:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:22:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P78075 and previous config saved to /var/cache/conftool/dbconfig/20250616-172234-fceratto.json [17:22:42] (03PS2) 10Dzahn: microsites: delete os_reports profile, remove from role [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080) [17:23:51] swfrench-wmf: Ack, many thanks. 
[17:24:50] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage [17:25:26] (03CR) 10Dzahn: [C:04-1] "The monitoring code needs to be moved and adjusted to k8s first.. let me merge it into miscweb::monitoring and then move that whole file e" [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [17:25:39] RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms [17:25:46] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2001.codfw.wmnet with OS bookworm [17:25:50] (03PS1) 10Ssingh: hiera: durum: set hc for canonical domain to / [puppet] - 10https://gerrit.wikimedia.org/r/1159541 [17:25:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:25:54] interesting, so it is the single instance [17:26:31] mutante: I have not isolated it yet but it was certainly monitoring. I could access the service over v6 and well, durum2002 was up [17:26:44] but yeah, updating the health checks above, just in case [17:27:27] so the actual issue is it should have been using the other durum instance when one goes down? [17:27:33] yes [17:27:37] gotcha [17:27:51] mutante: basically since it's an anycast service [17:28:02] so all 4x durum hosts (2x per eqiad/codfw) advertise the same IPs [17:28:02] yea [17:28:03] (03PS1) 10Bernard Wang: Enable new mobile search experience everywhere (not including empty search recommendations) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159542 [17:28:07] rather, all 14 durum hosts! 
[17:28:51] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sessionstore2004.codfw.wmnet with reason: host reimage [17:29:09] and durum2002 was up (it has not been reimaged) and _should_ have been reachable from icinga [17:29:12] and the ping -6 was [17:29:18] and even the host itself [17:29:24] but why not the domain specifically, not sure [17:29:50] and the DNS record for it is not DYNA or anything, so it does not depend on where it is coming from [17:31:43] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5969/c" [puppet] - 10https://gerrit.wikimedia.org/r/1159541 (owner: 10Ssingh) [17:32:38] (03CR) 10BCornwall: [C:03+1] hiera: durum: set hc for canonical domain to / [puppet] - 10https://gerrit.wikimedia.org/r/1159541 (owner: 10Ssingh) [17:32:51] thanks brett [17:33:00] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: durum: set hc for canonical domain to / [puppet] - 10https://gerrit.wikimedia.org/r/1159541 (owner: 10Ssingh) [17:33:15] !log disable puppet on A:durum to roll out CR 1159541 [17:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:28] (03CR) 10Scott French: [C:03+1] site.pp: add wikikube-worker-exp(1001|2001) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159502 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [17:34:53] sukhe: possibly this: "When Prometheus (or another client) queries the Alertmanager anycast DNS address for health status, it will only reach the closest instance. 
" [17:35:15] arr, that's AI though making this claim, sorry [17:35:25] mutante: hah well in this case, I did query the direct IP as well [17:35:32] (03CR) 10Scott French: [C:03+1] site.pp: make wikikube-worker-exp* k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1159518 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [17:36:05] !log sudo cumin -b1 -s10 'A:durum' 'run-puppet-agent --enable "merging CR 1159541"' [17:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:06] PROBLEM - Host check.wikimedia-dns.org is DOWN: PING CRITICAL - Packet loss = 100% [17:37:23] ^ ha [17:37:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P78076 and previous config saved to /var/cache/conftool/dbconfig/20250616-173741-fceratto.json [17:37:43] (03CR) 10Scott French: [C:03+1] "Thanks, Effie! Agreed with the commit message that a reimage is the cleanest cleanup option." [puppet] - 10https://gerrit.wikimedia.org/r/1159519 (owner: 10Effie Mouzeli) [17:37:44] :( [17:37:47] so very clearly for some reason, it only cares about durum1001 [17:37:50] that's a fun discovery [17:39:21] hrmm... 
#netops #anycast_routing [17:39:28] ha yeah [17:39:34] will first debug and then see [17:42:52] should be up soon but yeah, does not answer the question of why it does not try to reach codfw [17:43:25] I know we don't advertise prefixes from PoPs to core but this is not that [17:43:55] RESOLVED: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.32.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:44:14] RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [17:45:18] (03PS3) 10Cwhite: logstash: Reroute apache.access logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/1153279 (https://phabricator.wikimedia.org/T390215) [17:45:42] (03CR) 10Scott French: [C:03+1] "Thanks, Effie! Feel free to merge without an additional round of review from me once you resolve (or decide to defer) the open comment abo" [puppet] - 10https://gerrit.wikimedia.org/r/1156392 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [17:47:45] (03PS1) 10Dzahn: microsites: adjust monitoring for os_reports to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1159545 (https://phabricator.wikimedia.org/T397080) [17:48:27] (03CR) 10Dzahn: [C:04-1] "before this, there should be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1159545" [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [17:49:31] FIRING: ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:50:26] urandom: Expected? 
[17:50:30] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum2002 [puppet] - 10https://gerrit.wikimedia.org/r/1159489 (owner: 10Ssingh) [17:50:39] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum2002.codfw.wmnet with OS bookworm [17:50:44] * swfrench-wmf is willing to guess yes, but confirmation would be good [17:50:57] urandom was working on this host re: partman issues [17:51:12] https://sal.toolforge.org/log/q-tceZcB8tZ8Ohr0YV3e [17:51:19] so I would say yes [17:52:06] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum6001 [puppet] - 10https://gerrit.wikimedia.org/r/1159494 (owner: 10Ssingh) [17:52:27] (03CR) 10AOkoth: microsites: adjust monitoring for os_reports to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159545 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [17:52:37] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sessionstore2004.codfw.wmnet with OS bullseye [17:52:44] brett: ^ [17:52:48] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10919958 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host sess... 
[17:52:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T395241)', diff saved to https://phabricator.wikimedia.org/P78077 and previous config saved to /var/cache/conftool/dbconfig/20250616-175248-fceratto.json [17:53:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance [17:53:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T395241)', diff saved to https://phabricator.wikimedia.org/P78078 and previous config saved to /var/cache/conftool/dbconfig/20250616-175317-fceratto.json [17:53:28] FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:32] (03CR) 10Ssingh: "I would say that Colombia, Venezuela, Ecuador are the only ones we should consider merging. For the rest, let's iterate one by one. Let me" [dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [17:55:10] FIRING: [2x] BFDdown: BFD session down between cr2-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:57:06] PROBLEM - Host sessionstore2004 is DOWN: PING CRITICAL - Packet loss = 100% [17:58:34] RECOVERY - Host sessionstore2004 is UP: PING OK - Packet loss = 0%, RTA = 30.55 ms [17:59:30] bird/BGP/BFD alerts are expected for the durum hosts. I will point out the non-obvious oes. 
[17:59:33] *ones [18:00:10] FIRING: [4x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:00:18] ^ expected [18:00:19] (03CR) 10Scott French: [C:03+1] kubernetes.yaml: switch mw-experimental to debug image [puppet] - 10https://gerrit.wikimedia.org/r/1159524 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [18:00:54] (03PS2) 10Dzahn: microsites: adjust monitoring for os_reports to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1159545 (https://phabricator.wikimedia.org/T397080) [18:01:05] (03CR) 10Ssingh: [C:03+1] hiera: Add lvs1016 to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [18:01:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T395241)', diff saved to https://phabricator.wikimedia.org/P78079 and previous config saved to /var/cache/conftool/dbconfig/20250616-180141-fceratto.json [18:04:25] (03CR) 10Dzahn: microsites: adjust monitoring for os_reports to k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1159545 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [18:04:28] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 6 hosts with reason: begin decom/remove hosts from cluster [18:04:50] (03CR) 10BCornwall: [C:03+2] hiera: Add lvs1016 to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/1153418 (https://phabricator.wikimedia.org/T387145) (owner: 10BCornwall) [18:05:02] (03CR) 10AOkoth: [C:03+2] microsites: adjust monitoring for os_reports to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1159545 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [18:06:46] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye 
[18:07:38] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum4001 [puppet] - 10https://gerrit.wikimedia.org/r/1159490 (owner: 10Ssingh) [18:08:23] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum4001.ulsfo.wmnet with OS bookworm [18:08:25] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage [18:12:27] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage [18:13:30] !log bootstrapping sessionstore2004-a/Cassandra — T390514 [18:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:31] FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:15:10] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:15:38] ^ expected [18:16:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P78080 and previous config saved to /var/cache/conftool/dbconfig/20250616-181649-fceratto.json [18:20:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920023 (10BCornwall) Okay, so we're ready to reimage lvs1016 but it appears that the mgmt interface isn't reachable. Could dcops look into this, please? 
[18:22:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920042 (10BCornwall) [18:23:38] (03PS3) 10Dzahn: microsites: delete os_reports profile, remove from role [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080) [18:28:38] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage [18:28:59] (03CR) 10AOkoth: [C:03+1] microsites: delete os_reports profile, remove from role [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [18:29:42] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2002.codfw.wmnet with OS bookworm [18:30:10] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:30:24] ^ going away shortly [18:31:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P78081 and previous config saved to /var/cache/conftool/dbconfig/20250616-183156-fceratto.json [18:32:15] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4001.ulsfo.wmnet with reason: host reimage [18:32:30] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10920067 (10jcrespo) Thanks, that's more insightful and helpful, I will give it a think and maybe talk to Matthew and will try to w... 
[18:33:49] (03CR) 10BCornwall: [C:03+1] wikimedia-dns.org: remove TYPE65 record for check [dns] - 10https://gerrit.wikimedia.org/r/1156373 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:35:10] FIRING: [8x] BFDdown: BFD session down between cr1-codfw and 10.192.48.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:39:14] (03CR) 10Dzahn: [C:03+2] microsites: delete os_reports profile, remove from role [puppet] - 10https://gerrit.wikimedia.org/r/1159538 (https://phabricator.wikimedia.org/T397080) (owner: 10Dzahn) [18:41:41] checking why 10.192.48.14 continues to be a problem [18:41:52] it is advertising all the right things, BGP session is up [18:43:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:43:28] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10920086 (10KFrancis) @herron I am confirming an NDA is on file for Andrew Green. Thanks! 
[18:43:34] oh yeah, it did clear up [18:43:38] artifact [18:45:03] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:47:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T395241)', diff saved to https://phabricator.wikimedia.org/P78082 and previous config saved to /var/cache/conftool/dbconfig/20250616-184704-fceratto.json [18:47:15] (03PS1) 10BCornwall: Revert "hiera: Add lvs1016 to high-traffic1" [puppet] - 10https://gerrit.wikimedia.org/r/1159549 [18:47:24] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2221.codfw.wmnet with reason: Maintenance [18:47:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2221 (T395241)', diff saved to https://phabricator.wikimedia.org/P78083 and previous config saved to /var/cache/conftool/dbconfig/20250616-184731-fceratto.json [18:47:34] (03CR) 10Ssingh: [C:03+1] Revert "hiera: Add lvs1016 to high-traffic1" [puppet] - 10https://gerrit.wikimedia.org/r/1159549 (owner: 10BCornwall) [18:48:55] (03CR) 10BCornwall: [C:03+2] Revert "hiera: Add lvs1016 to high-traffic1" [puppet] - 10https://gerrit.wikimedia.org/r/1159549 (owner: 10BCornwall) [18:50:07] (03CR) 10Kimberly Sarabia: "is there a ticket for this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159542 (owner: 10Bernard Wang) [18:50:27] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4001.ulsfo.wmnet with OS bookworm [18:50:43] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum4002 [puppet] - 10https://gerrit.wikimedia.org/r/1159491 (owner: 10Ssingh) [18:50:43] (03PS1) 10Dzahn: miscweb: delete miscweb::rsync profile [puppet] - 10https://gerrit.wikimedia.org/r/1159550 (https://phabricator.wikimedia.org/T397080) [18:53:17] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum4002.ulsfo.wmnet with OS bookworm [18:55:10] RESOLVED: [4x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:56:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T395241)', diff saved to https://phabricator.wikimedia.org/P78084 and previous config saved to /var/cache/conftool/dbconfig/20250616-185600-fceratto.json [18:58:10] FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:00:06] (03CR) 10Scott French: "Thanks, Effie! One lingering issue and optional simplification you should feel free to defer. Please feel free to merge without additional" [puppet] - 10https://gerrit.wikimedia.org/r/1156410 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [19:01:49] (03CR) 10Ssingh: [C:03+2] hiera: set do_ech to false for durum5001 [puppet] - 10https://gerrit.wikimedia.org/r/1159492 (owner: 10Ssingh) [19:02:31] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum5001.eqsin.wmnet with OS bookworm [19:07:14] (03CR) 10CDobbins: "Sounds good to me. I'll update the change." 
[dns] - 10https://gerrit.wikimedia.org/r/1153334 (owner: 10CDobbins) [19:08:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:08:28] RESOLVED: ProbeDown: Service sessionstore2004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:08:32] I did silence the above :P [19:09:46] (03PS11) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [19:11:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P78085 and previous config saved to /var/cache/conftool/dbconfig/20250616-191108-fceratto.json [19:11:09] (03CR) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [19:12:24] (03PS3) 10Ssingh: wikimedia-dns.org: remove TYPE65 record for check [dns] - 10https://gerrit.wikimedia.org/r/1156373 (https://phabricator.wikimedia.org/T205378) [19:12:32] (03CR) 10Ssingh: "Rebased, no code change." 
[dns] - 10https://gerrit.wikimedia.org/r/1156373 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:13:44] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage [19:14:11] (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: remove TYPE65 record for check [dns] - 10https://gerrit.wikimedia.org/r/1156373 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:14:23] !log sukhe@dns1004 START - running authdns-update [19:15:16] !log sukhe@dns1004 END - running authdns-update [19:16:45] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum4002.ulsfo.wmnet with reason: host reimage [19:19:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AndyRussG - https://phabricator.wikimedia.org/T397004#10920139 (10AndyRussG_volunteer) Thanks so much, @WMDE-leszek, @herron, @KFrancis, hugely appreciated. - I signed L3 with using this, my volunteer, account. (As you can see,... 
[19:26:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P78086 and previous config saved to /var/cache/conftool/dbconfig/20250616-192615-fceratto.json [19:34:16] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum4002.ulsfo.wmnet with OS bookworm [19:34:42] (03CR) 10Cwhite: [C:03+2] logstash: Reroute apache.access logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/1153279 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [19:35:21] (03PS1) 10Zabe: Stop setting wgRevisionSlotsCacheExpiry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159552 (https://phabricator.wikimedia.org/T183490) [19:41:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T395241)', diff saved to https://phabricator.wikimedia.org/P78087 and previous config saved to /var/cache/conftool/dbconfig/20250616-194123-fceratto.json [19:41:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2222.codfw.wmnet with reason: Maintenance [19:41:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2222 (T395241)', diff saved to https://phabricator.wikimedia.org/P78088 and previous config saved to /var/cache/conftool/dbconfig/20250616-194140-fceratto.json [19:42:34] (03PS1) 10Bvibber: Update chart-renderer in production to latest merge build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159556 (https://phabricator.wikimedia.org/T395165) [19:45:13] (03CR) 10Bvibber: [C:03+2] Update chart-renderer in production to latest merge build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159556 (https://phabricator.wikimedia.org/T395165) (owner: 10Bvibber) [19:45:40] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage [19:46:58] (03Merged) 10jenkins-bot: Update 
chart-renderer in production to latest merge build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159556 (https://phabricator.wikimedia.org/T395165) (owner: 10Bvibber) [19:49:24] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum5001.eqsin.wmnet with reason: host reimage [19:50:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T395241)', diff saved to https://phabricator.wikimedia.org/P78089 and previous config saved to /var/cache/conftool/dbconfig/20250616-195004-fceratto.json [19:50:29] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp70[02-16].magru.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [19:50:33] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [19:50:56] !log bvibber@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply [19:51:29] !log bvibber@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [19:51:39] !log bvibber@deploy1003 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [19:52:05] !log bvibber@deploy1003 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [19:52:13] !log bvibber@deploy1003 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [19:52:43] !log bvibber@deploy1003 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [19:59:29] (03PS1) 10Bvibber: Quiet test rollout of Lua transforms for Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159559 (https://phabricator.wikimedia.org/T388616) [19:59:31] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 1h). 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T2000). nyaa~ [20:00:05] Nemoralis, arlolra, EggRoll97, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:03:40] \o [20:03:53] here [20:04:16] here [20:05:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P78090 and previous config saved to /var/cache/conftool/dbconfig/20250616-200512-fceratto.json [20:05:27] I can deploy if no one else is here [20:06:01] I can handle my own deploy and am also willing to do for others [20:07:26] Alright, feel free to do it then [20:07:51] Is Nemoralis around? [20:08:09] If not, I'll get started with mine [20:08:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156515 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [20:09:07] !log restarting pybal on lvs1020 [20:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:31] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1020 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [20:09:46] (03Merged) 10jenkins-bot: Disable VipsScaler in group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1156515 (https://phabricator.wikimedia.org/T290759) (owner: 10Arlolra) [20:09:47] !log restarting pybal on lvs1017 [20:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:01] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1156515|Disable VipsScaler in group2 (T290759)]] [20:10:06] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [20:11:59] !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1156515|Disable VipsScaler in group2 (T290759)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:12:01] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum5001.eqsin.wmnet with OS bookworm [20:12:47] Looks like I broke something in the SpiderPig job log viewer. Looking into it. [20:13:00] Thanks [20:13:06] !log arlolra@deploy1003 arlolra: Continuing with sync [20:13:10] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:13:28] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:49] ^ last alert, artifact from earlier. should resolve soon. 
the BFD one that is [20:17:40] !log T395855 Stopped opensearch units on `cirrussearch205[7,8]` (row B decom hosts) [20:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:44] T395855: Decommission cirrussearch2055-2060 - https://phabricator.wikimedia.org/T395855 [20:18:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:19:54] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1156515|Disable VipsScaler in group2 (T290759)]] (duration: 09m 53s) [20:20:00] T290759: Undeploy VipsScaler from Wikimedia wikis - https://phabricator.wikimedia.org/T290759 [20:20:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222', diff saved to https://phabricator.wikimedia.org/P78091 and previous config saved to /var/cache/conftool/dbconfig/20250616-202019-fceratto.json [20:21:26] EggRoll97: You're up next [20:21:31] Yep [20:26:20] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon2006-dev.codfw.wmnet with OS bullseye [20:27:48] EggRoll97: Is there any sort of review process that needs to happen for ipinfo-view-full to be granted? 
[20:28:33] arlolra: Shouldn't be, afaik it's redundant without checkuser-temporary-account [20:29:34] And given arbcom is elected and ipinfo-view-full is a subset of admin I didn't see any problem with it at the time, only checkuser-temporary-account is specifically blocked from being added to other groups in Limits to config changes [20:29:35] Just that none of the other arbcom have that [20:32:00] I think the other arbcoms were created before ipinfo-view-full was necessarily relevant to usergroups [20:32:18] Arbcom groups*, sorry [20:32:53] It looks like zhwiki opted to deploy without it to start [20:32:53] https://phabricator.wikimedia.org/T374455#10136177 [20:33:13] I see, will do [20:33:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920481 (10VRiley-WMF) @BCornwall Hey there, thanks for letting us know. I did replace the cable and it seems to respond to ping. Would you be able to check again? It seems to... [20:34:31] EggRoll97: Also, https://phabricator.wikimedia.org/T374528 [20:34:42] oathauth-enable might be unnecessary [20:35:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2222 (T395241)', diff saved to https://phabricator.wikimedia.org/P78092 and previous config saved to /var/cache/conftool/dbconfig/20250616-203526-fceratto.json [20:36:57] arlolra: oathauth-enable being removed in T374528 only appears to affect itwiki and newiki, and arbcom wouldn't necessarily be a privileged group (especially if the arbcom members arent in the sysop group or similar) may not be redundant yet [20:36:58] T374528: Remove redundant oathauth-enable flag - https://phabricator.wikimedia.org/T374528 [20:38:22] Ok [20:38:40] Do you want me to amend the patch or will you push PS2? 
[20:39:50] (03PS12) 10CDobbins: add rest of South America (except Falkland Islands) to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/1153334 [20:39:56] No preference either way, it may take me a couple minutes to push PS2 unless you amend it [20:41:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920518 (10BCornwall) @VRiley-WMF Thanks for the quick response! I've not been able to ping the mgmt interface (10.65.0.75) from lvs1017, cumin1002, and cumin2002. It's timin... [20:41:19] (03PS2) 10Arlolra: Add arbcom group to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [20:41:31] EggRoll97: done [20:42:07] (03CR) 10EggRoll97: [C:03+1] Add arbcom group to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [20:42:21] arlolra: thanks [20:42:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [20:43:23] (03Merged) 10jenkins-bot: Add arbcom group to ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155945 (https://phabricator.wikimedia.org/T396668) (owner: 10EggRoll97) [20:43:35] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1155945|Add arbcom group to ukwiki (T396668)]] [20:43:40] T396668: Add user group arbcom to ukwiki - https://phabricator.wikimedia.org/T396668 [20:44:12] (03PS6) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 [20:44:37] (03CR) 10CI reject: [V:04-1] varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 (owner: 10CDobbins) [20:45:07] !log jclark@cumin1002 START - Cookbook 
sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:45:30] !log arlolra@deploy1003 arlolra, eggroll97: Backport for [[gerrit:1155945|Add arbcom group to ukwiki (T396668)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:45:32] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage [20:45:54] (03PS7) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085 [20:46:26] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:46:33] EggRoll97: Is there anything you want to check on the test servers? [20:46:49] I assume not [20:46:57] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:47:05] arlolra: be anything, just a usergroup addition [20:47:09] shouldnt be anything* [20:47:14] !log arlolra@deploy1003 arlolra, eggroll97: Continuing with sync [20:48:11] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:48:30] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:48:48] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage [20:49:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 17 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1153626 (https://phabricator.wikimedia.org/T395204) (owner: 10Gergő Tisza) [20:49:29] (03PS1) 10Btullis: Add our legacy archiva instance to kubernetes external_services [puppet] - 10https://gerrit.wikimedia.org/r/1159563 (https://phabricator.wikimedia.org/T392244) [20:49:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:52:05] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5971/co" [puppet] - 10https://gerrit.wikimedia.org/r/1159563 (https://phabricator.wikimedia.org/T392244) (owner: 10Btullis) [20:52:31] (03PS1) 10Dzahn: site: move legacy miscweb VMs to insetup-role [puppet] - 10https://gerrit.wikimedia.org/r/1159564 (https://phabricator.wikimedia.org/T397080) [20:54:12] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155945|Add arbcom group to ukwiki (T396668)]] (duration: 10m 36s) [20:54:16] T396668: Add user group arbcom to ukwiki - https://phabricator.wikimedia.org/T396668 [20:54:26] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:55:05] (03CR) 10Aleksandar Mastilovic: [V:03+1 C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1159563 (https://phabricator.wikimedia.org/T392244) (owner: 10Btullis) [20:55:25] ebernhardson: You're up [20:55:51] kk [20:56:24] Do you want me to do it? 
[20:56:36] sure [20:56:39] PROBLEM - Host thanos-be2006 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:45] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:56:57] RECOVERY - Host thanos-be2006 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms [20:57:08] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:57:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920555 (10VRiley-WMF) Okay, I found the problem (I pinged the incorrect IP) I set the IP address on the iDRAC to the one listed in netbox. I just tested out the ping and it s... [20:57:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159520 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [20:59:00] (03Merged) 10jenkins-bot: Turn off glent m1 AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159520 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [20:59:12] (03PS6) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [20:59:14] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1159520|Turn off glent m1 AB test (T262612)]] [20:59:18] T262612: Run an A/B test using suggestions generated using glent Method 1 - https://phabricator.wikimedia.org/T262612 [21:00:04] Reedy, sbassett, Maryum, and manfredi: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T2100). 
[21:00:30] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-worker1185/1186 - jclark@cumin1002" [21:00:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-worker1185/1186 - jclark@cumin1002" [21:00:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:01:20] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:01:21] !log arlolra@deploy1003 ebernhardson, arlolra: Backport for [[gerrit:1159520|Turn off glent m1 AB test (T262612)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:02:00] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:02:00] arlolra: looks reasonable on the test servers [21:02:08] Thanks [21:02:19] !log arlolra@deploy1003 ebernhardson, arlolra: Continuing with sync [21:02:29] (03PS1) 10Gergő Tisza: Revert "Add scrambled: password class" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360) [21:02:51] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1185 [21:02:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1185 [21:03:05] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 [21:03:05] (03CR) 10Gergő Tisza: [C:04-2] "need to wait a week for I8ea7234cf9b470bd180edfaedec31a3220a81bb4 to be fully deployed" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1159568 (https://phabricator.wikimedia.org/T395360) (owner: 10Gergő Tisza) [21:03:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 [21:03:37] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:03:46] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T397099 (10DerHexer) 03NEW [21:04:45] 06SRE, 10LDAP-Access-Requests: Grant Access to for DerHexer - https://phabricator.wikimedia.org/T397099#10920591 (10DerHexer) [21:04:56] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:05:06] !log aqu@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:05:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:05:48] 06SRE, 10LDAP-Access-Requests: Grant Access to for DerHexer - https://phabricator.wikimedia.org/T397099#10920594 (10Astinson) DerHexer is a long-trusted Steward that wants access to some of the data that is available through Central Notice, he has an existing NDA with the Foundation. [21:06:06] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2006-dev.codfw.wmnet with OS bullseye [21:09:07] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159520|Turn off glent m1 AB test (T262612)]] (duration: 09m 53s) [21:09:11] T262612: Run an A/B test using suggestions generated using glent Method 1 - https://phabricator.wikimedia.org/T262612 [21:10:01] anything still need backports or am i free to sneak in a config patch? [21:10:51] I think we're done. 
We've bled into the security deployment window, not sure if that's needed today or not though [21:10:51] Hey all - would like to deploy 2 security patches during the window. Has the backport window wrapped up? [21:11:21] PROBLEM - Host thanos-be2006 is DOWN: PING CRITICAL - Packet loss = 100% [21:12:16] sbassett, bvibber: I will leave it to you to sort out [21:13:02] go for it [21:13:18] mine's an experimental feature we're rolling out wider for testing, no rush on it :) [21:13:37] (03PS2) 10Btullis: Add our legacy archiva instance to kubernetes external_services [puppet] - 10https://gerrit.wikimedia.org/r/1159563 (https://phabricator.wikimedia.org/T392244) [21:15:50] (03PS1) 10Btullis: Allow blunderbuss to contact archiva [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159579 (https://phabricator.wikimedia.org/T392244) [21:15:56] bvibber: sounds good, thanks. this should really only take 20 mins or so, so there’d be time after I’d be happy to turn back over to you. [21:16:12] awesome :) [21:16:47] (03PS1) 10Tchanders: Revert "Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159580 [21:17:09] (03PS2) 10Tchanders: Revert "Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159580 (https://phabricator.wikimedia.org/T376315) [21:18:51] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10920618 (10Astinson) [21:20:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:20:24] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:20:41] PROBLEM - mailman archives on lists1004 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:21:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54083 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:21:32] (03PS7) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [21:21:58] (03CR) 10CI reject: [V:04-1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [21:26:46] (03PS8) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [21:27:11] (03CR) 10CI reject: [V:04-1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [21:28:20] (03PS9) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [21:28:45] (03CR) 10CI reject: [V:04-1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [21:30:45] (03PS10) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [21:31:10] (03CR) 10CI reject: [V:04-1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [21:32:36] !log Deployed security fix for T396946 [21:32:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:43] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1159579 (https://phabricator.wikimedia.org/T392244) (owner: 10Btullis) [21:37:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:37:54] (03PS11) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [21:38:54] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:39:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:40:04] (03CR) 10CI reject: [V:04-1] centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [21:41:52] bvibber: ok, all done. feel free to use the rest of the sec deployment window. 
[21:42:17] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:42:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:45:10] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:45:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:46:34] tx [21:46:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159559 (https://phabricator.wikimedia.org/T388616) (owner: 10Bvibber) [21:47:02] (03PS12) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [21:47:34] (03Merged) 10jenkins-bot: Quiet test rollout of Lua transforms for Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159559 (https://phabricator.wikimedia.org/T388616) (owner: 10Bvibber) [21:51:12] anybody know what's failing with the scap? [21:51:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye [21:51:31] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10920716 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2006.codfw.wmnet with OS bu... 
[21:52:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10920717 (10BCornwall) Thank you! [21:53:47] (03PS1) 10BCornwall: Revert^2 "hiera: Add lvs1016 to high-traffic1" [puppet] - 10https://gerrit.wikimedia.org/r/1159592 [21:54:37] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1159559|Quiet test rollout of Lua transforms for Charts (T388616)]] [21:54:41] T388616: Expose Data: Lua filter interface to Charts via the .chart format setup - https://phabricator.wikimedia.org/T388616 [21:55:39] (we found it) [21:56:34] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1159559|Quiet test rollout of Lua transforms for Charts (T388616)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:57:31] !log bvibber@deploy1003 bvibber: Continuing with sync [21:58:05] RECOVERY - Host thanos-be2006 is UP: PING WARNING - Packet loss = 71%, RTA = 37.23 ms [22:00:43] (03PS13) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [22:04:17] (03PS2) 10Ryan Kemper: wdqs: fork SLOs for wdqs-main and wdqs-scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) [22:04:56] (03CR) 10Ryan Kemper: wdqs: fork SLOs for wdqs-main and wdqs-scholarly (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [22:05:00] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159559|Quiet test rollout of Lua transforms for Charts (T388616)]] (duration: 10m 22s) [22:05:05] T388616: Expose Data: Lua filter interface to Charts via the .chart format setup - https://phabricator.wikimedia.org/T388616 [22:06:06] jhancock@cumin2002 reimage (PID 676792) is awaiting input [22:07:37] 
(03PS3) 10Ryan Kemper: wdqs: fork SLOs for wdqs-main and wdqs-scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) [22:08:43] all done [22:09:13] (03CR) 10Ryan Kemper: "Also added the comment to explain the queries in plain english, hope it makes some sort of sense :P" [puppet] - 10https://gerrit.wikimedia.org/r/1155335 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [22:11:54] (03PS14) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [22:12:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.448s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:14:56] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1185 [22:15:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1185 [22:15:10] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 [22:15:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 [22:16:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:17:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.272s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:20:33] (03CR) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [22:21:34] (03CR) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [22:34:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:35:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [22:35:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10920824 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... [22:42:32] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA LDAP for DerHexer - https://phabricator.wikimedia.org/T397099#10920833 (10IAckerman-WMF) I support DerHexer's NDA LDAP access so they can evaluate their fundraising banner performance. 
[22:48:55] (03PS1) 10Arlolra: Undeploy VipsScaler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159599 (https://phabricator.wikimedia.org/T290759) [22:51:34] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye [22:51:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10920858 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... [22:55:41] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye [22:55:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10920878 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1185.eqiad.wmnet with OS b... 
[23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250616T2300) [23:00:10] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [23:05:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (95.01%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [23:10:35] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1185.eqiad.wmnet with reason: host reimage [23:14:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1185.eqiad.wmnet with reason: host reimage [23:31:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye [23:31:26] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:31:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:31:56] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10920931 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host thanos-be2006.codfw.wmnet with OS bullse... 
[23:31:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1185.eqiad.wmnet with OS bullseye [23:32:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10920932 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1185.eqiad.wmnet with OS bulls... [23:36:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10920933 (10Jhancock.wm) @MatthewVernon I have tried cryptographically wiping the drives but I still can't get a puppet run to complete on these two... [23:38:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1159608 [23:38:31] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1159608 (owner: 10TrainBranchBot) [23:47:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:50:14] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1159608 (owner: 10TrainBranchBot) [23:56:56] (03PS1) 10Bartosz Dziewoński: Simplify $wgContactConfig required checkboxes validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159610