[00:08:38] FIRING: [3x] GnmiTargetDown: cr2-eqord is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[00:10:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:28:19] !log rzl@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/sophroid: apply
[00:28:33] !log rzl@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/sophroid: apply
[00:29:22] !log rzl@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/sophroid: apply
[00:29:36] !log rzl@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/sophroid: apply
[00:34:16] PROBLEM - MegaRAID on an-worker1148 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:40:32] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1237364
[00:40:32] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1237364 (owner: TrainBranchBot)
[00:53:53] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1237364 (owner: TrainBranchBot)
[00:59:17] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[01:10:49] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1237367
[01:10:49] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1237367 (owner: TrainBranchBot)
[01:12:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:09] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1237367 (owner: TrainBranchBot)
[02:00:54] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:13:55] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 00s)
[02:33:20] ryankemper@cumin2002 reboot-workers (PID 1893527) is awaiting input
[02:34:17] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:34] PROBLEM - SSH on an-worker1175 is CRITICAL: connect to address 10.64.53.17 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:46:16] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 11%, RTA = 5399.23 ms
[02:48:54] PROBLEM - Host an-worker1175 is DOWN: PING CRITICAL - Packet loss = 100%
[02:48:54] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[02:49:16] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:50:12] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:53:56] RECOVERY - Host an-worker1175 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[03:02:53] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026.01.23 - 2026.02.13): Follow-up: Degraded Disk Not Yet Added to RAID (an-worker1175, an-worker1199) - https://phabricator.wikimedia.org/T416166#11590231 (RKemper) Just took care of `an-worker1175`: ` root@an-worker1175:~# perccli64 /c0 add vd r0 dri...
[03:03:34] RECOVERY - SSH on an-worker1175 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:04:06] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1175 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[03:04:12] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026.01.23 - 2026.02.13): Follow-up: Degraded Disk Not Yet Added to RAID (an-worker1175, an-worker1199) - https://phabricator.wikimedia.org/T416166#11590232 (RKemper) `an-worker1199` still remains to be done; looks like it's waiting for another drive swap.
[03:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[03:29:17] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:08:38] FIRING: [3x] GnmiTargetDown: cr2-eqord is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[04:10:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:58] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster
[04:19:08] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster
[04:40:09] (PS1) C. Scott Ananian: Disable magic links on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1237373 (https://phabricator.wikimedia.org/T145604)
[04:59:17] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[05:12:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:20:07] PROBLEM - Host an-worker1148 is DOWN: PING CRITICAL - Packet loss = 100%
[05:52:37] ryankemper@cumin2002 reboot-workers (PID 2027564) is awaiting input
[05:56:59] RECOVERY - Host an-worker1148 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[06:04:15] ops-eqiad, SRE, DC-Ops, Essential-Work: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11590292 (RKemper) Upon rebooting the host, we're back to the same issue: ` F2 = System Setup F10 = Lifecycle Controller F11 = Boot Manager F12 = P...
[06:04:17] RECOVERY - MegaRAID on an-worker1148 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:34:17] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:48:18] (PS1) Marostegui: db1155: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1237380
[06:48:54] (CR) Marostegui: [C:+2] db1155: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1237380 (owner: Marostegui)
[06:49:17] RESOLVED: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:50:56] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260206T0700)
[07:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[07:15:59] (CR) Arnaudb: gerrit: allow `replication` when in readonly mode (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1237259 (owner: Hashar)
[07:29:17] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260206T0800)
[08:05:01] (PS1) Muehlenhoff: Use Bird 2.18 for all cephosd servers [puppet] - https://gerrit.wikimedia.org/r/1237442 (https://phabricator.wikimedia.org/T413740)
[08:08:17] (CR) Muehlenhoff: [C:+2] Use Bird 2.18 for all cephosd servers [puppet] - https://gerrit.wikimedia.org/r/1237442 (https://phabricator.wikimedia.org/T413740) (owner: Muehlenhoff)
[08:08:38] FIRING: [3x] GnmiTargetDown: cr2-eqord is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[08:10:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:39:56] (CR) Dpogorzelski: [V:+2 C:+2] ml: add vLLM 0.14 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1237060 (https://phabricator.wikimedia.org/T415627) (owner: Kevin Bazira)
[08:40:40] (CR) Muehlenhoff: [C:+1] "Looks good, thanks" [puppet] - https://gerrit.wikimedia.org/r/1234358 (https://phabricator.wikimedia.org/T398214) (owner: Majavah)
[08:59:17] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[09:12:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:25:44] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host pki1002.eqiad.wmnet with OS trixie
[09:33:10] (PS4) IKhitron: Remove the wgGlobalWatchlistWikibaseSite variable values [mediawiki-config] - https://gerrit.wikimedia.org/r/1235499 (https://phabricator.wikimedia.org/T415440)
[09:33:10] (CR) IKhitron: "Thank you. Sorry I missed your answer earlier." [mediawiki-config] - https://gerrit.wikimedia.org/r/1235499 (https://phabricator.wikimedia.org/T415440) (owner: IKhitron)
[09:42:19] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pki1002.eqiad.wmnet with reason: host reimage
[09:47:00] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:47:44] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki1002.eqiad.wmnet with reason: host reimage
[09:48:26] ²
[09:51:31] SRE, SRE-Access-Requests, LDAP-Access-Requests, Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11590594 (elukey) Open→Resolved a:elukey Closing since I don't see more actions left, please re-open if needed! Thanks to all :)
[09:56:19] (PS1) JavierMonton: component: New Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1237451 (https://phabricator.wikimedia.org/T360794)
[09:57:00] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:58:58] (CR) JavierMonton: component: New Stream (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1237451 (https://phabricator.wikimedia.org/T360794) (owner: JavierMonton)
[10:05:04] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki1002.eqiad.wmnet with OS trixie
[10:06:38] (PS1) Muehlenhoff: Bullseye tracking updates [puppet] - https://gerrit.wikimedia.org/r/1237454
[10:07:12] (PS5) Daniel Kinzler: rest gateway: include service values.yaml when testing [deployment-charts] - https://gerrit.wikimedia.org/r/1229119
[10:09:06] (CR) Muehlenhoff: [C:+2] Bullseye tracking updates [puppet] - https://gerrit.wikimedia.org/r/1237454 (owner: Muehlenhoff)
[10:09:35] (CR) JMeybohm: k8s-staging: Switch to IPIP mode (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[10:12:46] (CR) Majavah: [C:+2] openldap: offboard-user: Query list of Cloud VPS root keys [puppet] - https://gerrit.wikimedia.org/r/1234358 (https://phabricator.wikimedia.org/T398214) (owner: Majavah)
[10:21:13] SRE, SRE-Access-Requests, LDAP-Access-Requests, Data-Platform-SRE (2026.01.23 - 2026.02.13): Access to systems owned by data platform engineering team for Jerry Wang - https://phabricator.wikimedia.org/T416191#11590673 (Gehel) Invalid→Open p:Triage→High
[10:26:23] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1006.eqiad.wmnet with OS bookworm
[10:26:51] !log ayounsi@cumin1003 START - Cookbook sre.hosts.move-vlan for host aux-k8s-worker1006
[10:27:07] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[10:27:23] (CR) Joal: [C:+1] "Minimal comments" [mediawiki-config] - https://gerrit.wikimedia.org/r/1237451 (https://phabricator.wikimedia.org/T360794) (owner: JavierMonton)
[10:28:16] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[10:32:08] (CR) JMeybohm: "You sure? From PCC it seems that the package is not installed and ferm rules are not created (absent)." [puppet] - https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[10:32:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:32:54] ayounsi@cumin1003 reimage (PID 3682647) is awaiting input
[10:33:08] SRE: Upgrade Kafka to version 3.5 - https://phabricator.wikimedia.org/T416669 (elukey) NEW
[10:33:55] (PS1) Muehlenhoff: Remove puppet/config-master records pointint to puppetmaster2001 [dns] - https://gerrit.wikimedia.org/r/1237463 (https://phabricator.wikimedia.org/T416606)
[10:35:16] (CR) Muehlenhoff: [C:+2] Remove puppetmaster2001 from active Puppet 5 servers [puppet] - https://gerrit.wikimedia.org/r/1230332 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[10:35:48] (PS1) Daniel Kinzler: re-apply: rest gateway: define new limits [deployment-charts] - https://gerrit.wikimedia.org/r/1237464
[10:36:20] SRE: Package Confluent Platform 7.5.x / Kafka 3.5 - https://phabricator.wikimedia.org/T416670 (elukey) NEW
[10:37:43] (PS1) Ayounsi: Remove LibreNMS as syslog target [homer/public] - https://gerrit.wikimedia.org/r/1237465 (https://phabricator.wikimedia.org/T415270)
[10:38:17] (CR) Cathal Mooney: [C:+1] Remove LibreNMS as syslog target [homer/public] - https://gerrit.wikimedia.org/r/1237465 (https://phabricator.wikimedia.org/T415270) (owner: Ayounsi)
[10:39:51] (PS1) Muehlenhoff: Remove puppet references for puppetmaster2001 [puppet] - https://gerrit.wikimedia.org/r/1237466 (https://phabricator.wikimedia.org/T416606)
[10:41:20] (CR) Kamila Součková: [C:+2] rest gateway: include service values.yaml when testing [deployment-charts] - https://gerrit.wikimedia.org/r/1229119 (owner: Daniel Kinzler)
[10:43:15] (Merged) jenkins-bot: rest gateway: include service values.yaml when testing [deployment-charts] - https://gerrit.wikimedia.org/r/1229119 (owner: Daniel Kinzler)
[10:44:33] SRE: Test and upgrade Kafka clusters to Openjdk 17 - https://phabricator.wikimedia.org/T416674 (elukey) NEW
[10:46:30] (CR) Muehlenhoff: [C:+2] Remove puppet references for puppetmaster2001 [puppet] - https://gerrit.wikimedia.org/r/1237466 (https://phabricator.wikimedia.org/T416606) (owner: Muehlenhoff)
[10:53:58] PROBLEM - Kafka Broker Server on kafka-test1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[10:54:04] PROBLEM - Kafka broker TLS certificate validity on kafka-test1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[10:54:53] (PS2) Daniel Kinzler: re-apply: rest gateway: define new limits [deployment-charts] - https://gerrit.wikimedia.org/r/1237464
[10:55:27] kafka-test1006 is me
[10:56:10] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[10:59:26] (CR) Vgutierrez: "that's because this CR is missing `profile::lvs::realserver::ipip::enabled: true`" [puppet] - https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[11:00:34] SRE, Datasets-General-or-Unknown, tools-infrastructure-team: Move internal dumps NFS clients to clouddumps1001 - https://phabricator.wikimedia.org/T416677 (taavi) NEW
[11:01:51] ops-codfw, SRE, DC-Ops, decommission-hardware, Patch-For-Review: decommission puppetmaster2001 - https://phabricator.wikimedia.org/T416606#11590866 (MoritzMuehlenhoff)
[11:03:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[11:05:12] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[11:05:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:07:14] (PS1) Vgutierrez: lvs: Allow disabling TCP MSS clamping for IPIP realservers [puppet] - https://gerrit.wikimedia.org/r/1237467 (https://phabricator.wikimedia.org/T352956)
[11:08:58] RECOVERY - Kafka Broker Server on kafka-test1006 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[11:09:04] RECOVERY - Kafka broker TLS certificate validity on kafka-test1006 is OK: SSL OK - Certificate kafka-test1006.eqiad.wmnet valid until 2026-08-23 08:32:00 +0000 (expires in 197 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[11:09:18] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1237467 (https://phabricator.wikimedia.org/T352956) (owner: Vgutierrez)
[11:10:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:10:54] cmooney@cumin1003 netbox (PID 3686236) is awaiting input
[11:11:52] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update entries for private1-d8-eqiad gateway IPs - cmooney@cumin1003"
[11:11:57] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update entries for private1-d8-eqiad gateway IPs - cmooney@cumin1003"
[11:11:57] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:13:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[11:13:50] (CR) Vgutierrez: "This CR needs to be rebased on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237467 and set the following hiera keys:" [puppet] - https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) (owner: Alexandros Kosiaris)
[11:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[11:16:56] SRE: Test and upgrade Kafka clusters to Openjdk 17 - https://phabricator.wikimedia.org/T416674#11590914 (elukey) On kafka-test1006 I manually installed openjdk/jre 17 and modified JAVA_HOME in /etc/default/kafka. I found the following issues before being able to start the broker: `/usr/bin/kafka-run-class`:...
[11:17:34] (CR) Fabfur: [C:+1] "logic looks ok to me" [puppet] - https://gerrit.wikimedia.org/r/1237467 (https://phabricator.wikimedia.org/T352956) (owner: Vgutierrez)
[11:18:02] (PS1) Trueg: rdf-streaming-updater: testing deployment with flink 1.20.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1237468 (https://phabricator.wikimedia.org/T414430)
[11:18:15] SRE: Upgrade Kafka to version 3.5 - https://phabricator.wikimedia.org/T416669#11590918 (elukey)
[11:29:17] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:32:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:53:42] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade.
[11:56:29] SRE, Bitu, Infrastructure-Foundations: Add a "under maintenance" mode for Bitu - https://phabricator.wikimedia.org/T416685 (MoritzMuehlenhoff) NEW
[11:56:51] SRE, Bitu, Infrastructure-Foundations: Add a "under maintenance" mode for Bitu - https://phabricator.wikimedia.org/T416685#11591028 (MoritzMuehlenhoff) p:Triage→Medium
[12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260206T0800)
[12:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for GitLab version upgrades . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260206T1200).
[12:04:08] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab
[12:04:52] (CR) Muehlenhoff: [C:+2] Add Cumin alias for staging maps node(s) [puppet] - https://gerrit.wikimedia.org/r/1235810 (owner: Muehlenhoff)
[12:05:57] (PS2) Muehlenhoff: pcc_update_facts: Rename variables [puppet] - https://gerrit.wikimedia.org/r/1227734 (https://phabricator.wikimedia.org/T365798)
[12:07:00] (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1236250 (https://phabricator.wikimedia.org/T412420) (owner: Slyngshede)
[12:08:01] (CR) Slyngshede: [C:+2] LDAP: Use the escaping mechanism provided by LDAP3 [software/bitu] - https://gerrit.wikimedia.org/r/1236250 (https://phabricator.wikimedia.org/T412420) (owner: Slyngshede)
[12:08:38] FIRING: [3x] GnmiTargetDown: cr2-eqord is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[12:10:45] (Merged) jenkins-bot: LDAP: Use the escaping mechanism provided by LDAP3 [software/bitu] - https://gerrit.wikimedia.org/r/1236250 (https://phabricator.wikimedia.org/T412420) (owner: Slyngshede)
[12:11:21] (PS1) Effie Mouzeli: mw-parsoid: repurpose for parsoidtest use [deployment-charts] - https://gerrit.wikimedia.org/r/1237472 (https://phabricator.wikimedia.org/T386246)
[12:12:44] (CR) Lerickson: [C:+1] rdf-streaming-updater: testing deployment with flink 1.20.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1237468 (https://phabricator.wikimedia.org/T414430) (owner: Trueg)
[12:21:10] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:21:14] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:22:10] FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and fe80::8618:88ff:fe0d:dc64 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:22:10] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:22:14] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:22:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:23:34] SRE, ServiceOps new, Kubernetes: New WMF docker registry credentials - https://phabricator.wikimedia.org/T412524#11591111 (DPogorzelski-WMF) Open→Resolved a:DPogorzelski-WMF
[12:27:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-drmrs and fe80::8618:88ff:fe0d:dc64 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:27:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:28:01] (CR) Muehlenhoff: "Looks good, two nits inline." [software/bitu] - https://gerrit.wikimedia.org/r/1236298 (https://phabricator.wikimedia.org/T416152) (owner: Slyngshede)
[12:29:31] (CR) DCausse: [C:+2] rdf-streaming-updater: testing deployment with flink 1.20.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1237468 (https://phabricator.wikimedia.org/T414430) (owner: Trueg)
[12:31:24] (Merged) jenkins-bot: rdf-streaming-updater: testing deployment with flink 1.20.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1237468 (https://phabricator.wikimedia.org/T414430) (owner: Trueg)
[12:31:30] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11591130 (cmooney) @VRiley-WMF great work on these moves yesterday, I don't think it could have gone much smoother tbh :) In terms of the cable label...
[12:33:28] (PS1) Slyngshede: Permissions: Format HTML emails [software/bitu] - https://gerrit.wikimedia.org/r/1237473 (https://phabricator.wikimedia.org/T416565)
[12:40:04] (PS1) Trueg: rdf-streaming-updater: fixed docker img tag [deployment-charts] - https://gerrit.wikimedia.org/r/1237474 (https://phabricator.wikimedia.org/T414430)
[12:40:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[12:40:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ...
[12:40:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:41:01] (CR) DCausse: [C:+2] rdf-streaming-updater: fixed docker img tag [deployment-charts] - https://gerrit.wikimedia.org/r/1237474 (https://phabricator.wikimedia.org/T414430) (owner: Trueg)
[12:42:36] (PS1) Jelto: gitlab: move mid-day backup and resture out of maintenance window [puppet] - https://gerrit.wikimedia.org/r/1237475 (https://phabricator.wikimedia.org/T416687)
[12:42:51] (Merged) jenkins-bot: rdf-streaming-updater: fixed docker img tag [deployment-charts] - https://gerrit.wikimedia.org/r/1237474 (https://phabricator.wikimedia.org/T414430) (owner: Trueg)
[12:43:08] (PS1) Muehlenhoff: Remove the Puppet 5 CA cert from the cert bundle [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/1237476 (https://phabricator.wikimedia.org/T415255)
[12:43:30] (PS2) Jelto: gitlab: move mid-day backup out of maintenance window [puppet] - https://gerrit.wikimedia.org/r/1237475 (https://phabricator.wikimedia.org/T416687)
[12:45:11] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7999/co" [puppet] - https://gerrit.wikimedia.org/r/1237475 (https://phabricator.wikimedia.org/T416687) (owner: Jelto)
[12:46:52] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device fasw2-e16a-eqiad
[12:47:02] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-e16a-eqiad
[12:47:10] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device fasw2-e16b-eqiad
[12:47:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-e16b-eqiad
[12:47:29] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device fasw2-e15a-eqiad
[12:47:40] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-e15a-eqiad
[12:47:48] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device fasw2-e15b-eqiad
[12:47:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-e15b-eqiad
[12:48:31] (CR) Arnaudb: [C:+1] gitlab: move mid-day backup out of maintenance window [puppet] - https://gerrit.wikimedia.org/r/1237475 (https://phabricator.wikimedia.org/T416687) (owner: Jelto)
[12:48:37] (PS2) Slyngshede: Permission: command for expiring permission requests [software/bitu] - https://gerrit.wikimedia.org/r/1236298 (https://phabricator.wikimedia.org/T416152)
[12:49:58] !log trueg@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[12:50:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[12:50:44] Deployment mw-web.eqiad.main in mw-web at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.eqiad.main - ...
[12:50:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [12:50:51] !log trueg@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [12:52:10] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-eqord [12:52:26] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqord [12:53:23] FIRING: [3x] GnmiTargetDown: cr2-eqord is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [12:54:44] (03PS1) 10Trueg: rdf-streaming-updater: fixed ns on new docker img [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237481 (https://phabricator.wikimedia.org/T414430) [12:55:40] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: fixed ns on new docker img [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237481 (https://phabricator.wikimedia.org/T414430) (owner: 10Trueg) [12:57:30] (03Merged) 10jenkins-bot: rdf-streaming-updater: fixed ns on new docker img [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237481 (https://phabricator.wikimedia.org/T414430) (owner: 10Trueg) [12:58:21] !log trueg@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [12:58:23] RESOLVED: [3x] GnmiTargetDown: cr2-eqord is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [12:58:31] !log trueg@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [12:59:17] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [13:03:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1236298 (https://phabricator.wikimedia.org/T416152) (owner: 10Slyngshede) [13:04:48] (03CR) 10Slyngshede: [C:03+2] Permission: command for expiring permission requests (033 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1236298 (https://phabricator.wikimedia.org/T416152) (owner: 10Slyngshede) [13:06:10] (03PS1) 10Muehlenhoff: Mention expiry of permission requests in help text [software/bitu] - 10https://gerrit.wikimedia.org/r/1237483 (https://phabricator.wikimedia.org/T416152) [13:07:12] (03Merged) 10jenkins-bot: Permission: command for expiring permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1236298 (https://phabricator.wikimedia.org/T416152) (owner: 10Slyngshede) [13:11:24] (03PS1) 10Trueg: rdf-streaming-updater: fixed typo in docker img version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237484 (https://phabricator.wikimedia.org/T414430) [13:12:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [13:12:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply [13:15:32] (03CR) 10Gmodena: [C:03+2] "LGTM. Version matches the latest image on registry." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1237484 (https://phabricator.wikimedia.org/T414430) (owner: 10Trueg) [13:17:08] 06SRE, 06Infrastructure-Foundations, 10netops: cr2-codfw alarm: FPC 5 power is unstable - https://phabricator.wikimedia.org/T416691 (10cmooney) 03NEW p:05Triage→03High [13:17:29] (03Merged) 10jenkins-bot: rdf-streaming-updater: fixed typo in docker img version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237484 (https://phabricator.wikimedia.org/T414430) (owner: 10Trueg) [13:19:34] !log trueg@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:19:44] !log trueg@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:20:09] 06SRE, 10DNS, 06Traffic-Icebox: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability - https://phabricator.wikimedia.org/T162818#11591270 (10MLechvien-WMF) @BBlack can this be closed? I don't see action for Service ops so removing... [13:21:35] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11591275 (10tappof) [13:22:00] (03CR) 10Tiziano Fogli: [C:03+2] admin: add user lerickson to analytics-privatedata, wdqs-{root,admins} [puppet] - 10https://gerrit.wikimedia.org/r/1237248 (https://phabricator.wikimedia.org/T415406) (owner: 10Elukey) [13:24:34] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11591282 (10tappof) 05In progress→03Resolved Access granted. [13:24:47] 06SRE, 10DNS, 06Traffic-Icebox: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) 
fail persistently after short windows of DNS unavailability - https://phabricator.wikimedia.org/T162818#11591286 (10MoritzMuehlenhoff) 05Open→03Invalid We can close this, there has been a decade of... [13:30:23] (03CR) 10Tiziano Fogli: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1233181 (https://phabricator.wikimedia.org/T415373) (owner: 10Gehel) [13:32:46] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1237483 (https://phabricator.wikimedia.org/T416152) (owner: 10Muehlenhoff) [13:33:11] (03Abandoned) 10Tiziano Fogli: admin(query-service): wdqs shell access for user lerickson [puppet] - 10https://gerrit.wikimedia.org/r/1233181 (https://phabricator.wikimedia.org/T415373) (owner: 10Gehel) [13:34:55] 06SRE, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Request: wdqs shell access for user lerickson - https://phabricator.wikimedia.org/T415373#11591328 (10tappof) 05Open→03Resolved a:03tappof Access granted via the patch provided in {T415406}. [13:38:02] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Nicholusmuwonge - https://phabricator.wikimedia.org/T416494#11591376 (10tappof) The NDA signature process can be followed on T416592#11588973. 
[13:39:28] (03CR) 10Elukey: [C:03+1] Remove the Puppet 5 CA cert from the cert bundle [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/1237476 (https://phabricator.wikimedia.org/T415255) (owner: 10Muehlenhoff) [13:51:19] (03PS2) 10JavierMonton: component: mediawiki.page_html_content_change.dev0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237451 (https://phabricator.wikimedia.org/T360794) [13:56:46] 06SRE, 10LDAP-Access-Requests: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358#11591466 (10tappof) [14:01:22] (03PS1) 10Hashar: TypeError: Unsupported operand types: array + null [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237489 (https://phabricator.wikimedia.org/T416619) [14:02:41] 06SRE, 10LDAP-Access-Requests: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358#11591496 (10tappof) Hello @Jacob_WMDE, I’m unable to see the production shell access account you mentioned in our list, nor was I able to find a task requesting it. Have you co... [14:03:50] 06SRE, 06Infrastructure-Foundations, 10netops: cr2-codfw alarm: FPC 5 power is unstable - https://phabricator.wikimedia.org/T416691#11591499 (10ayounsi) If we only do `request chassis fpc offline slot 5` it will come back up automatically. Not tested but I think we need to set `cr2-codfw# set chassis fpc 5 p... [14:05:57] (03CR) 10Muehlenhoff: [C:03+2] Mention expiry of permission requests in help text [software/bitu] - 10https://gerrit.wikimedia.org/r/1237483 (https://phabricator.wikimedia.org/T416152) (owner: 10Muehlenhoff) [14:07:55] (03PS2) 10Muehlenhoff: docker: Remove check for memory_cgroup [puppet] - 10https://gerrit.wikimedia.org/r/1223184 [14:09:42] moritzm: federico3: hello, I'd like to deploy a hotfix for mediawiki/core . It causes the user toolbar to not be updated on old skins and affects ruwiki. 
The patch is quite straightforward (prevents `array + null` which results in an error). [14:09:42] The patch is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1237489 , I can do the deployment [14:09:51] 06SRE: Upgrade Kafka to version 3.5 - https://phabricator.wikimedia.org/T416669#11591526 (10elukey) p:05Triage→03Medium [14:10:08] 06SRE: Package Confluent Platform 7.5.x / Kafka 3.5 - https://phabricator.wikimedia.org/T416670#11591527 (10elukey) p:05Triage→03Medium [14:10:18] 06SRE: Test and upgrade Kafka clusters to Openjdk 17 - https://phabricator.wikimedia.org/T416674#11591528 (10elukey) p:05Triage→03Medium [14:10:20] I forgot: I am pinging you as SREs on call cause friday deployment is subject to SRE approval per https://wikitech.wikimedia.org/wiki/Deployments/Emergencies ;) [14:11:22] 06SRE, 10LDAP-Access-Requests: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358#11591529 (10Dzahn) @tappof I don't think this is asking for shell access. It's just for LDAP groups "wmde" and "nda" which is usually the standard for WMDE staff. [14:11:53] looking [14:12:10] (03PS1) 10Muehlenhoff: analytics::cluster::client: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1237492 [14:12:11] (03PS1) 10Muehlenhoff: analytics::cluster::packages::common: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1237493 [14:12:19] thanks! :) the code is straightforward and got reviewed by MediaWiki [14:12:24] 06SRE: Test and upgrade Kafka clusters to Openjdk 17 - https://phabricator.wikimedia.org/T416674#11591532 (10elukey) We can just use https://github.com/apache/kafka/commit/c34f3d066ead40d8c0bca0cf92d4226d2d6416c6 :) [14:13:18] I guess the process is there in case there is a side effect on WikiKube, though that is hopefully unlikely. 
[14:14:33] (03PS1) 10Muehlenhoff: kafkatee::webrequest::ops: Remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/1237494 [14:14:54] hashar: o/ the usual question is - can it wait Monday? Is there anything that severely impacts users? [14:15:31] I'll quote Gergo from the task: [14:15:31] > the impact is your user toolbar not being updated when you get logged in automatically. That is not very significant, no need to block deployment because of it. SkinTemplate might error out in other situations though. [14:15:58] so "not very significant", I felt I could get rid of the error ;] [14:16:40] hashar: the patch looks ok, can you elaborate on what testing has been done? [14:17:33] 06SRE: Package Confluent Platform 7.5.x / Kafka 3.5 - https://phabricator.wikimedia.org/T416670#11591574 (10elukey) [14:19:20] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11591581 (10Ladsgroup) >>! In T414805#11589974, @Tacsipacsi wrote: > #mediaviewer got broken by this. ☹ For example, https://commons.... [14:20:16] federico3: no idea, I was not involved in the elaboration of the patch. 
Per the task that got reproduced and the fix confirmed to get rid of the error [14:20:25] they skip the + operation when the variable is null [14:20:37] (03PS1) 10BBlack: Update bblack ssh keys, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1237497 [14:20:37] (03PS1) 10BBlack: Update bblack ssh keys, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1237498 [14:20:59] (03PS1) 10Muehlenhoff: Remove HPSA RAID support [puppet] - 10https://gerrit.wikimedia.org/r/1237499 [14:22:05] (03CR) 10BBlack: [C:03+2] Update bblack ssh keys, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/1237497 (owner: 10BBlack) [14:22:15] that is due to a change in the legacy Vector skin [14:22:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237492 (owner: 10Muehlenhoff) [14:22:48] I'm not finding related unit tests [14:24:06] 🤷 :] [14:24:40] indeed, that code is a couple of decades old probably [14:25:42] the fix is fine, that is guaranteed, that got reviewed by like 5 very senior people [14:26:07] I think the question is more about what could go wrong on wikikube and whether there is bandwidth to deal with potential issues that could arise [14:27:03] (03Abandoned) 10Dzahn: admin/nagios/wmcs: offboard Alex (akosiaris) [puppet] - 10https://gerrit.wikimedia.org/r/1235066 (owner: 10Dzahn) [14:27:12] (I am just guessing, I am not familiar with that emergency procedure) [14:27:55] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11591625 (10Ladsgroup) This image is in a standard size and passes through our rate limit: https://upload.wikimedia.org/wikipedia/com... 
[14:29:28] turns out code for building personal urls when logged in has 0 test coverage https://doc.wikimedia.org/cover/mediawiki-core/includes/Skin/SkinTemplate.php.html#424 ;) [14:32:13] (03PS1) 10Bking: opensearch-test: Use latest image, renew cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237501 (https://phabricator.wikimedia.org/T415699) [14:32:41] (03PS1) 10Elukey: confluent::kafka::common: allow using Kafka 1.1 with Openjdk 17 [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) [14:33:10] (03CR) 10CI reject: [V:04-1] confluent::kafka::common: allow using Kafka 1.1 with Openjdk 17 [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [14:33:54] (03PS2) 10Elukey: confluent::kafka::common: allow using Kafka 1.1 with Openjdk 17 [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) [14:34:10] (03CR) 10Bking: [C:03+2] opensearch-test: Use latest image, renew cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237501 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [14:34:24] (03CR) 10CI reject: [V:04-1] confluent::kafka::common: allow using Kafka 1.1 with Openjdk 17 [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [14:34:24] (03CR) 10Bking: [C:03+2] "self-merging, as this is a test environment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237501 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [14:34:39] hashar: yet it has been deployed and tested manually? 
If so I think we can go ahead with the backport [14:34:39] (03PS3) 10Elukey: confluent::kafka::common: allow using Kafka 1.1 with Openjdk 17 [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) [14:35:06] federico3: yeah they reproduced it on their local machine ;) [14:35:09] (03CR) 10CI reject: [V:04-1] confluent::kafka::common: allow using Kafka 1.1 with Openjdk 17 [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [14:35:12] I'll handle it, thank you federico3 ! [14:35:14] (03CR) 10Dzahn: [C:03+1] gitlab: move mid-day backup out of maintenance window [puppet] - 10https://gerrit.wikimedia.org/r/1237475 (https://phabricator.wikimedia.org/T416687) (owner: 10Jelto) [14:35:19] (03Abandoned) 10Bking: opensearch-cluster: Replace reload certificates API call with hot reload setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218834 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [14:35:31] I'll do the usual after deployment routine of checking the dashboards [14:35:34] thanks! [14:35:59] (03Merged) 10jenkins-bot: opensearch-test: Use latest image, renew cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237501 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [14:36:04] federico3: oh i forgot, may you please Code-Review +1 the hotfix at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1237489 ? Thx! 
[14:36:07] (03PS4) 10Elukey: confluent::kafka::common: allow using Kafka 1.1 with Openjdk 17 [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) [14:36:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1237473 (https://phabricator.wikimedia.org/T416565) (owner: 10Slyngshede) [14:36:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1223184 (owner: 10Muehlenhoff) [14:36:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237493 (owner: 10Muehlenhoff) [14:37:52] (03PS1) 10Ladsgroup: changeprop-jobqueue: Remove thumbnail render job concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237504 (https://phabricator.wikimedia.org/T415282) [14:38:55] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [14:39:04] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [14:39:05] 06SRE: (Potential) Pod Distribution Tracking and Rebalancing Strategy - https://phabricator.wikimedia.org/T413070#11591672 (10Aklapper) [14:39:20] (03CR) 10Federico Ceratto: [C:03+1] "This is a backport of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1237272 which I'm told was tested manually. 
Currently that code se" [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237489 (https://phabricator.wikimedia.org/T416619) (owner: 10Hashar) [14:39:27] (03CR) 10BBlack: [C:03+2] Update bblack ssh keys, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1237498 (owner: 10BBlack) [14:39:44] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [14:39:54] @hashar replied :) [14:39:55] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [14:40:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237494 (owner: 10Muehlenhoff) [14:40:59] federico3: awesome, and I have mentioned to devs on the task that adding some unit tests to that function would surely help in the future ( https://phabricator.wikimedia.org/T416619#11591685 ). That was a good point. [14:41:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237489 (https://phabricator.wikimedia.org/T416619) (owner: 10Hashar) [14:42:15] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-test: apply [14:42:22] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [14:44:05] (03PS5) 10Elukey: confluent::kafka::common: allow using Kafka 1.1 with Openjdk 17 [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) [14:44:05] (03PS1) 10Elukey: profile::kafka::broker: allow to force openjdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/1237507 (https://phabricator.wikimedia.org/T416674) [14:45:28] (03Merged) 10jenkins-bot: TypeError: Unsupported operand types: array + null [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1237489 (https://phabricator.wikimedia.org/T416619) (owner: 10Hashar) [14:45:57] (03PS1) 10Elukey: 
role::kafka::test: force JDK7 and apply missing inter_broker_protocol_version [puppet] - 10https://gerrit.wikimedia.org/r/1237508 (https://phabricator.wikimedia.org/T416674) [14:46:01] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1237489|TypeError: Unsupported operand types: array + null (T416619)]] [14:46:04] T416619: CentralAuth on ruwiki: TypeError: Unsupported operand types: array + null - https://phabricator.wikimedia.org/T416619 [14:46:17] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [14:46:47] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237507 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [14:46:54] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237508 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [14:47:32] 10ops-eqiad, 06SRE, 06DC-Ops: eno1 on wikikube-worker1062:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T416635#11591705 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Replaced Cable [14:48:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11591709 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [14:48:26] (03PS1) 10BBlack: Switch bblack ssh key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/1237509 [14:49:49] (03PS2) 10Elukey: role::kafka::test: force JDK7 and apply missing inter_broker_protocol_version [puppet] - 10https://gerrit.wikimedia.org/r/1237508 (https://phabricator.wikimedia.org/T416674) [14:50:03] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1237508 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [14:50:28] !log hashar@deploy2002 hashar: Backport for [[gerrit:1237489|TypeError: Unsupported operand types: array + null (T416619)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:50:49] (03CR) 10Elukey: "I'll try to test this in pontoon, or maybe in deployment-prep, before merging :)" [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [14:52:33] I am checking ruwiki on debug k8s [14:53:22] !log hashar@deploy2002 hashar: Continuing with sync [14:53:28] * hashar syncs [14:57:24] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1237489|TypeError: Unsupported operand types: array + null (T416619)]] (duration: 11m 23s) [14:57:27] T416619: CentralAuth on ruwiki: TypeError: Unsupported operand types: array + null - https://phabricator.wikimedia.org/T416619 [14:57:38] (03CR) 10Marostegui: [C:03+1] "Any reason why this wasn't merged on Wed?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1236726 (https://phabricator.wikimedia.org/T383674) (owner: 10Federico Ceratto) [14:58:35] federico3: change deployed, I am monitoring the logs / dashboards etc. Thx! [14:59:48] 06SRE, 10LDAP-Access-Requests: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358#11591728 (10Dzahn) I think the misunderstanding here is that the existing developer account / LDAP user is called "jacobthwaites". That exists and is uidNumber: 101605. Uplo... [15:02:21] (03PS1) 10Ladsgroup: MediaViewer: Adjust bucket sizes with the new thumb standard sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237517 (https://phabricator.wikimedia.org/T412971) [15:02:45] (03PS1) 10Dzahn: admin: add Jacon Thwaites to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/1237518 (https://phabricator.wikimedia.org/T416358) [15:04:03] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358#11591750 (10Dzahn) [15:05:18] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-product-users, airflow-analytics-product-admins for akhatun - https://phabricator.wikimedia.org/T416703 (10mpopov) 03NEW [15:05:58] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-product-users, airflow-analytics-product-admins for akhatun - https://phabricator.wikimedia.org/T416703#11591772 (10mpopov) As the approving party for both groups (and the person requesting this access), I approve @AKhatun_WMF's membership. 
[15:07:25] (03CR) 10JavierMonton: component: mediawiki.page_html_content_change.dev0 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237451 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [15:09:17] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:29] 06SRE, 06Infrastructure-Foundations, 10netops: cr2-codfw alarm: FPC 5 power is unstable - https://phabricator.wikimedia.org/T416691#11591784 (10cmooney) >>! In T416691#11591499, @ayounsi wrote: > If we only do `request chassis fpc offline slot 5` it will come back up automatically. Yeah I seen that before a... [15:10:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [15:18:04] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11591828 (10Ladsgroup) I‌ think this should fix it: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1237517 but I‌ nee... 
[15:22:47] 06SRE, 06collaboration-services, 10Observability-Logging, 06Security-Team, 10Wikimedia-Logstash: Grant sbassett and aranyap expanded logstash access - https://phabricator.wikimedia.org/T416501#11591852 (10Dzahn) [15:23:30] 06SRE, 06collaboration-services, 10Observability-Logging, 06Security-Team, 10Wikimedia-Logstash: Grant sbassett and aranyap expanded logstash access - https://phabricator.wikimedia.org/T416501#11591860 (10Dzahn) Hi, collaboration-services is not particularly related to logstash. I added Observability tea... [15:23:50] (03CR) 10Brouberol: "Could/should we enabled `force_jdk17` to true for kafka-test, to get a more meaningful PCC output?" [puppet] - 10https://gerrit.wikimedia.org/r/1237507 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [15:24:22] (03CR) 10Brouberol: "Nevermind: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237508" [puppet] - 10https://gerrit.wikimedia.org/r/1237507 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [15:24:48] (03CR) 10Brouberol: [C:03+1] profile::kafka::broker: allow to force openjdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/1237507 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [15:24:54] (03CR) 10Brouberol: [C:03+1] role::kafka::test: force JDK7 and apply missing inter_broker_protocol_version [puppet] - 10https://gerrit.wikimedia.org/r/1237508 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [15:25:33] (03PS1) 10Dzahn: admin_ng: add status.wikimedia.org to miscweb TLS extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237524 (https://phabricator.wikimedia.org/T414098) [15:26:20] (03CR) 10Brouberol: [C:03+1] "This looks good. 
As PCC does not contain the actual file diff, I'd be interested in seeing the change in pontoon if you get to test it the" [puppet] - 10https://gerrit.wikimedia.org/r/1237502 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [15:29:13] 06SRE, 06Infrastructure-Foundations: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707 (10MoritzMuehlenhoff) 03NEW [15:29:15] 06SRE, 06Infrastructure-Foundations: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#11591890 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:29:17] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:33:00] 06SRE, 10SRE-Access-Requests: Requesting access to deployment, analytics-privatedata-users for ASanford-WMF - https://phabricator.wikimedia.org/T416710 (10ASanford-WMF) 03NEW [15:34:17] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:25] 06SRE, 10SRE-Access-Requests: Requesting access to deployment, analytics-privatedata-users for ASanford-WMF - https://phabricator.wikimedia.org/T416710#11591957 (10Rsilvola) Approving as @ASanford-WMF 's manager. [15:39:32] (03CR) 10Hnowlan: [C:03+1] "Thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237524 (https://phabricator.wikimedia.org/T414098) (owner: 10Dzahn) [15:39:45] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11591967 (10cmooney) 05Open→03Resolved I'm gonna close this one for now. 
We have not seen a repeat of this since we have adjusted the config to deal with the ARP resolution bug, t... [15:40:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1237507 (https://phabricator.wikimedia.org/T416674) (owner: 10Elukey) [15:40:59] (03CR) 10JMeybohm: "ah, sneaky typo!" [puppet] - 10https://gerrit.wikimedia.org/r/1237277 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [15:41:08] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia SR-Linux ARP resolution bug on v24.10.x+ - https://phabricator.wikimedia.org/T409178#11591971 (10cmooney) 05Open→03Resolved Closing this. The work-around is working well and we will upgrade the OS version in eqiad over the coming months. [15:43:46] federico3: I have deployed the hotfix, the error is gone and all metrics/dashboards etc looks good. Thank you! [15:44:17] thanks! [15:45:30] (03CR) 10Simon04: [C:03+1] MediaViewer: Adjust bucket sizes with the new thumb standard sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237517 (https://phabricator.wikimedia.org/T412971) (owner: 10Ladsgroup) [15:46:06] 06SRE, 06Infrastructure-Foundations: Nokia L3 bugs [Oct 2025] - https://phabricator.wikimedia.org/T409286#11591995 (10cmooney) 05Open→03Resolved Closing this task. We have a work-around for the ARP issue and the DHCP issue is not affecting us so our migrations have been done. We still need to move eq... 
[15:47:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11592001 (10cmooney) [15:54:29] (03PS1) 10Bking: opensearch-semantic-search: update to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237526 (https://phabricator.wikimedia.org/T415699) [15:56:12] 06SRE, 10Charts, 07Kubernetes: Kserve helm chart - https://phabricator.wikimedia.org/T416580#11592054 (10MLechvien-WMF) a:03elukey @elukey assigning this to you as you're noted as reviewer on https://wikitech.wikimedia.org/wiki/Helm/Upstream_Charts/kserve [15:57:08] 06SRE, 10Charts, 07Kubernetes: Kserve helm chart - https://phabricator.wikimedia.org/T416580#11592056 (10JMeybohm) We have a rough documentation about our policy and process around adopting upstream helm charts which can be found here: https://wikitech.wikimedia.org/wiki/Kubernetes/Upstream_Helm_charts_polic... [15:57:12] (03CR) 10Herron: [C:03+1] kafkatee::webrequest::ops: Remove obsolete check [puppet] - 10https://gerrit.wikimedia.org/r/1237494 (owner: 10Muehlenhoff) [16:00:31] 06SRE, 10Charts, 07Kubernetes: Kserve helm chart - https://phabricator.wikimedia.org/T416580#11592061 (10elukey) a:05elukey→03DPogorzelski-WMF @MLechvien-WMF i am happy to follow up with Dawid and the review the charts etc.., but as Janis pointed out SRE should not lead the efforts to import the chart :)... [16:02:42] (03CR) 10Bking: [C:03+2] opensearch-semantic-search: update to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237526 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [16:03:55] (03CR) 10Herron: [C:03+1] "Sounds like a plan, I don't have a strong opinion re: adjusting firewall it is up to you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [16:04:37] (03PS1) 10Muehlenhoff: Record LDAP access for mcollins [puppet] - 10https://gerrit.wikimedia.org/r/1237528 [16:06:27] 06SRE, 10Observability-Logging, 06Security-Team, 10Wikimedia-Logstash: Grant sbassett and aranyap expanded logstash access - https://phabricator.wikimedia.org/T416501#11592078 (10sbassett) >>! In T416501#11591852, @Dzahn wrote: > Hi, collaboration-services is not particularly related to logstash. I added O... [16:07:17] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for mcollins [puppet] - 10https://gerrit.wikimedia.org/r/1237528 (owner: 10Muehlenhoff) [16:20:12] 06SRE, 10SRE-Access-Requests, 10Observability-Logging, 06Security-Team, 10Wikimedia-Logstash: Grant sbassett and aranyap expanded logstash access - https://phabricator.wikimedia.org/T416501#11592118 (10Dzahn) @sbassett Let's just add the global SRE-access-requests tag (in addition to the team tag). [16:20:19] 06SRE, 10SRE-Access-Requests, 10Observability-Logging, 06Security-Team, 10Wikimedia-Logstash: Grant sbassett and aranyap expanded logstash access - https://phabricator.wikimedia.org/T416501#11592119 (10Dzahn) [16:23:42] 06SRE, 10SRE-Access-Requests, 10Observability-Logging, 06Security-Team, 10Wikimedia-Logstash: Grant sbassett and aranyap expanded logstash access - https://phabricator.wikimedia.org/T416501#11592123 (10Dzahn) >>! In T416501#11592078, @sbassett wrote: > Ok, thanks. Do you know if they have a triage clini... 
[16:25:53] (03CR) 10JMeybohm: [C:03+1] Remove the Puppet 5 CA cert from the cert bundle [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/1237476 (https://phabricator.wikimedia.org/T415255) (owner: 10Muehlenhoff) [16:27:03] (03PS1) 10Brouberol: growthbook: update startup args of frontend/backend components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237531 (https://phabricator.wikimedia.org/T416662) [16:32:55] (03CR) 10JMeybohm: [C:03+1] docker: Remove check for memory_cgroup [puppet] - 10https://gerrit.wikimedia.org/r/1223184 (owner: 10Muehlenhoff) [16:34:05] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11592156 (10Tacsipacsi) >>! In T414805#11591625, @Ladsgroup wrote: > Can you get the 429 response body? I don’t get 429 anymore. I c... [16:36:08] (03PS2) 10Brouberol: growthbook: update startup args of frontend/backend components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237531 (https://phabricator.wikimedia.org/T416662) [16:36:56] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11592158 (10Ladsgroup) I have reverted the rate limit for "medium" browser score before the weekend to reduce disruptions to people.... [16:38:08] (03CR) 10Btullis: [C:03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1237531 (https://phabricator.wikimedia.org/T416662) (owner: 10Brouberol) [16:38:17] (03CR) 10Brouberol: [C:03+2] growthbook: update startup args of frontend/backend components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237531 (https://phabricator.wikimedia.org/T416662) (owner: 10Brouberol) [16:38:37] (03CR) 10Cathal Mooney: [C:03+1] Switch bblack ssh key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/1237509 (owner: 10BBlack) [16:39:12] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, thanks. I will roll out this change Monday as it takes a little while for the automation to go through all the routers/switches." [homer/public] - 10https://gerrit.wikimedia.org/r/1237509 (owner: 10BBlack) [16:39:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:40:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:40:48] (03PS1) 10Brouberol: growthbook: fix image name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237535 (https://phabricator.wikimedia.org/T416662) [16:43:30] (03CR) 10Brouberol: [C:03+2] growthbook: fix image name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237535 (https://phabricator.wikimedia.org/T416662) (owner: 10Brouberol) [16:44:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [16:45:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [16:59:17] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [17:12:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on 
ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:16:31] 10SRE-SLO, 06Product Safety and Integrity, 06ServiceOps new, 10iPoid-Service (iPoid 1.0): IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11592312 (10MLechvien-WMF) [17:21:53] 06SRE, 06ServiceOps new, 10Continuous-Integration-Config, 06Release-Engineering-Team (Seen): operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855#11592322 (10MLechvien-WMF) p:05High→03Low Changing priority to Low. The cost of not having CI for production-ima... [17:25:07] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7003*} and A:liberica [17:25:15] 06SRE, 10SRE-Access-Requests: Requesting access to "Community Wishlist" dashboard for hmonroy - https://phabricator.wikimedia.org/T416721 (10KSiebert) 03NEW [17:28:57] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7003*} and A:liberica [17:31:36] (03PS2) 10Dzahn: zuul-web: bind mount /etc/zookeeper/zuul-tls [puppet] - 10https://gerrit.wikimedia.org/r/1237354 (https://phabricator.wikimedia.org/T395938) [17:32:36] (03PS1) 10Xcollazo: EventStreamConfig: Bump product_metrics.web_base* streams to large size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237540 (https://phabricator.wikimedia.org/T416719) [17:37:58] (03CR) 10Dzahn: [C:03+2] zuul-web: bind mount /etc/zookeeper/zuul-tls [puppet] - 10https://gerrit.wikimedia.org/r/1237354 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [17:41:24] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:43:22] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [17:44:17] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes 
(http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:45:12] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:46:46] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06Traffic: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944#11592363 (10CDobbins) 05Open→03Resolved [18:01:09] 06SRE, 06Infrastructure-Foundations, 10netops: codfw expansion: configure new Nokia switches in rows E/F - https://phabricator.wikimedia.org/T402590#11592446 (10cmooney) 05Open→03Resolved Things are working ok here now. [18:04:18] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on kubestage2004 - https://phabricator.wikimedia.org/T416726 (10ops-monitoring-bot) 03NEW [18:08:13] 06SRE, 06ServiceOps new, 10Continuous-Integration-Config, 06Release-Engineering-Team (Seen): operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855#11592468 (10bking) I agree with @MLechvien-WMF that this is a nice-to-have only, but I would also like to add that I... [18:08:39] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: WIP [18:09:12] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on zuul2001.codfw.wmnet with reason: WIP [18:12:11] 06SRE, 10Charts, 07Kubernetes: Kserve helm chart - https://phabricator.wikimedia.org/T416580#11592486 (10DPogorzelski-WMF) @JMeybohm will check and update the ticket, cheers! 
[18:13:13] 06SRE, 10SRE-Access-Requests: Requesting access to "Community Wishlist" dashboard for hmonroy - https://phabricator.wikimedia.org/T416721#11592488 (10Dzahn) Hi, could you link to the dashboard? Thanks! [18:15:02] 10ops-codfw, 06SRE, 06DC-Ops: codfw:expansion: Network devices/patch panel wiring - https://phabricator.wikimedia.org/T382219#11592492 (10cmooney) 05Resolved→03Open @papaul @ayounsi a little heads up here I ended up reverting to my original plan for the fr-tech mgmt connectivity during the migration we j... [18:20:30] (03CR) 10Phuedx: [C:03+1] EventStreamConfig: Bump product_metrics.web_base* streams to large size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237540 (https://phabricator.wikimedia.org/T416719) (owner: 10Xcollazo) [18:20:54] (03PS20) 10CDobbins: prometheus: add depooled cp* host check [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [18:27:21] (03CR) 10CDobbins: prometheus: add depooled cp* host check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [18:29:12] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8000/console" [puppet] - 10https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: 10CDobbins) [18:49:42] (03CR) 10Scott French: "Thanks, Amir!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237504 (https://phabricator.wikimedia.org/T415282) (owner: 10Ladsgroup) [18:50:39] 06SRE, 10Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779#11592604 (10Dzahn) If all the affected lists are private anyways (based on Andre's comment from 2021) - they probably don't need admins or are not used. Is there a real problem being solved by... 
[18:52:02] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:52:19] (03PS1) 10Dzahn: zuul: properly differentiate between zuul and zookeeper certs [puppet] - 10https://gerrit.wikimedia.org/r/1237543 (https://phabricator.wikimedia.org/T395938) [18:53:52] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [18:54:17] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:12] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:55:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:56:26] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:57:04] I'm here [18:57:09] same [18:57:12] looking [18:57:15] it's my fault [18:57:21] I tried to load a kubernetes dashboard for 90 days [18:57:24] I wish I was joking [18:57:26] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:57:28] sigh [18:57:45] technically not even a kubernetes dashboard, but, the "cluster overview" dashboard for the kubernetes
cluster in codfw [18:57:47] federico3: lmk if you need a hand [18:57:49] err, ok [18:58:08] bouncing thanos query should hopefully sort it [18:59:18] did it p.age everybody? [18:59:20] Also I'm not oncall, why did I get paged [18:59:27] yeah, that :D [18:59:27] !incidents [18:59:27] 7427 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [18:59:29] I believe so, I also got the page [18:59:30] think so yeah [18:59:31] it somehow went to batphone [18:59:34] maybe we didn't actually switch back from batphone post-summit [19:00:11] yeah per VO it was routed directly to batphone [19:00:24] yes same here got the page. which I recently switched to airhorn.wav and it made me jump out of my seat [19:00:31] heh ok [19:00:41] herron: crank it up to 11 :P [19:00:56] (03CR) 10Ladsgroup: "Yup, we completely shut it down. Is there anything else needed beside deploying this patch?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237504 (https://phabricator.wikimedia.org/T415282) (owner: 10Ladsgroup) [19:00:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:08:42] (03CR) 10Xcollazo: [C:03+2] EventStreamConfig: Bump product_metrics.web_base* streams to large size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237540 (https://phabricator.wikimedia.org/T416719) (owner: 10Xcollazo) [19:09:32] (03Merged) 10jenkins-bot: EventStreamConfig: Bump product_metrics.web_base* streams to large size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237540 (https://phabricator.wikimedia.org/T416719) (owner: 10Xcollazo) [19:09:56] (03PS2) 10Dzahn: zuul: properly differentiate between zuul and zookeeper certs [puppet] - 10https://gerrit.wikimedia.org/r/1237543
(https://phabricator.wikimedia.org/T395938) [19:10:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [19:15:17] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on wdqs1028 - https://phabricator.wikimedia.org/T416736 (10ops-monitoring-bot) 03NEW [19:29:17] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:35:15] hi ops folks! I see a ton of warnings in logstash about Special:HideBanners setting a cookie while also emitting a cache header. (see https://phabricator.wikimedia.org/T285210 ) [19:37:03] Do you see any risk in simply stopping caching that page? I don't see any db interaction, just config [19:39:37] https://gerrit.wikimedia.org/r/1237569 [19:45:29] ejegg: I don't see any risk -- when the CDN emits that log message, it had also already flagged that response as uncacheable [19:45:43] (03CR) 10Scott French: "Great, thanks for confirming!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1237504 (https://phabricator.wikimedia.org/T415282) (owner: 10Ladsgroup) [19:47:59] thanks for confirming, cdanis! 
[19:50:25] RESOLVED: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:07:16] (03Abandoned) 10Santiago Faci: Test Kitchen UI: Deploy v.1.1.8 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236277 (https://phabricator.wikimedia.org/T415325) (owner: 10Santiago Faci) [20:08:15] 10ops-codfw, 10ops-eqiad, 06DC-Ops, 06SRE Observability: RAM upgrade availability for Titan hosts - https://phabricator.wikimedia.org/T416741 (10herron) 03NEW [20:34:24] (03CR) 10Dzahn: [C:03+2] zuul: properly differentiate between zuul and zookeeper certs [puppet] - 10https://gerrit.wikimedia.org/r/1237543 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [20:52:13] (03CR) 10Milimetric: [C:03+2] EventStreamConfig: Bump product_metrics.web_base* streams to large size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237540 (https://phabricator.wikimedia.org/T416719) (owner: 10Xcollazo) [20:58:45] (03PS1) 10Ladsgroup: mariadb: Fix extra https:// in tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1237586 [20:59:17] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [21:01:26] (03CR) 10Ladsgroup: [C:03+2] mariadb: Fix extra https:// in tables catalog [puppet] - 10https://gerrit.wikimedia.org/r/1237586 (owner: 10Ladsgroup) [21:09:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:12:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:33] 10ops-eqiad, 06DC-Ops: C1 post move cleanup - https://phabricator.wikimedia.org/T416747 (10VRiley-WMF) 03NEW [21:29:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:35:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11593081 (10wiki_willy) a:03VRiley-WMF [21:37:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11593091 (10VRiley-WMF) Thanks, I will take a look at this. However, checking everything else, the entire move has been completed. I will work on some o... [21:44:09] 10ops-eqiad, 06DC-Ops: pay-lb1001 power supply - https://phabricator.wikimedia.org/T416749 (10VRiley-WMF) 03NEW [22:14:10] 06SRE, 06ServiceOps new, 10Continuous-Integration-Config, 06Release-Engineering-Team (Seen): operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855#11593151 (10hashar) I am pretty sure Kunal filed this task referring to the few unit tests I wrote in `integration/co... 
[22:21:25] 06SRE, 06ServiceOps new, 10Continuous-Integration-Config, 06Release-Engineering-Team (Seen): operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855#11593175 (10hashar) @bking wrote: > I was unable to get it to work in a virtualenv on my MacOS laptop. Yes MacOS sup... [22:25:24] (03PS1) 10Zabe: Configure Hadoop source for Mostcategories computations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237596 (https://phabricator.wikimedia.org/T413362) [22:26:29] (03PS1) 10Zabe: Use Hadoop for Mostcategories on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237597 (https://phabricator.wikimedia.org/T413362) [22:28:40] (03PS2) 10Zabe: Configure Hadoop source for Mostcategories computations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237596 (https://phabricator.wikimedia.org/T413362) [22:30:42] (03PS3) 10Zabe: Configure Hadoop source for Mostcategories computations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237596 (https://phabricator.wikimedia.org/T413362) [22:37:45] (03PS4) 10Zabe: Configure Hadoop source for Mostcategories computations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1237596 (https://phabricator.wikimedia.org/T413362) [23:14:41] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [23:29:17] FIRING: KubernetesCalicoDown: wikikube-worker2019.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2019.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown